Open Stack

Thu Apr 21 19:11:47 UTC 2016

Are you seeing issues only on the client side, or anything on the broker side? We were having issues with nodes not successfully reconnecting and ended up making a number of changes on the broker side to improve resiliency (upgrading to RabbitMQ 3.5.5 or higher, reducing net.ipv4.tcp_retries2 to evict failed connections faster, configuring heartbeats in RabbitMQ to detect failed clients more quickly).
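
To show why lowering net.ipv4.tcp_retries2 evicts failed connections faster, here is a rough sketch of how long Linux keeps retransmitting an unacknowledged segment before giving up. It assumes the retransmission timeout starts at 200 ms and doubles up to a 120 s cap; real values depend on the measured round-trip time, so treat the numbers as estimates. The function name is made up for illustration.

```python
def retransmit_timeout(retries, rto_min=0.2, rto_max=120.0):
    """Approximate seconds before the kernel drops a connection whose
    data is never acknowledged, for a given tcp_retries2 value.

    Assumption: the RTO starts at rto_min and doubles after every
    retransmission until it reaches rto_max.
    """
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


for retries in (15, 8, 5):
    print(retries, round(retransmit_timeout(retries), 1))
# 15 924.6
# 8 102.2
# 5 12.6
```

Under these assumptions the default of 15 leaves a dead connection in place for roughly 15 minutes, while a value of 5 drops it in under 15 seconds.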

From: akalambu at cisco.com 
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Do you recommend both or can I do away with the system timers and just keep the heartbeat? 
Ajay 

From: "Kris G. Lindgren" <klindgren at godaddy.com>
Date: Thursday, April 21, 2016 at 11:54 AM
To: Ajay Kalambur <akalambu at cisco.com>, "openstack-operators at lists.openstack.org" <openstack-operators at lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Yeah, that only fixes part of the issue. The other part is getting the OpenStack messaging code itself to figure out that the connection it's using is no longer valid. Heartbeats by themselves solved 90%+ of our issues with RabbitMQ and nodes being disconnected and never reconnecting.

___________________________________________________________________ 
Kris Lindgren 
Senior Linux Systems Engineer 
GoDaddy 

From: "Ajay Kalambur (akalambu)" <akalambu at cisco.com>
Date: Thursday, April 21, 2016 at 12:51 PM
To: "Kris G. Lindgren" <klindgren at godaddy.com>, "openstack-operators at lists.openstack.org" <openstack-operators at lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Trying that now. I had aggressive system keepalive timers before:

net.ipv4.tcp_keepalive_intvl = 10 
net.ipv4.tcp_keepalive_probes = 9 
net.ipv4.tcp_keepalive_time = 5 
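
For reference, a small sketch of what these three timers add up to. It assumes the socket has SO_KEEPALIVE enabled and the connection is idle. Keepalive probes do not cover the case where unacknowledged data is in flight, which is governed by tcp_retries2 instead. The function name is hypothetical.

```python
def keepalive_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Approximate worst-case seconds for TCP keepalive to declare an idle peer dead.

    The first probe goes out after keepalive_time seconds of idleness,
    then one probe every keepalive_intvl seconds; the connection is
    dropped once keepalive_probes probes have gone unanswered.
    """
    return keepalive_time + keepalive_intvl * keepalive_probes


# Values quoted in the message above.
print(keepalive_detection_seconds(5, 10, 9))  # 95
```

So even with these settings an idle dead peer takes about a minute and a half to be noticed at the TCP level.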

From: "Kris G. Lindgren" <klindgren at godaddy.com>
Date: Thursday, April 21, 2016 at 11:50 AM
To: Ajay Kalambur <akalambu at cisco.com>, "openstack-operators at lists.openstack.org" <openstack-operators at lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Do you have rabbitmq/oslo messaging heartbeats enabled? 

If you aren't using heartbeats, it will take a long time for the nova-compute agent to figure out that it's actually no longer attached to anything. Heartbeat does periodic checks against RabbitMQ and will catch this state and reconnect.

___________________________________________________________________ 
Kris Lindgren 
Senior Linux Systems Engineer 
GoDaddy 
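
As an illustration of the heartbeat settings discussed above, here is a sketch that parses a sample nova.conf fragment and reports how often the connection would be checked. It assumes the oslo.messaging options heartbeat_timeout_threshold and heartbeat_rate in the [oslo_messaging_rabbit] section; the values 60 and 2 are examples, not taken from this thread, and the helper name is made up. A threshold of 0 disables heartbeats, so check the default for your release.

```python
import configparser

# Example fragment only; the values are illustrative assumptions.
SAMPLE_CONF = """
[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 60
heartbeat_rate = 2
"""


def heartbeat_summary(conf_text):
    """Return whether heartbeats are enabled and the approximate check interval."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    section = parser["oslo_messaging_rabbit"]
    threshold = section.getint("heartbeat_timeout_threshold", fallback=0)
    rate = section.getint("heartbeat_rate", fallback=2)
    if threshold <= 0:
        # A threshold of 0 means the heartbeat is disabled.
        return {"enabled": False, "threshold": threshold, "check_interval": None}
    # The heartbeat is checked `rate` times per threshold window.
    return {"enabled": True, "threshold": threshold,
            "check_interval": threshold / rate}


print(heartbeat_summary(SAMPLE_CONF))
```

With these example values a dead broker connection should be noticed within about 60 seconds, with a check roughly every 30 seconds.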

From: "Ajay Kalambur (akalambu)" <akalambu at cisco.com>
Date: Thursday, April 21, 2016 at 11:43 AM
To: "openstack-operators at lists.openstack.org" <openstack-operators at lists.openstack.org>
Subject: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Hi 
On Kilo, I am seeing that if I bring down one controller node, some computes sometimes report down forever.
I need to restart the compute service on the compute node to recover. It looks like oslo is not reconnecting in nova-compute.
Here is the trace from nova-compute:
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=self.retry) 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     timeout=timeout, retry=retry) 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=retry) 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     result = self._waiter.wait(msg_id, timeout) 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     message = self.waiters.get(msg_id, timeout=timeout) 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     'to message ID %s' % msg_id) 
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID e064b5f6c8244818afdc5e91fff8ebf1 

Any thoughts? I am at stable/kilo for oslo.

Ajay 

           _______________________________________________
OpenStack-operators mailing list
OpenStack-operators at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators

