[Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Ajay Kalambur (akalambu) akalambu at cisco.com
Thu Apr 21 18:51:48 UTC 2016


Trying that now. I had aggressive system TCP keepalive timers set before:

# Seconds between unanswered keepalive probes
net.ipv4.tcp_keepalive_intvl = 10
# Unanswered probes before the kernel drops the connection
net.ipv4.tcp_keepalive_probes = 9
# Idle seconds before the first keepalive probe is sent
net.ipv4.tcp_keepalive_time = 5
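
With those values the kernel declares a dead peer after roughly

tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl = 5 + 9 * 10 = 95 seconds

of silence. Worth noting that this only covers the TCP layer; the oslo.messaging heartbeat discussed below works at the AMQP layer.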


From: "Kris G. Lindgren" <klindgren at godaddy.com<mailto:klindgren at godaddy.com>>
Date: Thursday, April 21, 2016 at 11:50 AM
To: Ajay Kalambur <akalambu at cisco.com>, openstack-operators at lists.openstack.org
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo

Do you have RabbitMQ/oslo.messaging heartbeats enabled?

If you aren't using heartbeats, it will take a long time for the nova-compute agent to figure out that it is actually no longer attached to anything. The heartbeat runs periodic checks against RabbitMQ, so it will catch this state and reconnect.
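
If they aren't on, it is a nova.conf change on the compute nodes (nova-compute needs a restart to pick it up). A minimal sketch, assuming the Kilo-era option names under [oslo_messaging_rabbit]; the values are illustrative, and since some Kilo-era builds ship with heartbeating disabled it is worth setting the threshold explicitly:

[oslo_messaging_rabbit]
# Declare the RabbitMQ connection dead after this many seconds without
# a heartbeat; 0 disables heartbeating entirely.
heartbeat_timeout_threshold = 60
# Check for heartbeats this many times per timeout window, i.e. every
# 30 seconds with the values above.
heartbeat_rate = 2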

___________________________________________________________________
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy

From: "Ajay Kalambur (akalambu)" <akalambu at cisco.com<mailto:akalambu at cisco.com>>
Date: Thursday, April 21, 2016 at 11:43 AM
To: "openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>" <openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>>
Subject: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo


Hi
I am seeing on Kilo that if I bring down one controller node, some computes sometimes report down forever.
I need to restart the compute service on the compute node to recover. It looks like oslo.messaging is not reconnecting in nova-compute.
Here is the trace from nova-compute:
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=self.retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     timeout=timeout, retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     result = self._waiter.wait(msg_id, timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     message = self.waiters.get(msg_id, timeout=timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db     'to message ID %s' % msg_id)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID e064b5f6c8244818afdc5e91fff8ebf1


Any thoughts? I am on stable/kilo for oslo.

Ajay
