[Openstack-operators] [oslo]nova compute reconnection Issue Kilo
Ajay Kalambur (akalambu)
akalambu at cisco.com
Thu Apr 21 22:31:54 UTC 2016
I see something similar even with heartbeat enabled; it looks like the reconnection attempt fails:
2016-04-21 15:27:01.294 6 DEBUG nova.openstack.common.loopingcall [req-9c9785ed-2598-4b95-a40c-307f8d7e8416 - - - - -] Dynamic looping call <bound method Service.periodic_tasks of <nova.service.Service object at 0x46a1c50>> sleeping for 60.00 seconds _inner /usr/lib/python2.7/site-packages/nova/openstack/common/loopingcall.py:132
2016-04-21 15:28:01.294 6 DEBUG nova.openstack.common.periodic_task [req-9c9785ed-2598-4b95-a40c-307f8d7e8416 - - - - -] Running periodic task ComputeManager._instance_usage_audit run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:219
2016-04-21 15:28:01.295 6 DEBUG nova.openstack.common.periodic_task [req-9c9785ed-2598-4b95-a40c-307f8d7e8416 - - - - -] Running periodic task ComputeManager._poll_rebooting_instances run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:219
2016-04-21 15:28:01.295 6 DEBUG nova.openstack.common.periodic_task [req-9c9785ed-2598-4b95-a40c-307f8d7e8416 - - - - -] Running periodic task ComputeManager._poll_volume_usage run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:219
2016-04-21 15:28:01.295 6 DEBUG nova.openstack.common.periodic_task [req-9c9785ed-2598-4b95-a40c-307f8d7e8416 - - - - -] Running periodic task ComputeManager._cleanup_running_deleted_instances run_periodic_tasks /usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py:219
2016-04-21 15:28:48.421 6 ERROR oslo_messaging._drivers.impl_rabbit [-] Declaring queue failed with (Socket closed), retrying
2016-04-21 15:28:48.422 6 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Socket closed
2016-04-21 15:28:48.422 6 ERROR oslo_messaging._drivers.amqpdriver [-] Failed to process incoming message, retrying...
2016-04-21 15:28:48.422 6 TRACE oslo_messaging._drivers.amqpdriver Traceback (most recent call last):
2016-04-21 15:28:48.422 6 TRACE oslo_messaging._drivers.amqpdriver File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 228, in poll
2016-04-21 15:28:48.422 6 TRACE oslo_messaging._drivers.amqpdriver self.conn.consume(limit=1)
2016-04-21 15:28:48.422 6 TRACE oslo_messaging._drivers.amqpdriver File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1194, in consume
2016-04-21 15:28:48.422 6 TRACE oslo_messaging._drivers.amqpdriver six.next(it)
2016-04-21 15:28:48.422 6 TRACE oslo_messaging._drivers.amqpdriver File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 1105, in iterconsume
2016-04-21 15:28:48.422 6 TRACE oslo_messaging._drivers.amqpdriver error_callback=_error_callback)
2016-04-21 15:28:48.422 6 TRACE oslo_messaging._drivers.amqpdriver File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 885, in ensure
2016-04-21 15:28:48.422 6 TRACE oslo_messaging._drivers.amqpdriver 'retry': retry}
2016-04-21 15:28:48.422 6 TRACE oslo_messaging._drivers.amqpdriver TypeError: %d format: a number is required, not NoneType
2016-04-21 15:28:48.422 6 TRACE oslo_messaging._drivers.amqpdriver
2016-04-21 15:28:48.430 6 ERROR oslo_messaging._drivers.impl_rabbit [-] AMQP server on 10.23.221.110:5672 is unreachable: connection already closed. Trying again in 1 seconds.
2016-04-21 15:29:01.302 6 ERROR nova.openstack.common.periodic_task [req-9c9785ed-2598-4b95-a40c-307f8d7e8416 - - - - -] Error during ComputeManager._cleanup_running_deleted_instances: Timed out waiting for a reply to message ID c0c46bd3ebfb4441981617e089c5a18d
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task Traceback (most recent call last):
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/openstack/common/periodic_task.py", line 224, in run_periodic_tasks
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task task(self, context)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6410, in _cleanup_running_deleted_instances
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task for instance in self._running_deleted_instances(context):
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6464, in _running_deleted_instances
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task instances = self._get_instances_on_driver(context, filters)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 796, in _get_instances_on_driver
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task context, filters, use_slave=True)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/objects/base.py", line 161, in wrapper
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task args, kwargs)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 335, in object_class_action
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task objver=objver, args=args, kwargs=kwargs)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task retry=self.retry)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task timeout=timeout, retry=retry)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 381, in send
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task retry=retry)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 370, in _send
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task result = self._waiter.wait(msg_id, timeout)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 274, in wait
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task message = self.waiters.get(msg_id, timeout=timeout)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 180, in get
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task 'to message ID %s' % msg_id)
2016-04-21 15:29:01.302 6 TRACE nova.openstack.common.periodic_task MessagingTimeout: Timed out waiting for a reply to message ID c0c46bd3ebfb4441981617e089c5a18d
From: Ajay Kalambur <akalambu at cisco.com>
Date: Thursday, April 21, 2016 at 12:11 PM
To: "Kris G. Lindgren" <klindgren at godaddy.com>, "openstack-operators at lists.openstack.org" <openstack-operators at lists.openstack.org>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo
Thanks, Kris, that’s good information. I will try out your suggestions.
Ajay
From: "Kris G. Lindgren" <klindgren at godaddy.com<mailto:klindgren at godaddy.com>>
Date: Thursday, April 21, 2016 at 12:08 PM
To: Ajay Kalambur <akalambu at cisco.com<mailto:akalambu at cisco.com>>, "openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>" <openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo
We just use heartbeat, but from what I recall other people have had good luck with both set. I would keep them if they are already set, maybe just dial down how aggressive they are. One thing I should mention is that if you have a large number of RPC workers, enabling heartbeats will increase CPU consumption by about 1-2% per worker (in our experience), since it is now doing something with rabbitmq every few seconds. This can also increase load on the rabbitmq side. For us, having a stable rabbit environment is well worth the tradeoff.
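As an illustration of what "dialing down" the system timers could look like (purely hypothetical values, not a recommendation; the settings actually in use are quoted further down in this thread):

net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 5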
___________________________________________________________________
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy
From: "Ajay Kalambur (akalambu)" <akalambu at cisco.com<mailto:akalambu at cisco.com>>
Date: Thursday, April 21, 2016 at 1:04 PM
To: "Kris G. Lindgren" <klindgren at godaddy.com<mailto:klindgren at godaddy.com>>, "openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>" <openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo
Do you recommend both, or can I do away with the system timers and just keep the heartbeat?
Ajay
From: "Kris G. Lindgren" <klindgren at godaddy.com<mailto:klindgren at godaddy.com>>
Date: Thursday, April 21, 2016 at 11:54 AM
To: Ajay Kalambur <akalambu at cisco.com<mailto:akalambu at cisco.com>>, "openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>" <openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo
Yea, that only fixes part of the issue. The other part is getting the OpenStack messaging code itself to figure out that the connection it's using is no longer valid. Heartbeats by themselves solved 90%+ of our issues with rabbitmq and nodes being disconnected and never reconnecting.
___________________________________________________________________
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy
From: "Ajay Kalambur (akalambu)" <akalambu at cisco.com<mailto:akalambu at cisco.com>>
Date: Thursday, April 21, 2016 at 12:51 PM
To: "Kris G. Lindgren" <klindgren at godaddy.com<mailto:klindgren at godaddy.com>>, "openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>" <openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo
Trying that now. I had aggressive system keepalive timers set before:
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 5
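For reference, with those values a dead peer is detected after roughly tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl = 5 + 9 * 10 = 95 seconds of silence, and they only affect sockets opened after the change is applied. A minimal sketch of persisting them across reboots (the file name is hypothetical; load it with "sysctl -p <file>" or "sysctl --system"):

# /etc/sysctl.d/99-rabbit-keepalive.conf (hypothetical file name)
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 5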
From: "Kris G. Lindgren" <klindgren at godaddy.com<mailto:klindgren at godaddy.com>>
Date: Thursday, April 21, 2016 at 11:50 AM
To: Ajay Kalambur <akalambu at cisco.com<mailto:akalambu at cisco.com>>, "openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>" <openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>>
Subject: Re: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo
Do you have rabbitmq/oslo messaging heartbeats enabled?
If you aren't using heartbeats, it will take a long time for the nova-compute agent to figure out that it's actually no longer attached to anything. Heartbeat does periodic checks against rabbitmq and will catch this state and reconnect.
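For reference, a minimal sketch of enabling heartbeats in nova.conf (section and option names are those of the oslo.messaging rabbit driver in Kilo; the values shown are illustrative, not tuned recommendations):

[oslo_messaging_rabbit]
# seconds without a heartbeat before the connection is considered dead; 0 disables heartbeats
heartbeat_timeout_threshold = 60
# how many times per timeout interval the heartbeat is checked
heartbeat_rate = 2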
___________________________________________________________________
Kris Lindgren
Senior Linux Systems Engineer
GoDaddy
From: "Ajay Kalambur (akalambu)" <akalambu at cisco.com<mailto:akalambu at cisco.com>>
Date: Thursday, April 21, 2016 at 11:43 AM
To: "openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>" <openstack-operators at lists.openstack.org<mailto:openstack-operators at lists.openstack.org>>
Subject: [Openstack-operators] [oslo]nova compute reconnection Issue Kilo
Hi
I am seeing on Kilo that if I bring down one controller node, some computes sometimes report down forever.
I need to restart the compute service on the compute node to recover. It looks like oslo is not reconnecting in nova-compute.
Here is the trace from nova-compute:
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 156, in call
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db retry=self.retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db timeout=timeout, retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db retry=retry)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 339, in _send
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db result = self._waiter.wait(msg_id, timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 243, in wait
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db message = self.waiters.get(msg_id, timeout=timeout)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 149, in get
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db 'to message ID %s' % msg_id)
2016-04-19 20:25:39.090 6 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID e064b5f6c8244818afdc5e91fff8ebf1
Any thoughts? I am at stable/kilo for oslo.
Ajay