[openstack-dev] [nova] nova-compute not re-establishing connectivity after controller switchover
Chris Friesen
chris.friesen at windriver.com
Mon Mar 24 17:31:22 UTC 2014
On 03/24/2014 10:59 AM, Dan Smith wrote:
>> Any ideas on what might be going on would be appreciated.
>
> This looks like something that should be filed as a bug. I don't have
> any ideas off hand, but I will note that the reconnection logic works
> fine for us in the upstream upgrade tests. That scenario includes
> starting up a full stack, then taking down everything except compute and
> rebuilding a new one on master. After the several minutes it takes to
> upgrade the controller services, the compute host reconnects and is
> ready to go before tempest runs.
>
> I suspect your case wedged itself somehow other than that, which
> definitely looks nasty and is worth tracking in a bug.

We've got an HA controller setup using pacemaker and were stress-testing
it by doing multiple controlled switchovers while doing other activity.
Generally this works okay, but last night we ran into this problem.

I'll file a bug, but in the meantime I've found something that looks a
bit suspicious. The "Unexpected exception occurred 61 time(s)...
retrying." message comes from forever_retry_uncaught_exceptions() in
excutils.py. It looks like we're raising

    RecoverableConnectionError: connection already closed

down in /usr/lib64/python2.7/site-packages/amqp/abstract_channel.py, but
nothing handles it.

It looks like the most likely place that should be handling it is
nova.openstack.common.rpc.impl_kombu.Connection.ensure().

In the current oslo.messaging code the ensure() routine explicitly
handles connection errors (which RecoverableConnectionError is) and
socket timeouts--the ensure() routine in Havana doesn't do this.
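
To illustrate the kind of handling described above, here is a minimal
sketch of an ensure()-style wrapper that catches connection errors and
socket timeouts, reconnects, and retries. It is not the actual
oslo.messaging or Havana code: the RecoverableConnectionError class is a
local stand-in for the amqp library's exception of the same name, and the
ensure() signature, the reconnect callback, and max_retries are all
assumptions made for the example.

```python
import socket


class RecoverableConnectionError(Exception):
    """Hypothetical stand-in for the amqp library's exception of the same name."""


def ensure(method, reconnect, max_retries=3):
    """Call method(); on a connection error or socket timeout, reconnect and retry.

    Illustrative only: the real ensure() routines take different arguments.
    """
    attempt = 0
    while True:
        try:
            return method()
        except (RecoverableConnectionError, socket.timeout):
            attempt += 1
            if attempt > max_retries:
                # Give up and let the caller see the original error.
                raise
            # Re-establish the connection/channel before trying again.
            reconnect()


# Simulated usage: the first two publishes fail as if the broker went away.
calls = {"publish": 0, "reconnect": 0}


def flaky_publish():
    calls["publish"] += 1
    if calls["publish"] < 3:
        raise RecoverableConnectionError("connection already closed")
    return "sent"


def fake_reconnect():
    calls["reconnect"] += 1


result = ensure(flaky_publish, fake_reconnect)
```

Without the except clause, the exception would propagate out of the
wrapper on the first failure and no reconnect would be attempted, which
would match the behaviour seen here, where the outer retry loop keeps
logging the same error.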
Chris