Open Stack

Mon Mar 24 17:31:22 UTC 2014

On 03/24/2014 10:59 AM, Dan Smith wrote:
>> Any ideas on what might be going on would be appreciated.
>
> This looks like something that should be filed as a bug. I don't have
> any ideas off hand, bit I will note that the reconnection logic works
> fine for us in the upstream upgrade tests. That scenario includes
> starting up a full stack, then taking down everything except compute and
> rebuilding a new one on master. After the several minutes it takes to
> upgrade the controller services, the compute host reconnects and is
> ready to go before tempest runs.
>
> I suspect your case wedged itself somehow other than that, which
> definitely looks nasty and is worth tracking in a bug.

We've got an HA controller setup using pacemaker and were stress-testing 
it by doing multiple controlled switchovers while doing other activity. 
  Generally this works okay, but last night we ran into this problem.

I'll file a bug, but in the meantime I've found something that looks a 
bit suspicious.  The  "Unexpected exception occurred 61 time(s)... 
retrying." message comes from forever_retry_uncaught_exceptions() in 
excutils.py.  It looks like we're raising

RecoverableConnectionError: connection already closed

down in /usr/lib64/python2.7/site-packages/amqp/abstract_channel.py, but 
nothing handles it.

It looks like the most likely place that should be handling it is 
nova.openstack.common.rpc.impl_kombu.Connection.ensure().

In the current oslo.messaging code the ensure() routine explicitly 
handles connection errors (which RecoverableConnectionError is) and 
socket timeouts--the ensure() routine in Havana doesn't do this.

Chris

Open Stack

[openstack-dev] [nova] nova-compute not re-establishing connectivity after controller switchover

OpenStack

Community

Documentation

Branding & Legal