[openstack-dev] [nova] nova-compute not re-establishing connectivity after controller switchover
chris.friesen at windriver.com
Mon Mar 24 17:31:22 UTC 2014
On 03/24/2014 10:59 AM, Dan Smith wrote:
>> Any ideas on what might be going on would be appreciated.
> This looks like something that should be filed as a bug. I don't have
> any ideas off hand, but I will note that the reconnection logic works
> fine for us in the upstream upgrade tests. That scenario includes
> starting up a full stack, then taking down everything except compute and
> rebuilding a new one on master. After the several minutes it takes to
> upgrade the controller services, the compute host reconnects and is
> ready to go before tempest runs.
> I suspect your case wedged itself somehow other than that, which
> definitely looks nasty and is worth tracking in a bug.
We've got an HA controller setup using pacemaker and were stress-testing
it by doing multiple controlled switchovers while doing other activity.
Generally this works okay, but last night we ran into this problem.
I'll file a bug, but in the meantime I've found something that looks a
bit suspicious. The "Unexpected exception occurred 61 time(s)...
retrying." message comes from forever_retry_uncaught_exceptions() in
excutils.py. It looks like we're raising
RecoverableConnectionError: connection already closed
down in /usr/lib64/python2.7/site-packages/amqp/abstract_channel.py, but
nothing handles it.
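
To illustrate why an unhandled connection error shows up as an ever-growing retry count, here is a minimal sketch of a retry-forever decorator. It is not the actual excutils.py code: retry_forever, make_flaky_consumer and the local RecoverableConnectionError class are hypothetical stand-ins written for this example, and the real decorator may log and sleep differently.

```python
import functools
import logging
import time

LOG = logging.getLogger(__name__)


class RecoverableConnectionError(Exception):
    """Local stand-in for the amqp exception of the same name (assumption)."""


def retry_forever(sleep=time.sleep, delay=1.0):
    """Simplified, hypothetical analogue of forever_retry_uncaught_exceptions()."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            failures = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Nothing closer to the failure handled the error, so it
                    # lands here; the call is simply repeated after a pause.
                    failures += 1
                    LOG.exception("Unexpected exception occurred %d time(s)... "
                                  "retrying.", failures)
                    sleep(delay)
        return wrapper
    return decorator


def make_flaky_consumer(failures_before_success):
    """Build a callable that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def consume():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RecoverableConnectionError("connection already closed")
        return state["calls"]

    return consume
```

Note that the wrapper only repeats the call. If the wrapped function never re-opens its connection, every retry fails the same way and the counter keeps climbing, which would be consistent with the "61 time(s)" message above.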
It looks like the most likely place that should be handling it is
In the current oslo.messaging code the ensure() routine explicitly
handles connection errors (which RecoverableConnectionError is) and
socket timeouts--the ensure() routine in Havana doesn't do this.
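
As a rough sketch of the kind of handling described above (not the oslo.messaging implementation), an ensure()-style wrapper could catch connection errors and socket timeouts, reconnect, and retry. The signature, the FakeConnection class and the local RecoverableConnectionError class are all assumptions made for illustration.

```python
import socket
import time


class RecoverableConnectionError(Exception):
    """Local stand-in for the amqp exception of the same name (assumption)."""


def ensure(method, reconnect, max_retries=None, interval=0.0, sleep=time.sleep):
    """Call method(); on a connection error or socket timeout, reconnect and retry.

    max_retries=None means retry indefinitely. Any other exception propagates.
    """
    attempt = 0
    while True:
        try:
            return method()
        except (RecoverableConnectionError, socket.timeout):
            attempt += 1
            if max_retries is not None and attempt > max_retries:
                raise
            sleep(interval)
            # Re-establish the connection before trying the call again.
            reconnect()


class FakeConnection:
    """Hypothetical connection whose publish() fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.reconnects = 0

    def publish(self):
        if self.failures > 0:
            self.failures -= 1
            raise RecoverableConnectionError("connection already closed")
        return "sent"

    def reconnect(self):
        self.reconnects += 1
```

For example, ensure(conn.publish, conn.reconnect) with conn = FakeConnection(2) reconnects twice and then returns "sent". Without the except clause, the same error would escape to whatever outer retry loop happens to be wrapping the call.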