[openstack-dev] problems with rabbitmq on HA controller failure...anyone seen this?

David Koo kpublicmail at gmail.com
Sat Nov 30 00:37:43 UTC 2013


On Nov 29, 02:22:17 PM (Friday), Chris Friesen wrote:
> We're currently running Grizzly (going to Havana soon) and we're
> running into an issue where if the active controller is ungracefully
> killed then nova-compute on the compute node doesn't properly
> connect to the new rabbitmq server on the newly-active controller
> node.
> 
> I saw a bugfix in Folsom
> (https://bugs.launchpad.net/nova/+bug/718869) to retry the
> connection to rabbitmq if it's lost, but it doesn't seem to be
> properly handling this case.
> 
> Interestingly, killing and restarting nova-compute on the compute
> node seems to work, which implies that the retry code is doing
> something less effective than the initial startup.
> 
> Has anyone doing HA controller setups run into something similar?

    So the rabbitmq server and the controller are on the same node? My
guess is that it's related to this bug 856764 (RabbitMQ connections
lack heartbeat or TCP keepalives). The gist of it is that since there
are no heartbeats between the MQ and nova-compute, if the MQ goes down
ungracefully then nova-compute has no way of knowing. If the MQ goes
down gracefully then the MQ clients are notified and so the problem
doesn't arise.
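
    For illustration only, here is a minimal sketch of what TCP
keepalives look like on a plain client socket in Python. This is not
the nova or kombu code and not a fix for the bug; the function name
enable_keepalive and the timing values are assumptions made up for the
example. The TCP_KEEP* tuning options are platform-specific, so the
sketch only sets the ones the local socket module exposes.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive probes for an existing socket.

    The timing values are illustrative assumptions, not recommendations.
    """
    # Ask the kernel to probe an idle connection so a dead peer is noticed.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Optional tuning knobs; not every platform defines all of them.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before the drop
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

    With something like this in place, a peer that vanishes without
closing the connection is eventually reported to the client as a
socket error instead of the connection sitting idle indefinitely.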

    We got bitten by the same bug a while ago when our controller node
got hard reset without any warning! It came down to this bug (which,
unfortunately, doesn't have a fix yet). We worked around this bug by
implementing our own crude fix - we wrote a simple app to periodically
check if the MQ was alive (write a short message into the MQ, then
read it out again). When this check fails n times in a row we restart
nova-compute. Very ugly, but it worked!
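
    A rough sketch of that kind of watchdog loop follows. It is not
our actual app: the names watch, probe and restart are hypothetical,
and the two callables are left to the caller. probe would publish a
short message and read it back, returning True on success; restart
would restart nova-compute by whatever means the site uses.

```python
import time
from typing import Callable, Optional


def watch(probe: Callable[[], bool],
          restart: Callable[[], None],
          max_failures: int = 3,
          interval: float = 10.0,
          max_checks: Optional[int] = None,
          sleep: Callable[[float], None] = time.sleep) -> int:
    """Call restart() after max_failures consecutive failed probes.

    Returns the number of restarts triggered. max_checks bounds the
    loop (useful for testing); None means run forever.
    """
    failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            ok = probe()
        except Exception:
            # A probe that raises counts as a failed check.
            ok = False

        if ok:
            failures = 0  # any success resets the streak
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0  # start counting again after the restart
        sleep(interval)
    return restarts
```

    Requiring several failures in a row keeps a single slow or
dropped probe message from bouncing the service.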

--
Koo
