[openstack-dev] problems with rabbitmq on HA controller failure...anyone seen this?

Chris Friesen chris.friesen at windriver.com
Sat Nov 30 05:24:07 UTC 2013

On 11/29/2013 06:37 PM, David Koo wrote:
> On Nov 29, 02:22:17 PM (Friday), Chris Friesen wrote:
>> We're currently running Grizzly (going to Havana soon) and we're
>> running into an issue where if the active controller is ungracefully
>> killed then nova-compute on the compute node doesn't properly
>> connect to the new rabbitmq server on the newly-active controller
>> node.

>> Interestingly, killing and restarting nova-compute on the compute
>> node seems to work, which implies that the retry code is doing
>> something less effective than the initial startup.
>> Has anyone doing HA controller setups run into something similar?

As a followup, it looks like if I wait for 9 minutes or so I see a 
message in the compute logs:

2013-11-30 00:02:14.756 1246 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed

It then reconnects to the AMQP server and everything is fine after that.
However, any instances that I tried to boot during those 9 minutes
stay stuck in the "BUILD" status.

>      So the rabbitmq server and the controller are on the same node?

Yes, they are.

> My guess is that it's related to this bug 856764 (RabbitMQ connections
> lack heartbeat or TCP keepalives). The gist of it is that since there
> are no heartbeats between the MQ and nova-compute, if the MQ goes down
> ungracefully then nova-compute has no way of knowing. If the MQ goes
> down gracefully then the MQ clients are notified and so the problem
> doesn't arise.

Sounds about right.
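
To make the TCP keepalive half of that bug title concrete, here is a
minimal sketch of turning keepalives on for a plain Python socket. It is
not what nova or kombu do today; the helper name enable_keepalive and the
default timings are assumptions picked for illustration. With these
values an idle connection to a dead peer would be torn down after roughly
idle + interval * count = 60 + 10 * 5 = 110 seconds, rather than waiting
on much longer kernel timeouts.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for an already-created socket.

    Hypothetical helper for illustration; the timing defaults are assumptions.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below are platform-specific (all three exist on
    # Linux); skip any that this platform's socket module does not expose.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

Keepalives only tell the client that the peer is unreachable at the TCP
level. An AMQP-level heartbeat would additionally catch a broker that is
reachable but hung.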

>      We got bitten by the same bug a while ago when our controller node
> got hard reset without any warning! It came down to this bug (which,
> unfortunately, doesn't have a fix yet). We worked around this bug by
> implementing our own crude fix - we wrote a simple app to periodically
> check if the MQ was alive (write a short message into the MQ, then
> read it out again). When this fails n-times in a row we restart
> nova-compute. Very ugly, but it worked!

Sounds reasonable.
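
For anyone wanting to try the same workaround, a rough sketch of that
kind of watchdog could look like the following. This is not David's
actual app. The class name MQWatchdog, the probe and restart callables,
and the threshold are all hypothetical. The probe is expected to do the
"write a short message, read it back" round trip and return True on
success; the restart callable would restart nova-compute by whatever
means the distro provides.

```python
import time


class MQWatchdog:
    """Restart a service after N consecutive failed MQ probes (sketch only)."""

    def __init__(self, probe, restart, max_failures=3):
        self.probe = probe              # callable: True if the MQ round trip worked
        self.restart = restart          # callable: restarts nova-compute (assumed)
        self.max_failures = max_failures
        self.failures = 0

    def check_once(self):
        """Run one probe; return True if a restart was triggered."""
        try:
            ok = bool(self.probe())
        except Exception:
            # Any error from the probe (timeout, socket closed) is a failure.
            ok = False
        if ok:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.restart()
            self.failures = 0
            return True
        return False

    def run(self, interval=30.0):
        """Probe forever, sleeping `interval` seconds between checks."""
        while True:
            self.check_once()
            time.sleep(interval)
```

Counting only consecutive failures keeps a single dropped probe from
bouncing nova-compute, at the cost of a detection delay of about
max_failures * interval seconds.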

I did notice a kombu heartbeat change that was submitted and then backed 
out again because it was buggy. I guess we're still waiting on the real fix?


