[openstack-dev] problems with rabbitmq on HA controller failure...anyone seen this?
Chris Friesen
chris.friesen at windriver.com
Sat Nov 30 05:24:07 UTC 2013
On 11/29/2013 06:37 PM, David Koo wrote:
> On Nov 29, 02:22:17 PM (Friday), Chris Friesen wrote:
>> We're currently running Grizzly (going to Havana soon) and we're
>> running into an issue where if the active controller is ungracefully
>> killed then nova-compute on the compute node doesn't properly
>> connect to the new rabbitmq server on the newly-active controller
>> node.
>> Interestingly, killing and restarting nova-compute on the compute
>> node seems to work, which implies that the retry code is doing
>> something less effective than the initial startup.
>>
>> Has anyone doing HA controller setups run into something similar?
As a followup, it looks like if I wait for 9 minutes or so I see a
message in the compute logs:
2013-11-30 00:02:14.756 1246 ERROR nova.openstack.common.rpc.common [-]
Failed to consume message from queue: Socket closed
It then reconnects to the AMQP server and everything is fine after that.
However, any instances that I tried to boot during those 9 minutes
stay stuck in the "BUILD" status.
>
> So the rabbitmq server and the controller are on the same node?
Yes, they are.
> My
> guess is that it's related to this bug 856764 (RabbitMQ connections
> lack heartbeat or TCP keepalives). The gist of it is that since there
> are no heartbeats between the MQ and nova-compute, if the MQ goes down
> ungracefully then nova-compute has no way of knowing. If the MQ goes
> down gracefully then the MQ clients are notified and so the problem
> doesn't arise.
Sounds about right.
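For reference, here is a minimal sketch of what enabling TCP keepalives on a client socket looks like in Python. This is not the nova or kombu code: `enable_keepalive` and its default values are made up for illustration, and the TCP_KEEP* tuning options are only set where the platform exposes them (Linux does).

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for an existing socket.

    Hypothetical helper for illustration only; the defaults are
    arbitrary and not taken from any OpenStack configuration.
    """
    # Portable switch: ask the kernel to probe an idle connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Tuning knobs are platform-specific, so set only those that exist.
    for name, value in (("TCP_KEEPIDLE", idle),       # idle seconds before first probe
                        ("TCP_KEEPINTVL", interval),  # seconds between probes
                        ("TCP_KEEPCNT", count)):      # failed probes before giving up
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
```

With settings like these, the kernel should give up on a silent peer after roughly idle + interval * count seconds, so a hard-killed broker would surface as a socket error much sooner than it does today.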
> We got bitten by the same bug a while ago when our controller node
> got hard reset without any warning! It came down to this bug (which,
> unfortunately, doesn't have a fix yet). We worked around this bug by
> implementing our own crude fix - we wrote a simple app to periodically
> check if the MQ was alive (write a short message into the MQ, then
> read it out again). When this fails n times in a row we restart
> nova-compute. Very ugly, but it worked!
Sounds reasonable.
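A rough sketch of that kind of watchdog loop, as I understand the description. Everything here is hypothetical: `probe` stands in for "write a short message into the MQ, then read it out again" and `restart` stands in for restarting nova-compute; neither is shown, and the names and defaults are mine, not David's.

```python
import time
from typing import Callable, Optional


def watch_broker(probe: Callable[[], bool],
                 restart: Callable[[], None],
                 max_failures: int = 3,
                 interval: float = 10.0,
                 max_checks: Optional[int] = None,
                 sleep: Callable[[float], None] = time.sleep) -> int:
    """Call restart() after max_failures consecutive failed probes.

    probe   -- hypothetical callable: True if a message round-trips
               through the broker, False (or an exception) otherwise.
    restart -- hypothetical callable that restarts the affected service.
    Returns the number of restarts triggered (useful for testing).
    """
    failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            ok = probe()
        except Exception:
            # A probe that blows up counts as a failed check.
            ok = False

        if ok:
            failures = 0  # any success resets the streak
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0  # start counting again after the restart

        sleep(interval)
    return restarts
```

Requiring several failures in a row keeps a single slow round trip from bouncing the service; the probe itself would need its own timeout so that it cannot hang on the same dead connection.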
I did notice a kombu heartbeat change that was submitted and then backed
out again because it was buggy. I guess we're still waiting on the real fix?
Chris