<div dir="ltr">We had the same problem in our deployment. Here is a brief description of what we saw and how we fixed it:<div><a href="http://l4tol7.blogspot.com/2013/12/openstack-rabbitmq-issues.html">http://l4tol7.blogspot.com/2013/12/openstack-rabbitmq-issues.html</a><br>
</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Dec 2, 2013 at 10:37 AM, Vishvananda Ishaya <span dir="ltr"><<a href="mailto:vishvananda@gmail.com" target="_blank">vishvananda@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>
On Nov 29, 2013, at 9:24 PM, Chris Friesen <<a href="mailto:chris.friesen@windriver.com">chris.friesen@windriver.com</a>> wrote:<br>
<br>
> On 11/29/2013 06:37 PM, David Koo wrote:<br>
>> On Nov 29, 02:22:17 PM (Friday), Chris Friesen wrote:<br>
>>> We're currently running Grizzly (going to Havana soon) and we're<br>
>>> running into an issue where if the active controller is ungracefully<br>
>>> killed then nova-compute on the compute node doesn't properly<br>
>>> connect to the new rabbitmq server on the newly-active controller<br>
>>> node.<br>
><br>
>>> Interestingly, killing and restarting nova-compute on the compute<br>
>>> node seems to work, which implies that the retry code is doing<br>
>>> something less effective than the initial startup.<br>
>>><br>
>>> Has anyone doing HA controller setups run into something similar?<br>
><br>
> As a followup, it looks like if I wait for 9 minutes or so I see a message in the compute logs:<br>
><br>
> 2013-11-30 00:02:14.756 1246 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed<br>
><br>
> It then reconnects to the AMQP server and everything is fine after that. However, any instances that I tried to boot during those 9 minutes stay stuck in the "BUILD" status.<br>
><br>
><br>
>><br>
>> So the rabbitmq server and the controller are on the same node?<br>
><br>
> Yes, they are.<br>
><br>
>> My<br>
>> guess is that it's related to this bug 856764 (RabbitMQ connections<br>
>> lack heartbeat or TCP keepalives). The gist of it is that since there<br>
>> are no heartbeats between the MQ and nova-compute, if the MQ goes down<br>
>> ungracefully then nova-compute has no way of knowing. If the MQ goes<br>
>> down gracefully then the MQ clients are notified and so the problem<br>
>> doesn't arise.<br>
><br>
> Sounds about right.<br>
><br>
>> We got bitten by the same bug a while ago when our controller node<br>
>> got hard reset without any warning! It came down to this bug (which,<br>
>> unfortunately, doesn't have a fix yet). We worked around this bug by<br>
>> implementing our own crude fix - we wrote a simple app to periodically<br>
>> check if the MQ was alive (write a short message into the MQ, then<br>
>> read it out again). When this fails n-times in a row we restart<br>
>> nova-compute. Very ugly, but it worked!<br>
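<br>
The watchdog described above could look roughly like the following Python sketch. This is not the original app: the names watch_broker, probe and restart are hypothetical, probe stands in for "write a short message into the MQ, then read it out again", and restart stands in for whatever restarts nova-compute on your system.<br>
<br>
```python
import time
from typing import Callable, Optional


def watch_broker(probe: Callable[[], bool],
                 restart: Callable[[], None],
                 max_failures: int = 3,
                 interval: float = 10.0,
                 max_checks: Optional[int] = None,
                 sleep: Callable[[float], None] = time.sleep) -> int:
    """Run probe() periodically and call restart() after max_failures
    consecutive failed probes. Returns the number of restarts issued.

    probe and restart are supplied by the caller (hypothetical hooks).
    max_checks=None means loop forever.
    """
    failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            ok = probe()
        except Exception:
            # A probe that raises counts as a failed check.
            ok = False
        if ok:
            failures = 0  # any success resets the streak
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0
        sleep(interval)
    return restarts
```

Only consecutive failures count, so a single slow or dropped probe does not restart the service.<br>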
><br>
> Sounds reasonable.<br>
><br>
> I did notice a kombu heartbeat change that was submitted and then backed out again because it was buggy. I guess we're still waiting on the real fix?<br>
<br>
</div></div>Hi Chris,<br>
<br>
This general problem comes up a lot, and one fix is to use keepalives. Note that more is needed if you are using multi-master rabbitmq, but for failover I have had great success with the following (also posted to the bug):<br>
<br>
When a connection is cut off completely, the receiving side doesn't know that the connection has dropped, so you can end up with a half-open connection. The general solution for this on Linux is to turn on TCP keepalives (the SO_KEEPALIVE socket option). Kombu will enable keepalives if the version number is high enough (>1.0 iirc), but rabbit needs to be specially configured to send keepalives on the connections that it creates.<br>
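<br>
For illustration, this is roughly what turning on keepalives looks like at the socket level in Python. It is a sketch, not Kombu's actual code: enable_keepalive is a hypothetical helper, and the TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT per-socket overrides are platform-specific, so they are applied only when the socket module exposes them.<br>
<br>
```python
import socket


def enable_keepalive(sock, idle=5, interval=1, probes=5):
    """Enable TCP keepalives on sock (hypothetical helper).

    idle, interval and probes override the system-wide sysctl values for
    this one socket, where the platform supports it.
    """
    # Portable part: ask the kernel to send keepalive probes at all.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Platform-specific part: per-socket timing overrides.
    overrides = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # unanswered probes before giving up
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Without the per-socket overrides, the probe timing comes from the system-wide keepalive sysctl settings.<br>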
<br>
So solving the HA issue generally involves a rabbit config with a section like the following:<br>
<br>
[<br>
{rabbit, [{tcp_listen_options, [binary,<br>
{packet, raw},<br>
{reuseaddr, true},<br>
{backlog, 128},<br>
{nodelay, true},<br>
{exit_on_close, false},<br>
{keepalive, true}]}<br>
]}<br>
].<br>
<br>
Then you should also shorten the keepalive sysctl settings or it will still take ~2 hrs to terminate the connections:<br>
<br>
echo "5" > /proc/sys/net/ipv4/tcp_keepalive_time<br>
echo "5" > /proc/sys/net/ipv4/tcp_keepalive_probes<br>
echo "1" > /proc/sys/net/ipv4/tcp_keepalive_intvl<br>
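<br>
As a rough check on what these values buy you: a dead peer is detected after approximately tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl seconds. The sketch below compares the values above with the usual Linux defaults (7200, 9 and 75, which is an assumption about your kernel); the function name is made up for illustration.<br>
<br>
```python
def keepalive_detection_seconds(keepalive_time, probes, interval):
    """Approximate seconds until the kernel drops a dead, idle connection:
    idle time before the first probe, plus one interval per unanswered probe.
    """
    return keepalive_time + probes * interval


# Values from the echo commands above.
tuned = keepalive_detection_seconds(keepalive_time=5, probes=5, interval=1)

# Usual Linux defaults (assumed; check your own sysctl values).
defaults = keepalive_detection_seconds(keepalive_time=7200, probes=9, interval=75)

print(tuned)     # 10
print(defaults)  # 7875
```

So the tuned settings notice a dead controller in about 10 seconds, versus a bit over 2 hours with the defaults.<br>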
<br>
Obviously this should be done in a sysctl config file instead of at the command line. Note that if you only want to shorten the rabbit keepalives but keep everything else as a default, you can use an LD_PRELOAD library to do so. For example you could use:<br>
<br>
<a href="https://github.com/meebey/force_bind/blob/master/README" target="_blank">https://github.com/meebey/force_bind/blob/master/README</a><br>
<br>
Vish<br>
<div class="HOEnZb"><div class="h5"><br>
><br>
> Chris<br>
><br>
><br>
> _______________________________________________<br>
> OpenStack-dev mailing list<br>
> <a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>
> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
<br>
</div></div><br>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br>Ravi<br>
</div>