<div dir="ltr">We had the same problem in our deployment.  Here is a brief description of what we saw and how we fixed it.<div><a href="http://l4tol7.blogspot.com/2013/12/openstack-rabbitmq-issues.html">http://l4tol7.blogspot.com/2013/12/openstack-rabbitmq-issues.html</a><br>


</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Mon, Dec 2, 2013 at 10:37 AM, Vishvananda Ishaya <span dir="ltr"><<a href="mailto:vishvananda@gmail.com" target="_blank">vishvananda@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>

On Nov 29, 2013, at 9:24 PM, Chris Friesen <<a href="mailto:chris.friesen@windriver.com">chris.friesen@windriver.com</a>> wrote:<br>

<br>

> On 11/29/2013 06:37 PM, David Koo wrote:<br>

>> On Nov 29, 02:22:17 PM (Friday), Chris Friesen wrote:<br>

>>> We're currently running Grizzly (going to Havana soon) and we're<br>

>>> running into an issue where if the active controller is ungracefully<br>

>>> killed then nova-compute on the compute node doesn't properly<br>

>>> connect to the new rabbitmq server on the newly-active controller<br>

>>> node.<br>

><br>

>>> Interestingly, killing and restarting nova-compute on the compute<br>

>>> node seems to work, which implies that the retry code is doing<br>

>>> something less effective than the initial startup.<br>

>>><br>

>>> Has anyone doing HA controller setups run into something similar?<br>

><br>

> As a followup, it looks like if I wait for 9 minutes or so I see a message in the compute logs:<br>

><br>

> 2013-11-30 00:02:14.756 1246 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed<br>

><br>

> It then reconnects to the AMQP server and everything is fine after that.  However, any instances that I tried to boot during those 9 minutes stay stuck in the "BUILD" status.<br>

><br>

><br>

>><br>

>>     So the rabbitmq server and the controller are on the same node?<br>

><br>

> Yes, they are.<br>

><br>

>> My<br>

>> guess is that it's related to this bug 856764 (RabbitMQ connections<br>

>> lack heartbeat or TCP keepalives). The gist of it is that since there<br>

>> are no heartbeats between the MQ and nova-compute, if the MQ goes down<br>

>> ungracefully then nova-compute has no way of knowing. If the MQ goes<br>

>> down gracefully then the MQ clients are notified and so the problem<br>

>> doesn't arise.<br>

><br>

> Sounds about right.<br>

><br>

>>     We got bitten by the same bug a while ago when our controller node<br>

>> got hard reset without any warning! It came down to this bug (which,<br>

>> unfortunately, doesn't have a fix yet). We worked around this bug by<br>

>> implementing our own crude fix - we wrote a simple app to periodically<br>

>> check if the MQ was alive (write a short message into the MQ, then<br>

>> read it out again). When this fails n-times in a row we restart<br>

>> nova-compute. Very ugly, but it worked!<br>

><br>

> Sounds reasonable.<br>

><br>

> I did notice a kombu heartbeat change that was submitted and then backed out again because it was buggy. I guess we're still waiting on the real fix?<br>

<br>

</div></div>Hi Chris,<br>

<br>

This general problem comes up a lot, and one fix is to use keepalives. Note that more is needed if you are using multi-master rabbitmq, but for failover I have had great success with the following (also posted to the bug):<br>


<br>

When a connection is cut off completely (for example, when the peer is hard reset), the surviving side doesn't know that the connection has dropped, so you can end up with a half-open connection. The general solution for this on Linux is to turn on TCP keepalives (the SO_KEEPALIVE socket option). Kombu will enable keepalives if the version number is high enough (>1.0 iirc), but rabbit needs to be specially configured to send keepalives on the connections that it creates.<br>
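<br>
As a rough illustration of what enabling keepalives means at the socket level, here is a minimal Python sketch. It is not taken from kombu or nova; the helper name enable_keepalive and its default values are assumptions for illustration, and the TCP_KEEP* options are Linux-specific, so the sketch skips them where the platform does not define them.<br>
<pre>
```python
import socket


def enable_keepalive(sock, idle=5, interval=1, probes=5):
    """Turn on TCP keepalives for one socket (hypothetical helper).

    idle: seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    probes: unanswered probes before the kernel drops the connection
    """
    # SO_KEEPALIVE switches keepalive probing on for this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* options override the system-wide sysctl values for this
    # socket only. Skip any option the platform does not define, in which
    # case the system-wide values apply.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```
</pre>
Calling enable_keepalive(sock) on a client socket before connecting would make the kernel probe an idle connection and close it once the probes go unanswered, which is what lets the client notice a dead peer.<br>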


<br>

So solving the HA issue generally involves a rabbit config with a section like the following:<br>

<br>

[<br>

 {rabbit, [{tcp_listen_options, [binary,<br>

                                {packet, raw},<br>

                                {reuseaddr, true},<br>

                                {backlog, 128},<br>

                                {nodelay, true},<br>

                                {exit_on_close, false},<br>

                                {keepalive, true}]}<br>

          ]}<br>

].<br>

<br>

Then you should also shorten the keepalive sysctl settings or it will still take ~2 hrs to terminate the connections:<br>

<br>

echo "5" > /proc/sys/net/ipv4/tcp_keepalive_time<br>

echo "5" > /proc/sys/net/ipv4/tcp_keepalive_probes<br>

echo "1" > /proc/sys/net/ipv4/tcp_keepalive_intvl<br>

<br>

Obviously this should be done in a sysctl config file instead of at the command line, since values written to /proc are lost on reboot. Note that these sysctls apply to every socket on the host that enables keepalives; if you only want to shorten the rabbit keepalives but keep everything else at the defaults, you can use an LD_PRELOAD library to do so. For example you could use:<br>


<br>

<a href="https://github.com/meebey/force_bind/blob/master/README" target="_blank">https://github.com/meebey/force_bind/blob/master/README</a><br>
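<br>
To put numbers on the keepalive sysctl settings above, here is a small worked calculation in Python. It is only a sketch: the function name is hypothetical, and the formula is the usual approximation (idle time before the first probe, plus one interval per unanswered probe), not an exact kernel guarantee.<br>
<pre>
```python
def keepalive_detection_seconds(time, intvl, probes):
    """Approximate seconds before the kernel gives up on a silent peer.

    The first probe goes out after `time` idle seconds; the connection is
    dropped once `probes` probes, sent `intvl` seconds apart, go unanswered.
    """
    return time + intvl * probes


# Linux defaults: tcp_keepalive_time=7200, tcp_keepalive_intvl=75,
# tcp_keepalive_probes=9.
default_seconds = keepalive_detection_seconds(7200, 75, 9)

# Values from the echo commands in this message: time=5, intvl=1, probes=5.
tuned_seconds = keepalive_detection_seconds(5, 1, 5)

print(default_seconds, tuned_seconds)  # prints "7875 10"
```
</pre>
So with the defaults a dead connection lingers for a little over two hours, while the shortened values bring that down to roughly ten seconds.<br>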

<br>

Vish<br>

<div class="HOEnZb"><div class="h5"><br>

><br>

> Chris<br>

><br>

><br>

> _______________________________________________<br>

> OpenStack-dev mailing list<br>

> <a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>

> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>

<br>

</div></div><br>

<br></blockquote></div><br><br clear="all"><div><br></div>-- <br>Ravi<br>

</div>