<div dir="ltr">It's certainly a pain to diagnose.<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Feb 5, 2015 at 3:19 PM, Kris G. Lindgren <span dir="ltr"><<a href="mailto:klindgren@godaddy.com" target="_blank">klindgren@godaddy.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Is Mirantis going to have someone at the ops mid-cycle?  We were talking<br>

about this in the operators channel today and it seemed like pretty much<br>

everyone who was active has problems with rabbitmq.  Either from<br>

maintenance, failovers, or transient issues and having to "restart world"<br>

after rabbitmq hiccups to get things to recover.  I am thinking if the<br>

issue is relatively prevalent it would be nice for people who have either<br>

"figured it out" or have something that is working to discuss their setup.<br>

 We noticed that Mirantis has a number of patches to oslo.messaging to fix<br>

rabbitmq specific stuff.  So I was hoping that someone could come and talk<br>

about what Mirantis has done there to make it better and if it's "there<br>

yet" and if not what still needs to be done.<br>

<br>

We use clustered rabbitmq + LB and honestly this config on paper is<br>

"better" but in practice it is nothing short of a nightmare.  Any maintenance<br>

done on rabbitmq (restart/patching, etc.) or the load balancer seems to<br>

cause clients to not notice that they are no longer correctly connected to<br>

the rabbitmq server and they will sit happily, doing nothing, until they<br>

are restarted. We had similar problems listing all of the rabbitmq servers<br>

out in the configuration as well.  So far my experience has been any<br>

maintenance that touches rabbitmq is going to require a restart of all<br>

services that communicate over rpc to avoid hard-to-troubleshoot (i.e. silent<br>

errors) rpc issues.<br>

<br>

In my experience rabbitmq is pretty much the #1 cause of issues in our<br>

environment and I think other operators would agree with that as well.<br>

Anything that would make rabbit + openstack more stable would be very<br>

welcome.<br>

<span class="im HOEnZb"><br>

____________________________________________<br>

<br>

Kris Lindgren<br>

Senior Linux Systems Engineer<br>

GoDaddy, LLC.<br>

<br>

<br>

</span><div class="HOEnZb"><div class="h5">On 1/20/15, 8:33 AM, "Andrew Woodward" <<a href="mailto:xarses@gmail.com">xarses@gmail.com</a>> wrote:<br>

<br>

>So this is exactly what we (@mirantis) ran into while working on the<br>

>HA story in Fuel / Mirantis OpenStack.<br>

><br>

>The short message is that without heartbeat keepalives, rabbit is unable to<br>

>properly keep track of partially open connections, resulting in consumers<br>

>(not senders) believing that they have a live connection to rabbit<br>

>when in fact they don't.<br>

><br>

>Summary of the parts needed for rabbit HA<br>

>* rabbit heartbeats (<a href="https://review.openstack.org/#/c/146047/" target="_blank">https://review.openstack.org/#/c/146047/</a>) the<br>

>oslo.messaging team is working to merge this and is well aware it's a<br>

>critical need for rabbit HA.<br>

>* rabbit_hosts with a list of all rabbit nodes (haproxy should be<br>

>avoided except for services that don't support rabbit_hosts [list of<br>

>servers] there are further needs to make haproxy behave properly in<br>

>HA)<br>

>* consumer_cancel_notify (CCN)<br>

>* rabbit greater than 3.3.0<br>

><br>

>Optional:<br>

>* rip failed nodes out of the mnesia db. We found that rabbit node down<br>

>discovery was slower than we wanted (minutes) and we can force an<br>

>election sooner by ripping the failed node out of mnesia. (in this case<br>

>Pacemaker tells us this) we have a master/slave type mechanism in our<br>

>pacemaker script to perform this.<br>

><br>

>The long message on rabbit connections.<br>

><br>

>Through a quite long process we found that due to the way rabbit uses<br>

>connections from erlang, it won't close connections; instead rabbit<br>

>(can) send a consumer cancel notification. The consumer upon receiving<br>

>this message is supposed to hang up and reconnect. Otherwise the<br>

>connection is reaped by the linux kernel when the TCP connection<br>

>timeout is reached (2 hours). Publishers notice the next<br>

>time they attempt to send a message to the queue (because it's not<br>

>acknowledged) and tend to hang up and reconnect on their own.<br>

><br>

>What you will observe after removing a rabbit node is that on a compute<br>

>node ~1/3 of the rabbit connections re-establish to the remaining rabbit<br>

>node(s) while the others leave sockets open to the down server (using<br>

>netstat, strace, lsof)<br>

><br>

>fixes that don't work well<br>

>* turning down TCP timeouts (LD_PRELOAD or system-wide). While it will<br>

>shorten the 2 hour recovery, going lower than 15 minutes leads<br>

>to frequent false disconnects and tends towards bad behavior<br>

>* rabbit in haproxy. This further masks the partial connection<br>

>problem. Although we stopped using it, it might be better now with<br>

>heartbeats enabled.<br>

>* script to check for partial connections in rabbit server and<br>

>forcibly close them. A partial solution that actually gets the job<br>

>done the best besides heartbeats. It sometimes killed innocent<br>

>connections for us.<br>

><br>

>Heartbeats fix this by running a ping/ack in a separate channel &<br>

>thread. This allows for the consumer to have a response from rabbit<br>

>that will ensure that the connections have not gone away via stale<br>

>sockets. When combined with CCN, it works in multiple failure<br>

>conditions as expected and the rabbit consumers can be healthy within 1<br>

>minute.<br>
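To illustrate the heartbeat idea described above, here is a minimal Python sketch of the client-side bookkeeping only. It is not the oslo.messaging patch: the class name HeartbeatMonitor, the parameters interval_s and missed_allowed, and the 30 second / two missed beats values are all assumptions made for the example.<br>

```python
import time


class HeartbeatMonitor:
    """Track when the broker was last heard from and flag stale connections.

    Hypothetical helper for illustration; not part of oslo.messaging or kombu.
    """

    def __init__(self, interval_s, missed_allowed=2, clock=time.monotonic):
        self.interval_s = interval_s          # expected gap between heartbeats
        self.missed_allowed = missed_allowed  # beats we tolerate missing
        self._clock = clock                   # injectable so it can be tested
        self._last_seen = clock()

    def record_ack(self):
        # Call whenever a heartbeat ack (or any other frame) arrives.
        self._last_seen = self._clock()

    def is_stale(self):
        # Stale once more than `missed_allowed` intervals pass in silence.
        elapsed = self._clock() - self._last_seen
        return elapsed > self.interval_s * self.missed_allowed
```

A consumer loop would call record_ack() on every frame from rabbit and tear down and re-open the connection once is_stale() returns True, instead of waiting for the kernel TCP timeout.<br>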

><br>

><br>

>On Mon, Jan 19, 2015 at 2:55 PM, Gustavo Randich<br>

><<a href="mailto:gustavo.randich@gmail.com">gustavo.randich@gmail.com</a>> wrote:<br>

>> In the meantime, I'm using this horrendous script inside compute nodes<br>

>>to<br>

>> check for rabbitmq connectivity. It uses the 'set_host_enabled' rpc<br>

>>call,<br>

>> which in my case is innocuous.<br>

><br>

>This will still result in partial connections if you don't do CCN<br>

><br>

>><br>

>> #!/bin/bash<br>

>> UUID=$(cat /proc/sys/kernel/random/uuid)<br>

>> RABBIT=$(grep -Po '(?<=rabbit_host = ).+' /etc/nova/nova.conf)<br>

>> HOSTX=$(hostname)<br>

>> python -c "<br>

>> import pika<br>

>> connection =<br>

>>pika.BlockingConnection(pika.ConnectionParameters(\"$RABBIT\"))<br>

>> channel = connection.channel()<br>

>> channel.basic_publish(exchange='nova', routing_key=\"compute.$HOSTX\",<br>

>> properties=pika.BasicProperties(content_type = 'application/json'),<br>

>>     body = '{ \"version\": \"3.0\", \"_context_request_id\": \"$UUID\",<br>

>>\\<br>

>>       \"_context_roles\": [\"KeystoneAdmin\", \"KeystoneServiceAdmin\",<br>

>> \"admin\"], \\<br>

>>       \"_context_user_id\": \"XXX\", \\<br>

>>       \"_context_project_id\": \"XXX\", \\<br>

>>       \"method\": \"set_host_enabled\", \\<br>

>>       \"args\": {\"enabled\": true} \\<br>

>>     }'<br>

>> )<br>

>> connection.close()"<br>

>> sleep 2<br>

>> tail -1000 /var/log/nova/nova-compute.log | grep -q $UUID || { echo<br>

>> "WARNING: nova-compute not consuming RabbitMQ messages. Last message:<br>

>> $UUID"; exit 1; }<br>

>> echo "OK"<br>
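The final check in the script above (tail the log, grep for the UUID) can also be done in Python, which may be convenient if the probe is folded into a monitoring plugin. This is only a sketch: the function name log_tail_contains is made up for the example, and it assumes, as the script does, that nova-compute logs the request id of the message it consumed.<br>

```python
from collections import deque


def log_tail_contains(path, marker, max_lines=1000):
    """Return True if `marker` appears in the last `max_lines` lines of `path`."""
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the newest `max_lines` lines in memory.
        tail = deque(handle, maxlen=max_lines)
    return any(marker in line for line in tail)
```

Called as log_tail_contains("/var/log/nova/nova-compute.log", uuid) after the publish and the short sleep, a False result corresponds to the WARNING branch of the shell script.<br>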

>><br>

>><br>

>> On Thu, Jan 15, 2015 at 9:48 PM, Sam Morrison <<a href="mailto:sorrison@gmail.com">sorrison@gmail.com</a>><br>

>>wrote:<br>

>>><br>

>>> We've had a lot of issues with Icehouse related to rabbitMQ. Basically<br>

>>>the<br>

>>> change from openstack.rpc to oslo.messaging broke things. These things<br>

>>>are<br>

>>> now fixed in oslo.messaging version 1.5.1, there is still an issue with<br>

>>> heartbeats and that patch is making its way through the review process<br>

>>>now.<br>

>>><br>

>>> <a href="https://review.openstack.org/#/c/146047/" target="_blank">https://review.openstack.org/#/c/146047/</a><br>

>>><br>

>>> Cheers,<br>

>>> Sam<br>

>>><br>

>>><br>

>>> On 16 Jan 2015, at 10:55 am, sridhar basam <<a href="mailto:sridhar.basam@gmail.com">sridhar.basam@gmail.com</a>><br>

>>> wrote:<br>

>>><br>

>>><br>

>>> If you are using ha queues, use a version of rabbitmq > 3.3.0. There<br>

>>>was a<br>

>>> change in that version where consumption on queues was automatically<br>

>>>enabled<br>

>>> when a master election for a queue happened. Previous versions only<br>

>>>informed<br>

>>> clients that they had to reconsume on a queue. It was the client's<br>

>>> responsibility to start consumption on a queue.<br>

>>><br>

>>> Make sure you enable tcp keepalives to a low enough value in case you<br>

>>>have<br>

>>> a firewall device in between your rabbit server and its consumers.<br>

>>><br>

>>> Monitor consumers on your rabbit infrastructure using 'rabbitmqctl<br>

>>> list_queues name messages consumers'. The number of consumers on fanout queues is<br>

>>>going to<br>

>>> depend on the number of services of any type you have in your<br>

>>>environment.<br>
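As a rough sketch of how that monitoring could be automated, the Python function below scans the text printed by 'rabbitmqctl list_queues name messages consumers' and returns queues that hold messages but have no consumers. The function name is hypothetical, and the parsing assumes one tab-separated queue per line with optional banner lines around it, which may differ between rabbitmq versions.<br>

```python
def queues_without_consumers(listing):
    """Return names of queues that hold messages but have no consumers.

    `listing` is the captured output of
    `rabbitmqctl list_queues name messages consumers`.
    """
    stuck = []
    for line in listing.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip banner lines such as "Listing queues ..."
        name, messages, consumers = fields
        if not (messages.isdigit() and consumers.isdigit()):
            continue  # skip a header row, if this version prints one
        if int(messages) > 0 and int(consumers) == 0:
            stuck.append(name)
    return stuck
```

A queue such as compute.HOSTNAME showing up in the result is a hint that the matching service has lost its consumer and may need a restart. Fanout queues need a separate expected count, as noted above.<br>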

>>><br>

>>> Sri<br>

>>><br>

>>> On Jan 15, 2015 6:27 PM, "Michael Dorman" <<a href="mailto:mdorman@godaddy.com">mdorman@godaddy.com</a>> wrote:<br>

>>>><br>

>>>> Here is the bug I've been tracking related to this for a while.  I<br>

>>>> haven't really kept up to speed with it, so I don't know the current<br>

>>>>status.<br>

>>>><br>

>>>> <a href="https://bugs.launchpad.net/nova/+bug/856764" target="_blank">https://bugs.launchpad.net/nova/+bug/856764</a><br>

>>>><br>

>>>><br>

>>>> From: Kris Lindgren <<a href="mailto:klindgren@godaddy.com">klindgren@godaddy.com</a>><br>

>>>> Date: Thursday, January 15, 2015 at 12:10 PM<br>

>>>> To: Gustavo Randich <<a href="mailto:gustavo.randich@gmail.com">gustavo.randich@gmail.com</a>>, OpenStack Operators<br>

>>>> <<a href="mailto:openstack-operators@lists.openstack.org">openstack-operators@lists.openstack.org</a>><br>

>>>> Subject: Re: [Openstack-operators] Way to check compute <-> rabbitmq<br>

>>>> connectivity<br>

>>>><br>

>>>> During the Atlanta ops meeting this topic came up and I specifically<br>

>>>> mentioned adding a "no-op" or healthcheck ping to the rabbitmq<br>

>>>>stuff<br>

>>>> to both nova & neutron.  The devs in the room looked at me like I was<br>

>>>> crazy, but it was so that we could catch exactly the issues you<br>

>>>>described.  I<br>

>>>> am also interested if anyone knows of a lightweight call that could<br>

>>>>be used<br>

>>>> to verify/confirm rabbitmq connectivity as well.  I haven't been able<br>

>>>>to<br>

>>>> devote time to dig into it.  Mainly because if one client is having<br>

>>>>issues -<br>

>>>> you will notice other clients are having similar/silent errors and a<br>

>>>>restart<br>

>>>> of all the things is the easiest way to fix, for us at least.<br>

>>>> ____________________________________________<br>

>>>><br>

>>>> Kris Lindgren<br>

>>>> Senior Linux Systems Engineer<br>

>>>> GoDaddy, LLC.<br>

>>>><br>

>>>><br>

>>>> From: Gustavo Randich <<a href="mailto:gustavo.randich@gmail.com">gustavo.randich@gmail.com</a>><br>

>>>> Date: Thursday, January 15, 2015 at 11:53 AM<br>

>>>> To: "<a href="mailto:openstack-operators@lists.openstack.org">openstack-operators@lists.openstack.org</a>"<br>

>>>> <<a href="mailto:openstack-operators@lists.openstack.org">openstack-operators@lists.openstack.org</a>><br>

>>>> Subject: Re: [Openstack-operators] Way to check compute <-> rabbitmq<br>

>>>> connectivity<br>

>>>><br>

>>>> Just to add one more background scenario, we also had similar problems<br>

>>>> trying to load balance rabbitmq via F5 Big IP LTM. For that reason we<br>

>>>>don't<br>

>>>> use it now. Our installation is a single rabbitmq instance and no<br>

>>>> intermediaries (apart from network switches). We use Folsom and Icehouse,<br>

>>>>the<br>

>>>> problem being perceived more in Icehouse nodes.<br>

>>>><br>

>>>> We are already monitoring message queue size, but we would like to<br>

>>>> pinpoint in semi-realtime the specific hosts/racks/network paths<br>

>>>> experiencing the "stale connection" before a user complains about an<br>

>>>> operation being stuck, or even hosts with no such pending operations<br>

>>>>but<br>

>>>> already "disconnected" -- we also could diagnose possible network<br>

>>>>causes and<br>

>>>> avoid massive service restarting.<br>

>>>><br>

>>>> So, for now, if someone knows about a cheap and quick openstack<br>

>>>>operation<br>

>>>> that triggers a message interchange between rabbitmq and nova-compute<br>

>>>>and a<br>

>>>> way of checking the result it would be great.<br>

>>>><br>

>>>><br>

>>>><br>

>>>><br>

>>>> On Thu, Jan 15, 2015 at 1:45 PM, Kris G. Lindgren<br>

>>>><<a href="mailto:klindgren@godaddy.com">klindgren@godaddy.com</a>><br>

>>>> wrote:<br>

>>>>><br>

>>>>> We did have an issue using celery  on an internal application that we<br>

>>>>> wrote - but I believe it was fixed after much failover testing and<br>

>>>>>code<br>

>>>>> changes.  We also use logstash via rabbitmq and haven't noticed any<br>

>>>>>issues<br>

>>>>> there either.<br>

>>>>><br>

>>>>> So this seems to be just openstack/oslo related.<br>

>>>>><br>

>>>>> We have tried a number of different configurations - all of them had<br>

>>>>> their issues.  We started out listing all the members in the cluster<br>

>>>>>on the<br>

>>>>> rabbit_hosts line.  This worked most of the time without issue,<br>

>>>>>until we<br>

>>>>> would restart one of the servers, then it seemed like the clients<br>

>>>>>wouldn't<br>

>>>>> figure out they were disconnected and reconnect to the next host.<br>

>>>>><br>

>>>>> In an attempt to solve that we moved to using haproxy to present a<br>

>>>>>vip<br>

>>>>> that we configured in the rabbit_hosts line.  This created issues<br>

>>>>>with long<br>

>>>>> lived connections disconnects and a bunch of other issues.  In our<br>

>>>>> production environment we moved to load balanced rabbitmq, but using<br>

>>>>>a real<br>

>>>>> loadbalancer, and don't have the weird disconnect issues.  However,<br>

>>>>>anytime<br>

>>>>> we reboot/take down a rabbitmq host or pull a member from the<br>

>>>>>cluster we<br>

>>>>> have issues, or if there is a network disruption we also have issues.<br>

>>>>><br>

>>>>> Thinking the best course of action is to move rabbitmq off on to its<br>

>>>>>own<br>

>>>>> box and to leave it alone.<br>

>>>>><br>

>>>>> Does anyone have a rabbitmq setup that works well and doesn't have<br>

>>>>> random issues when pulling nodes for maintenance?<br>

>>>>> ____________________________________________<br>

>>>>><br>

>>>>> Kris Lindgren<br>

>>>>> Senior Linux Systems Engineer<br>

>>>>> GoDaddy, LLC.<br>

>>>>><br>

>>>>><br>

>>>>> From: Joe Topjian <<a href="mailto:joe@topjian.net">joe@topjian.net</a>><br>

>>>>> Date: Thursday, January 15, 2015 at 9:29 AM<br>

>>>>> To: "Kris G. Lindgren" <<a href="mailto:klindgren@godaddy.com">klindgren@godaddy.com</a>><br>

>>>>> Cc: "<a href="mailto:openstack-operators@lists.openstack.org">openstack-operators@lists.openstack.org</a>"<br>

>>>>> <<a href="mailto:openstack-operators@lists.openstack.org">openstack-operators@lists.openstack.org</a>><br>

>>>>> Subject: Re: [Openstack-operators] Way to check compute <-> rabbitmq<br>

>>>>> connectivity<br>

>>>>><br>

>>>>> Hi Kris,<br>

>>>>><br>

>>>>>>  Our experience is pretty much the same on anything that is using<br>

>>>>>> rabbitmq - not just nova-compute.<br>

>>>>><br>

>>>>><br>

>>>>> Just to clarify: have you experienced this outside of OpenStack (or<br>

>>>>> Oslo)?<br>

>>>>><br>

>>>>> We've seen similar issues with rabbitmq and OpenStack. We used to run<br>

>>>>> rabbit through haproxy and tried a myriad of options like setting no<br>

>>>>> timeouts, very very long timeouts, etc, but would always eventually<br>

>>>>>see<br>

>>>>> similar issues as described.<br>

>>>>><br>

>>>>> Last month, we reconfigured all OpenStack components to use the<br>

>>>>> `rabbit_hosts` option with all nodes in our cluster listed. So far<br>

>>>>>this has<br>

>>>>> worked well, though I probably just jinxed myself. :)<br>

>>>>><br>

>>>>> We still have other services (like Sensu) using the same rabbitmq<br>

>>>>> cluster and accessing it through haproxy. We've never had any issues<br>

>>>>>there.<br>

>>>>><br>

>>>>> What's also strange is that I have another OpenStack deployment (from<br>

>>>>> Folsom to Icehouse) with just a single rabbitmq server installed<br>

>>>>>directly on<br>

>>>>> the cloud controller (meaning: no nova-compute). I never have any<br>

>>>>>rabbit<br>

>>>>> issues in that cloud.<br>

>>>>><br>

>>>>> _______________________________________________<br>

>>>>> OpenStack-operators mailing list<br>

>>>>> <a href="mailto:OpenStack-operators@lists.openstack.org">OpenStack-operators@lists.openstack.org</a><br>

>>>>><br>

>>>>><a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators</a><br>

>>>>><br>

>>>><br>

>>>><br>


>>>><br>


>>><br>

>>><br>

>>><br>


>>><br>

>><br>

>><br>


>><br>

><br>

><br>

><br>

>--<br>

>Andrew<br>

>Mirantis<br>

>Ceph community<br>

><br>


<br>

<br>


</div></div></blockquote></div><br></div>