[Openstack-operators] Way to check compute <-> rabbitmq connectivity

matt matt at nycresistor.com
Thu Feb 5 20:32:06 UTC 2015


It's certainly a pain to diagnose.

On Thu, Feb 5, 2015 at 3:19 PM, Kris G. Lindgren <klindgren at godaddy.com>
wrote:

> Is Mirantis going to have someone at the ops mid-cycle?  We were talking
> about this in the operators channel today and it seemed like pretty much
> everyone who was active has problems with rabbitmq.  Either from
> maintenance, failovers, or transient issues and having to "restart the world"
> after rabbitmq hiccups to get things to recover.  I am thinking that if the
> issue is relatively prevalent, it would be nice for people who have either
> "figured it out" or have something that is working to discuss their setups.
>  We noticed that Mirantis has a number of patches to oslo.messaging to fix
> rabbitmq-specific stuff.  So I was hoping that someone could come and talk
> about what Mirantis has done there to make it better, whether it's "there
> yet", and if not, what still needs to be done.
>
> We use clustered rabbitmq + LB, and honestly this config on paper is
> "better", but in practice it is nothing short of a nightmare.  Any maintenance
> done on rabbitmq (restarts, patching, etc.) or the load balancer seems to
> cause clients to not notice that they are no longer correctly connected to
> the rabbitmq server, and they will sit happily, doing nothing, until they
> are restarted. We had similar problems listing all of the rabbitmq servers
> out in the configuration as well.  So far my experience has been that any
> maintenance that touches rabbitmq is going to require a restart of all
> services that communicate over RPC, to avoid hard-to-troubleshoot (i.e.
> silent) RPC errors.
>
> In my experience rabbitmq is pretty much the #1 cause of issues in our
> environment and I think other operators would agree with that as well.
> Anything that would make rabbit + openstack more stable would be very
> welcome.
>
> ____________________________________________
>
> Kris Lindgren
> Senior Linux Systems Engineer
> GoDaddy, LLC.
>
>
> On 1/20/15, 8:33 AM, "Andrew Woodward" <xarses at gmail.com> wrote:
>
> >So this is exactly what we (@mirantis) ran into while working on the
> >HA story in Fuel / Mirantis OpenStack.
> >
> >The short message is that without heartbeat keepalives, rabbit is unable to
> >properly keep track of partially open connections, resulting in consumers
> >(not senders) believing that they have a live connection to rabbit
> >when in fact they don't.
> >
> >Summary of the parts needed for rabbit HA (a rough config sketch follows
> >below):
> >* rabbit heartbeats (https://review.openstack.org/#/c/146047/) - the
> >oslo.messaging team is working to merge this and is well aware it's a
> >critical need for rabbit HA.
> >* rabbit_hosts with a list of all rabbit nodes (haproxy should be
> >avoided, except for services that don't support rabbit_hosts as a list of
> >servers; there is further work needed to make haproxy behave properly in
> >HA)
> >* consumer_cancel_notify (CCN)
> >* rabbit greater than 3.3.0
> >
> >Optional:
> >* rip failed nodes out of the mnesia db. We found that rabbit node-down
> >discovery was slower than we wanted (minutes), and we can force an
> >election sooner by ripping the failed node out of mnesia (in our case
> >Pacemaker tells us the node is down); we have a master/slave-type
> >mechanism in our pacemaker script to perform this.
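> >
> >(A rough sketch of what that can look like in nova.conf, with Icehouse/Juno-era
> >option names assumed; the hostnames and values are placeholders, and the
> >heartbeat knob only exists once the patch above lands:)
> >
> >[DEFAULT]
> ># List every cluster member directly; no haproxy/VIP in front of rabbit.
> >rabbit_hosts = rabbit1:5672,rabbit2:5672,rabbit3:5672
> ># Mirrored (HA) queues; with rabbit >= 3.0 the mirror policy itself is set
> ># broker-side via rabbitmqctl set_policy.
> >rabbit_ha_queues = true
> ># Hypothetical knob from the pending heartbeat patch; name/default may change.
> ># heartbeat_timeout_threshold = 60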
> >
> >The long message on rabbit connections.
> >
> >Through a quite long process we found that, due to the way rabbit uses
> >connections from Erlang, it won't close connections; instead rabbit
> >(can) send a consumer cancel notification. The consumer, upon receiving
> >this message, is supposed to hang up and reconnect. Otherwise the
> >connection is only reaped by the Linux kernel when the TCP connection
> >timeout is reached (2 hours). Publishers pick this up the next
> >time they attempt to send a message to the queue (because it's not
> >acknowledged) and tend to hang up and reconnect on their own.
> >
> >What you will observe after removing a rabbit node is that on a compute
> >node roughly 1/3 of the rabbit connections re-establish to the remaining
> >rabbit node(s), while the others leave sockets open to the down server
> >(visible with netstat, strace, lsof).
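> >
> >(For instance, a quick way to spot those stale sockets on a compute node;
> >the removed broker's IP here is a placeholder:)
> >
> ># Count nova-compute sockets still pointing at the dead broker's AMQP port.
> >DEAD_RABBIT=10.0.0.12   # placeholder: IP of the rabbit node that was removed
> >netstat -tnp 2>/dev/null | grep ":5672" | grep "$DEAD_RABBIT" | grep -c nova-compute
> ># Anything non-zero is a connection that will sit idle until the TCP timeout hits.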
> >
> >Fixes that don't work well:
> >* turning down TCP timeouts (LD_PRELOAD or system-wide). While it shortens
> >the 2-hour recovery, setting it lower than 15 minutes leads
> >to frequent false disconnects and tends towards bad behavior
> >* rabbit behind haproxy. This further masks the partial-connection
> >problem. Although we stopped using it, it might be better now with
> >heartbeats enabled.
> >* a script to check for partial connections on the rabbit server and
> >forcibly close them (a rough sketch follows). A partial solution that
> >actually gets the job done the best, besides heartbeats, but it sometimes
> >killed innocent connections for us.
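> >
> >(That last workaround looked roughly like the sketch below; rabbitmqctl
> >close_connection is a real command, but the selection logic here is made up
> >for illustration and, as noted, can hit healthy connections:)
> >
> ># Force-close every broker-side connection coming from a suspect compute
> ># host so its services reconnect; SUSPECT is a placeholder IP.
> >SUSPECT=10.0.0.42
> >rabbitmqctl list_connections pid peer_host | awk -v h="$SUSPECT" '$2 == h {print $1}' | \
> >  while read pid; do
> >    rabbitmqctl close_connection "$pid" "stale-connection sweep"
> >  done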
> >
> >Heartbeats fix this by running a ping/ack on a separate channel and
> >thread. This gives the consumer a response from rabbit
> >that ensures the connections have not gone away via stale
> >sockets. When combined with CCN, it works in multiple failure
> >conditions as expected, and the rabbit consumers can be healthy within 1
> >minute.
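> >
> >(At the client level the idea is just this, shown with pika since that is
> >what the check script below uses; the kwarg is heartbeat_interval on the
> >pika releases of this era and heartbeat on newer ones, and 30s is an
> >arbitrary value:)
> >
> >import pika
> >
> ># With heartbeats on, a dead socket is noticed within a couple of missed
> ># intervals instead of waiting ~2 hours for the kernel TCP timeout.
> ># (heartbeat_interval on 0.9/0.10-era pika; newer releases call it 'heartbeat')
> >params = pika.ConnectionParameters(host="rabbit1", heartbeat_interval=30)
> >connection = pika.BlockingConnection(params)
> >channel = connection.channel()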
> >
> >
> >On Mon, Jan 19, 2015 at 2:55 PM, Gustavo Randich
> ><gustavo.randich at gmail.com> wrote:
> >> In the meantime, I'm using this horrendous script inside compute nodes
> >> to check for rabbitmq connectivity. It uses the 'set_host_enabled' rpc
> >> call, which in my case is innocuous.
> >
> >This will still result in partial connections if you don't do CCN
> >
> >>
> >> #!/bin/bash
> >> # Publish a harmless 'set_host_enabled' RPC to this compute's queue, then
> >> # verify nova-compute consumed it (the request id shows up in its log).
> >> UUID=$(cat /proc/sys/kernel/random/uuid)
> >> RABBIT=$(grep -Po '(?<=rabbit_host = ).+' /etc/nova/nova.conf)
> >> HOSTX=$(hostname)
> >> python -c "
> >> import pika
> >> connection = pika.BlockingConnection(pika.ConnectionParameters(\"$RABBIT\"))
> >> channel = connection.channel()
> >> channel.basic_publish(exchange='nova', routing_key=\"compute.$HOSTX\",
> >>     properties=pika.BasicProperties(content_type = 'application/json'),
> >>     body = '{ \"version\": \"3.0\", \"_context_request_id\": \"$UUID\", \\
> >>       \"_context_roles\": [\"KeystoneAdmin\", \"KeystoneServiceAdmin\", \"admin\"], \\
> >>       \"_context_user_id\": \"XXX\", \\
> >>       \"_context_project_id\": \"XXX\", \\
> >>       \"method\": \"set_host_enabled\", \\
> >>       \"args\": {\"enabled\": true} \\
> >>     }'
> >> )
> >> connection.close()"
> >> sleep 2
> >> tail -1000 /var/log/nova/nova-compute.log | grep -q $UUID || { echo \
> >>     "WARNING: nova-compute not consuming RabbitMQ messages. Last message: $UUID"; exit 1; }
> >> echo "OK"
> >>
> >>
> >> On Thu, Jan 15, 2015 at 9:48 PM, Sam Morrison <sorrison at gmail.com>
> >>wrote:
> >>>
> >>> We've had a lot of issues with Icehouse related to RabbitMQ. Basically
> >>> the change from openstack.rpc to oslo.messaging broke things. These things
> >>> are now fixed in oslo.messaging version 1.5.1; there is still an issue with
> >>> heartbeats, and that patch is making its way through the review process
> >>> now.
> >>>
> >>> https://review.openstack.org/#/c/146047/
> >>>
> >>> Cheers,
> >>> Sam
> >>>
> >>>
> >>> On 16 Jan 2015, at 10:55 am, sridhar basam <sridhar.basam at gmail.com>
> >>> wrote:
> >>>
> >>>
> >>> If you are using HA queues, use a version of rabbitmq > 3.3.0. There was
> >>> a change in that version where consumption on queues was automatically
> >>> enabled when a master election for a queue happened. Previous versions only
> >>> informed clients that they had to reconsume on a queue; it was the client's
> >>> responsibility to start consumption on a queue.
> >>>
> >>> Make sure you set TCP keepalives to a low enough value in case you have
> >>> a firewall device in between your rabbit server and its consumers (example
> >>> sysctls below).
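> >>>
> >>> (On Linux that means something in this ballpark, as a sketch; the right
> >>> values depend on the firewall's idle timeout, and they only help for
> >>> sockets that actually set SO_KEEPALIVE:)
> >>>
> >>> # Probe after 60s idle, every 10s, give up after 6 missed probes.
> >>> sysctl -w net.ipv4.tcp_keepalive_time=60
> >>> sysctl -w net.ipv4.tcp_keepalive_intvl=10
> >>> sysctl -w net.ipv4.tcp_keepalive_probes=6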
> >>>
> >>> Monitor consumers on your rabbit infrastructure using 'rabbitmqctl
> >>> list_queues name messages consumers'. The number of consumers on fanout
> >>> queues is going to depend on the number of services of each type you have
> >>> in your environment.
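> >>>
> >>> (A crude alert along those lines, as an illustration only; the fanout
> >>> filter and the zero-consumer threshold are placeholders:)
> >>>
> >>> # Flag any non-fanout queue that has messages backing up but no consumers.
> >>> rabbitmqctl list_queues name messages consumers | \
> >>>   awk '$2 > 0 && $3 == 0 && $1 !~ /_fanout_/ {print "no consumers on " $1}'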
> >>>
> >>> Sri
> >>>
> >>> On Jan 15, 2015 6:27 PM, "Michael Dorman" <mdorman at godaddy.com> wrote:
> >>>>
> >>>> Here is the bug I've been tracking related to this for a while.  I
> >>>> haven't really kept up to speed with it, so I don't know the current
> >>>> status.
> >>>>
> >>>> https://bugs.launchpad.net/nova/+bug/856764
> >>>>
> >>>>
> >>>> From: Kris Lindgren <klindgren at godaddy.com>
> >>>> Date: Thursday, January 15, 2015 at 12:10 PM
> >>>> To: Gustavo Randich <gustavo.randich at gmail.com>, OpenStack Operators
> >>>> <openstack-operators at lists.openstack.org>
> >>>> Subject: Re: [Openstack-operators] Way to check compute <-> rabbitmq
> >>>> connectivity
> >>>>
> >>>> During the Atlanta ops meeting this topic came up, and I specifically
> >>>> mentioned adding a "no-op" or healthcheck ping to the rabbitmq layer
> >>>> in both nova & neutron.  The devs in the room looked at me like I was
> >>>> crazy, but it was so that we could catch exactly the issues you
> >>>> described.  I am also interested if anyone knows of a lightweight call
> >>>> that could be used to verify/confirm rabbitmq connectivity as well.  I
> >>>> haven't been able to devote time to dig into it.  Mainly because if one
> >>>> client is having issues, you will notice other clients are having
> >>>> similar/silent errors, and a restart of all the things is the easiest
> >>>> way to fix it, for us at least.
> >>>> ____________________________________________
> >>>>
> >>>> Kris Lindgren
> >>>> Senior Linux Systems Engineer
> >>>> GoDaddy, LLC.
> >>>>
> >>>>
> >>>> From: Gustavo Randich <gustavo.randich at gmail.com>
> >>>> Date: Thursday, January 15, 2015 at 11:53 AM
> >>>> To: "openstack-operators at lists.openstack.org"
> >>>> <openstack-operators at lists.openstack.org>
> >>>> Subject: Re: [Openstack-operators] Way to check compute <-> rabbitmq
> >>>> connectivity
> >>>>
> >>>> Just to add one more background scenario: we also had similar problems
> >>>> trying to load balance rabbitmq via an F5 Big-IP LTM. For that reason we
> >>>> don't use it now. Our installation is a single rabbitmq instance with no
> >>>> intermediaries (apart from network switches). We use Folsom and Icehouse,
> >>>> with the problem being perceived more on Icehouse nodes.
> >>>>
> >>>> We are already monitoring message queue size, but we would like to
> >>>> pinpoint in semi-realtime the specific hosts/racks/network paths
> >>>> experiencing the "stale connection" before a user complains about an
> >>>> operation being stuck, or even hosts with no such pending operations but
> >>>> already "disconnected" -- that way we could also diagnose possible
> >>>> network causes and avoid massive service restarting.
> >>>>
> >>>> So, for now, if someone knows of a cheap and quick OpenStack operation
> >>>> that triggers a message interchange between rabbitmq and nova-compute,
> >>>> and a way of checking the result, that would be great.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Jan 15, 2015 at 1:45 PM, Kris G. Lindgren
> >>>><klindgren at godaddy.com>
> >>>> wrote:
> >>>>>
> >>>>> We did have an issue using celery on an internal application that we
> >>>>> wrote, but I believe it was fixed after much failover testing and code
> >>>>> changes.  We also use logstash via rabbitmq and haven't noticed any
> >>>>> issues there either.
> >>>>>
> >>>>> So this seems to be just openstack/oslo related.
> >>>>>
> >>>>> We have tried a number of different configurations - all of them had
> >>>>> their issues.  We started out listing all the members of the cluster on
> >>>>> the rabbit_hosts line.  This worked most of the time without issue, until
> >>>>> we would restart one of the servers; then it seemed like the clients
> >>>>> wouldn't figure out they were disconnected and reconnect to the next
> >>>>> host.
> >>>>>
> >>>>> In an attempt to solve that, we moved to using haproxy to present a VIP
> >>>>> that we configured in the rabbit_hosts line.  This created issues with
> >>>>> long-lived connections disconnecting, and a bunch of other issues.  In
> >>>>> our production environment we moved to load-balanced rabbitmq, but using
> >>>>> a real load balancer, and we don't have the weird disconnect issues.
> >>>>> However, anytime we reboot/take down a rabbitmq host or pull a member
> >>>>> from the cluster we have issues, and if there is a network disruption we
> >>>>> also have issues.
> >>>>>
> >>>>> I'm thinking the best course of action is to move rabbitmq off onto its
> >>>>> own box and to leave it alone.
> >>>>>
> >>>>> Does anyone have a rabbitmq setup that works well and doesn't have
> >>>>> random issues when pulling nodes for maintenance?
> >>>>> ____________________________________________
> >>>>>
> >>>>> Kris Lindgren
> >>>>> Senior Linux Systems Engineer
> >>>>> GoDaddy, LLC.
> >>>>>
> >>>>>
> >>>>> From: Joe Topjian <joe at topjian.net>
> >>>>> Date: Thursday, January 15, 2015 at 9:29 AM
> >>>>> To: "Kris G. Lindgren" <klindgren at godaddy.com>
> >>>>> Cc: "openstack-operators at lists.openstack.org"
> >>>>> <openstack-operators at lists.openstack.org>
> >>>>> Subject: Re: [Openstack-operators] Way to check compute <-> rabbitmq
> >>>>> connectivity
> >>>>>
> >>>>> Hi Kris,
> >>>>>
> >>>>>>  Our experience is pretty much the same on anything that is using
> >>>>>> rabbitmq - not just nova-compute.
> >>>>>
> >>>>>
> >>>>> Just to clarify: have you experienced this outside of OpenStack (or
> >>>>> Oslo)?
> >>>>>
> >>>>> We've seen similar issues with rabbitmq and OpenStack. We used to run
> >>>>> rabbit through haproxy and tried a myriad of options like setting no
> >>>>> timeouts, very very long timeouts, etc, but would always eventually
> >>>>>see
> >>>>> similar issues as described.
> >>>>>
> >>>>> Last month, we reconfigured all OpenStack components to use the
> >>>>> `rabbit_hosts` option with all nodes in our cluster listed. So far
> >>>>>this has
> >>>>> worked well, though I probably just jinxed myself. :)
> >>>>>
> >>>>> We still have other services (like Sensu) using the same rabbitmq
> >>>>> cluster and accessing it through haproxy. We've never had any issues
> >>>>>there.
> >>>>>
> >>>>> What's also strange is that I have another OpenStack deployment (from
> >>>>> Folsom to Icehouse) with just a single rabbitmq server installed
> >>>>>directly on
> >>>>> the cloud controller (meaning: no nova-compute). I never have any
> >>>>>rabbit
> >>>>> issues in that cloud.
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
> >
> >--
> >Andrew
> >Mirantis
> >Ceph community
> >
>
>
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>


More information about the OpenStack-operators mailing list