[Openstack-operators] [openstack-dev] [nova] Rabbit-mq 3.4 crashing (anyone else seen this?)
Rochelle Grober
rochelle.grober at huawei.com
Wed Jul 6 21:18:42 UTC 2016
The repository is: http://git.openstack.org/cgit/openstack/osops-tools-contrib/
FYI, there are also: osops-tools-generic, osops-tools-logging, osops-tools-monitoring, osops-example-configs and osops-coda
Wish I could help more,
--Rocky
-----Original Message-----
From: Joshua Harlow [mailto:harlowja at fastmail.com]
Sent: Tuesday, July 05, 2016 10:44 AM
To: Matt Fischer
Cc: openstack-dev at lists.openstack.org; OpenStack Operators
Subject: Re: [openstack-dev] [Openstack-operators] [nova] Rabbit-mq 3.4 crashing (anyone else seen this?)
Ah, those sets of commands sound pretty nice to run periodically.
Sounds like a useful script that could be placed in the ops tools repo
(I forget where this repo lives, but I'm pretty sure it does exist?).
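As a rough idea of what such a script might look like, here is a minimal sketch built only from the rabbitmqctl commands in Matt's notes quoted below. The function names and the byte limit are hypothetical, and it assumes that `rabbitmqctl status` prints a `{mgmt_db,<bytes>}` entry in its memory breakdown, which is worth checking against your own broker's output first.

```shell
#!/bin/sh
# Sketch of a periodic rabbit_mgmt_db memory check (names are hypothetical).

# ASSUMPTION: `rabbitmqctl status` output contains an Erlang term of the
# form {mgmt_db,<bytes>}. Reads status text on stdin and prints the byte
# count, or nothing if no such entry is found.
mgmt_db_bytes() {
    sed -n 's/.*{mgmt_db,\([0-9][0-9]*\)}.*/\1/p' | head -n 1
}

# Succeeds (exit 0) only when the byte count in $1 is non-empty and
# greater than the limit in $2.
over_limit() {
    [ -n "$1" ] && [ "$1" -gt "$2" ]
}

# Stops and starts only the management application; the broker stays up.
bounce_mgmt() {
    rabbitmqctl eval 'application:stop(rabbitmq_management).' &&
    rabbitmqctl eval 'application:start(rabbitmq_management).'
}

# Bounces the management app when mgmt_db memory exceeds $1 bytes.
check_mgmt_db() {
    used=$(rabbitmqctl status | mgmt_db_bytes)
    if over_limit "$used" "$1"; then
        echo "mgmt_db at $used bytes, restarting rabbitmq_management"
        bounce_mgmt
    fi
}
```

Sourcing this from a cron job and calling something like `check_mgmt_db 4294967296` (an arbitrary 4 GiB limit) would be one way to run it periodically. The GC call is left out on purpose, since Matt reports it does not help.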
Some other oddness though is that this issue seems to go away when we
don't run cross-release; do you see that also?
Another hypothesis was that the following fix may be triggering part of
this @ https://bugs.launchpad.net/oslo.messaging/+bug/1495568
The thinking is that if some queues are being set up as auto-delete and
some are being set up with expiry, perhaps the combination causes more
work for the management database (and therefore it eventually falls
behind and falls over).
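One way to check whether a broker really has that mix of queues is to count them from `rabbitmqctl list_queues` output. This is only a sketch: the function name is made up, and it assumes the output is tab-separated, that `auto_delete` prints as `true`/`false`, and that a queue expiry shows up as the string `x-expires` in the arguments column.

```shell
# Reads `rabbitmqctl list_queues name auto_delete arguments` output on
# stdin and counts queues by how they are cleaned up.
# ASSUMPTION: tab-separated columns, auto_delete shown as true/false,
# expiry visible as "x-expires" in the arguments column.
count_queue_cleanup() {
    awk -F '\t' '
        NF < 3 { next }                  # skip banner lines without columns
        $2 == "true"     { auto++ }      # queues declared auto-delete
        $3 ~ /x-expires/ { expiry++ }    # queues declared with an expiry
        END { printf "auto_delete=%d x-expires=%d\n", auto, expiry }
    '
}
```

Usage would be along the lines of `rabbitmqctl list_queues name auto_delete arguments | count_queue_cleanup`. Non-zero counts in both columns would at least confirm that both kinds of queue are present at the same time.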
Matt Fischer wrote:
> Yes! This happens often but I'd not call it a crash, just the mgmt db
> gets behind then eats all the memory. We've started monitoring it and
> have runbooks on how to bounce just the mgmt db. Here are my notes on that:
>
> restart rabbitmq mgmt server - this seems to clear the memory usage.
>
> rabbitmqctl eval 'application:stop(rabbitmq_management).'
> rabbitmqctl eval 'application:start(rabbitmq_management).'
>
> run GC on rabbit_mgmt_db:
> rabbitmqctl eval '(erlang:garbage_collect(global:whereis_name(rabbit_mgmt_db))).'
>
> status of rabbit_mgmt_db:
> rabbitmqctl eval 'sys:get_status(global:whereis_name(rabbit_mgmt_db)).'
>
> how much memory the rabbitmq mgmt DB is using:
> /usr/sbin/rabbitmqctl status | grep mgmt_db
>
> Unfortunately I didn't see that an upgrade would fix it for sure, and any
> settings changes to reduce the number of monitored events also require a
> restart of the cluster. The other issue with an upgrade for us is the
> ancient version of erlang shipped with trusty. When we upgrade to Xenial
> we'll upgrade erlang and rabbit and hope it goes away. I'll also
> probably tweak the settings on retention of events then too.
>
> Also for the record the GC doesn't seem to help at all.
>
> On Jul 5, 2016 11:05 AM, "Joshua Harlow" <harlowja at fastmail.com> wrote:
>
> Hi ops and dev-folks,
>
> We over at godaddy (running rabbitmq with openstack) have been
> hitting an issue that causes `rabbit_mgmt_db` to consume
> nearly all of the process's memory (after a given amount of time).
>
> We've been thinking that this bug (or bugs?) may have existed for a
> while and that our dual-version path (where we upgrade the control
> plane and then slowly/eventually upgrade the compute nodes to the same
> version) has somehow triggered this memory leak, since
> it has happened most prominently on our cloud that was running
> nova-compute at kilo and the other services at liberty (thus using
> the versioned objects code path more frequently due to needing
> translations of objects).
>
> The rabbit we are running is 3.4.0 on CentOS Linux release 7.2.1511
> with kernel 3.10.0-327.4.4.el7.x86_64 (do note that upgrading to
> 3.6.2 seems to make the issue go away),
>
> # rpm -qa | grep rabbit
>
> rabbitmq-server-3.4.0-1.noarch
>
> The logs that seem relevant:
>
> ```
> **********************************************************
> *** Publishers will be blocked until this alarm clears ***
> **********************************************************
>
> =INFO REPORT==== 1-Jul-2016::16:37:46 ===
> accepting AMQP connection <0.23638.342> (127.0.0.1:51932 -> 127.0.0.1:5671)
>
> =INFO REPORT==== 1-Jul-2016::16:37:47 ===
> vm_memory_high_watermark clear. Memory used:29910180640 allowed:47126781542
> ```
>
> This happens quite often, and the crashes have been affecting our cloud
> over the weekend (which made some dev/ops not so happy, especially
> given the July 4th mini-vacation).
>
> Looking to see if anyone else has seen anything similar?
>
> For those interested, this is the upstream bug/mail where I'm also
> trying to get confirmation from the upstream users/devs
> (it also has erlang crash dumps attached/linked):
>
> https://groups.google.com/forum/#!topic/rabbitmq-users/FeBK7iXUcLg
>
> Thanks,
>
> -Josh
>
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>