[Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

Dmitry Mescheryakov dmescheryakov at mirantis.com
Mon Jul 25 13:48:20 UTC 2016


I have filed a bug in oslo.messaging to track the issue [1] and my
colleague Kirill Bespalov posted a fix for it [2].

We have checked the fix and it is working for neutron-server, l3-agent and
dhcp-agent. It does not work for openvswitch-agent and metadata-agent,
meaning those two do not stop their RPC server on shutdown.

But I would expect that the absolute majority of fanout messages comes from the
l3 agent, so we can neglect those two. Does that match your observations?

Thanks,

Dmitry

[1] https://bugs.launchpad.net/oslo.messaging/+bug/1606213
[2] https://review.openstack.org/#/c/346732/

2016-07-25 13:47 GMT+03:00 Dmitry Mescheryakov <dmescheryakov at mirantis.com>:

> Sam,
>
> For your case I would suggest lowering rabbit_transient_queues_ttl until
> you are comfortable with the volume of messages that accumulates during that
> time. Setting the parameter to 1 will essentially replicate the behaviour of
> auto_delete queues. But I would suggest not setting it that low, as
> otherwise your OpenStack will suffer from the original bug. A value of
> around 20 seconds should work in most cases.
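>
> For illustration, that could look like this in the service's config file
> (the option lives in the [oslo_messaging_rabbit] group; 20 is just an
> example value):
>
>     [oslo_messaging_rabbit]
>     # Remove idle reply/fanout queues 20 seconds after their consumer is gone
>     rabbit_transient_queues_ttl = 20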
>
> I think there is room for improvement here: we could delete reply
> and fanout queues on graceful shutdown. But I am not sure it will be
> easy to implement, as it requires services (Nova, Neutron, etc.) to stop
> their RPC server on SIGINT, and I don't know whether they do that right now.
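>
> As a minimal standalone sketch of what that could look like with
> oslo.messaging (illustrative only, not the agents' actual code; the topic,
> server and endpoint names are made up):
>
>     import time
>
>     from oslo_config import cfg
>     import oslo_messaging
>
>
>     class PingEndpoint(object):
>         def ping(self, ctxt, arg):
>             return arg
>
>
>     # Assumes the usual transport/rabbit options are already in cfg.CONF
>     transport = oslo_messaging.get_transport(cfg.CONF)
>     target = oslo_messaging.Target(topic='example_topic', server='example_host')
>     server = oslo_messaging.get_rpc_server(transport, target, [PingEndpoint()],
>                                            executor='blocking')
>
>     server.start()
>     try:
>         while True:
>             time.sleep(1)
>     except KeyboardInterrupt:  # SIGINT
>         pass
>
>     # The part that matters here: stop consuming and wait for in-flight
>     # requests, which is what would let the reply/fanout queues be
>     # cleaned up on shutdown
>     server.stop()
>     server.wait()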
>
> I don't think we can make the SIGKILL case any better. Other than that,
> the issue could be investigated on the Neutron side; maybe the number of
> messages could be reduced there.
>
> Thanks,
>
> Dmitry
>
> 2016-07-25 9:27 GMT+03:00 Sam Morrison <sorrison at gmail.com>:
>
>> We recently upgraded to Liberty and have come across some issues with
>> queue build-ups.
>>
>> This is due to a change in how the rabbit queues are declared: a queue
>> expiry (TTL) is now set instead of marking the queues auto-delete.
>> See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more
>> information.
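>>
>> Roughly, the difference at the AMQP level looks like this (a simplified
>> kombu sketch; the exchange/queue names are made up and oslo.messaging's
>> real declarations are more involved):
>>
>>     from kombu import Exchange, Queue
>>
>>     fanout_exchange = Exchange('example_fanout', type='fanout')
>>
>>     # Old behaviour: the queue is deleted as soon as its last consumer
>>     # disconnects
>>     old_style = Queue('example_fanout.host1', exchange=fanout_exchange,
>>                       auto_delete=True)
>>
>>     # Liberty behaviour: the queue outlives its consumer and RabbitMQ only
>>     # removes it after x-expires milliseconds of being unused (10 minutes)
>>     new_style = Queue('example_fanout.host1', exchange=fanout_exchange,
>>                       queue_arguments={'x-expires': 600 * 1000})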
>>
>> The fix for this bug is in Liberty, and while it does fix that issue it
>> causes another one.
>>
>> Every time you restart something that has a fanout queue (e.g.
>> cinder-scheduler or the neutron agents), you end up with
>> a queue in rabbit that is still bound to the RabbitMQ exchange (and so
>> still receiving messages) but has no consumers.
>>
>> The messages in these queues are basically rubbish and don’t need to
>> exist. Rabbit will delete these queues after 10 minutes (although the
>> default in master has now been changed to 30 minutes).
>>
>> During this time the queues grow and grow with messages. This sets
>> off our Nagios alerts and our ops guys have to deal with something that
>> isn’t really an issue; they basically just delete the queue.
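>>
>> For reference, something along these lines will list the queues in that
>> state (still holding messages but with no consumers; adjust for your
>> vhost):
>>
>>     rabbitmqctl list_queues name messages consumers | awk '$3 == 0 && $2 > 0'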
>>
>> A bad scenario is when you make a change to your cloud that means all
>> your 1000 neutron agents are restarted; this causes a couple of dead queues
>> per agent (port updates and security group updates) to hang around. We get
>> around 25 messages/second on these queues, so you can see that after 10
>> minutes we have a ton of messages sitting in them.
>>
>> 1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes to be precise.
>>
>> Has anyone else been suffering from this? I wanted to check before I raise
>> a bug.
>>
>> Cheers,
>> Sam
>>
>>