[Openstack-operators] [oslo] RabbitMQ queue TTL issues moving to Liberty

Davanum Srinivas davanum at gmail.com
Thu Jul 28 12:31:48 UTC 2016


Dima, Kevin,

There are PreStop hooks that can be used to gracefully bring down
stuff running in containers:
http://kubernetes.io/docs/user-guide/container-environment/
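
For example, a minimal (illustrative and untested) fragment of a pod spec
that runs a cleanup command before the container is stopped could look like
this (the script path is a made-up placeholder):

    lifecycle:
      preStop:
        exec:
          command: ["/usr/local/bin/stop-openstack-service.sh"]

The kubelet runs the preStop hook and then sends SIGTERM to the container,
so the hook (or the service's own signal handler) gets a chance to finish
cleanup within the termination grace period before the container is removed.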

-- Dims

On Thu, Jul 28, 2016 at 8:22 AM, Dmitry Mescheryakov
<dmescheryakov at mirantis.com> wrote:
>
> 2016-07-26 21:20 GMT+03:00 Fox, Kevin M <Kevin.Fox at pnnl.gov>:
>>
>> It only relates to Kubernetes in that Kubernetes can do automatic rolling
>> upgrades by destroying/replacing a service. If the services don't clean up
>> after themselves, then performing a rolling upgrade will break things.
>>
>> So, what do you think is the best approach to ensuring all the services
>> shut things down properly? Seems like it's a cross-project issue? Should a
>> spec be submitted?
>
>
> I think that it would be fair if Kubernetes sent a SIGTERM to the OpenStack
> service in a container, then waited for the service to shut down, and only
> then destroyed the container.
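>
> As a rough, untested sketch (the endpoint and topic names below are made
> up, and the exact oslo.messaging calls may vary between releases), the
> service side of that contract could look something like:
>
>     import signal
>
>     from oslo_config import cfg
>     import oslo_messaging as messaging
>
>     class PingEndpoint(object):
>         def ping(self, ctxt):
>             return 'pong'
>
>     # assumes the usual transport/config setup has already been done
>     transport = messaging.get_transport(cfg.CONF)
>     target = messaging.Target(topic='demo-topic', server='demo-host')
>     server = messaging.get_rpc_server(transport, target, [PingEndpoint()])
>     server.start()
>
>     def _on_sigterm(signum, frame):
>         # stop consuming and wait for in-flight requests; with the fanout
>         # cleanup change discussed in this thread, this is also where the
>         # service's transient queues would get removed
>         server.stop()
>         server.wait()
>
>     signal.signal(signal.SIGTERM, _on_sigterm)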
>
> It might not be very important for our case though, if we agree to split the
> expiration time for fanout and reply queues. And I don't know of any other
> case where an OpenStack service needs to clean up something in an external
> place on shutdown.
>
> Thanks,
>
> Dmitry
>
>>
>> Thanks,
>> Kevin
>> ________________________________
>> From: Dmitry Mescheryakov [dmescheryakov at mirantis.com]
>> Sent: Tuesday, July 26, 2016 11:01 AM
>> To: Fox, Kevin M
>> Cc: Sam Morrison; OpenStack Operators
>>
>> Subject: Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues moving
>> to Liberty
>>
>>
>>
>> 2016-07-25 18:47 GMT+03:00 Fox, Kevin M <Kevin.Fox at pnnl.gov>:
>>>
>>> Ah. Interesting.
>>>
>>> The graceful shutdown would really help the Kubernetes situation too.
>>> Kubernetes can do easy rolling upgrades, and having the processes be able
>>> to clean up after themselves as they are upgraded is important. Is this
>>> something that needs to go into oslo.messaging, or does it have to be added
>>> to all projects using it?
>>
>>
>> It needs to be fixed both on the oslo.messaging side (delete the fanout
>> queue on RPC server stop, which is done by Kirill's CR) and on the side of
>> the projects using it, as they need to actually stop the RPC server before
>> shutting down. As I wrote earlier, among the Neutron processes right now
>> only the openvswitch and metadata agents do not stop the RPC server.
>>
>> I am not sure how that relates to Kubernetes, as I am not very familiar
>> with it.
>>
>> Thanks,
>>
>> Dmitry
>>
>>>
>>>
>>> Thanks,
>>> Kevin
>>> ________________________________
>>> From: Dmitry Mescheryakov [dmescheryakov at mirantis.com]
>>> Sent: Monday, July 25, 2016 3:47 AM
>>> To: Sam Morrison
>>> Cc: OpenStack Operators
>>> Subject: Re: [Openstack-operators] [oslo] RabbitMQ queue TTL issues
>>> moving to Liberty
>>>
>>> Sam,
>>>
>>> For your case I would suggest lowering rabbit_transient_queues_ttl until
>>> you are comfortable with the volume of messages that accumulates during
>>> that time. Setting the parameter to 1 will essentially replicate the
>>> behaviour of auto_delete queues, but I would suggest not setting it that
>>> low, as otherwise your OpenStack will suffer from the original bug. A value
>>> like 20 seconds should probably work in most cases.
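>>>
>>> For reference, that option lives in the [oslo_messaging_rabbit] section of
>>> the service's config file (e.g. neutron.conf), so the change would be
>>> something like:
>>>
>>>     [oslo_messaging_rabbit]
>>>     # lifetime, in seconds, of unused reply/fanout queues (Liberty default: 600)
>>>     rabbit_transient_queues_ttl = 20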
>>>
>>> I think there is space for improvement here - we can delete reply and
>>> fanout queues on graceful shutdown. But I am not sure it will be easy to
>>> implement, as it requires services (Nova, Neutron, etc.) to stop the RPC
>>> server on SIGINT, and I don't know if they do that right now.
>>>
>>> I don't think we can make the SIGKILL case any better. Other than that,
>>> the issue could be investigated on the Neutron side - maybe the number of
>>> messages could be reduced there.
>>>
>>> Thanks,
>>>
>>> Dmitry
>>>
>>> 2016-07-25 9:27 GMT+03:00 Sam Morrison <sorrison at gmail.com>:
>>>>
>>>> We recently upgraded to Liberty and have come across some issues with
>>>> queue build-ups.
>>>>
>>>> This is due to a change in the RabbitMQ driver that sets a queue expiry
>>>> (TTL) instead of marking queues as auto-delete.
>>>> See https://bugs.launchpad.net/oslo.messaging/+bug/1515278 for more
>>>> information.
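>>>>
>>>> Roughly speaking (an illustrative kombu-level sketch rather than the
>>>> actual oslo.messaging code, and the queue name is made up), the old
>>>> behaviour amounted to declaring
>>>>
>>>>     from kombu import Exchange, Queue
>>>>
>>>>     Queue('reply_abc123', Exchange('reply_abc123'), auto_delete=True)
>>>>
>>>> whereas the new behaviour amounts to something like
>>>>
>>>>     Queue('reply_abc123', Exchange('reply_abc123'),
>>>>           queue_arguments={'x-expires': 600 * 1000})  # milliseconds
>>>>
>>>> so an unused queue now lingers until the TTL fires instead of being
>>>> removed as soon as its last consumer disconnects.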
>>>>
>>>> The fix for this bug is in Liberty, and while it does fix that issue, it
>>>> causes another one.
>>>>
>>>> Every time you restart something that has a fanout queue (e.g.
>>>> cinder-scheduler or the neutron agents) you are left with a queue in
>>>> RabbitMQ that is still bound to the exchange (and so still receiving
>>>> messages) but has no consumers.
>>>>
>>>> The messages in these queues are basically rubbish and don't need to
>>>> exist. RabbitMQ will delete these queues after 10 minutes (although the
>>>> default in master has now been changed to 30 minutes).
>>>>
>>>> During this time the queue grows and grows with messages. This sets off
>>>> our Nagios alerts, and our ops guys have to deal with something that isn't
>>>> really an issue - they basically just delete the queue.
>>>>
>>>> A bad scenario is when you make a change to your cloud that causes all
>>>> 1000 of your neutron agents to restart; this leaves a couple of dead queues
>>>> per agent hanging around (port updates and security group updates). We get
>>>> around 25 messages / second on these queues, so you can see that after 10
>>>> minutes we have a ton of messages in them.
>>>>
>>>> 1000 x 2 x 25 x 600 = 30,000,000 messages in 10 minutes to be precise.
>>>>
>>>> Has anyone else been suffering from this, before I raise a bug?
>>>>
>>>> Cheers,
>>>> Sam
>>>>
>>>>
>>>> _______________________________________________
>>>> OpenStack-operators mailing list
>>>> OpenStack-operators at lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>
>>>
>>
>
>
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>



-- 
Davanum Srinivas :: https://twitter.com/dims


