Neutron RabbitMQ issues

Satish Patel satish.txt at gmail.com
Thu Mar 19 18:53:22 UTC 2020


I have a question about the following setting: why are you disabling the
heartbeat timeout?

heartbeat_timeout_threshold = 0

On Thu, Mar 19, 2020 at 1:32 PM Satish Patel <satish.txt at gmail.com> wrote:

> Great, thanks! Did you guys tune your nova component for RabbitMQ?
>
> On Thu, Mar 19, 2020 at 1:26 PM Grant Morley <grant at civo.com> wrote:
>
>> We left ours on the default value of 1 and that still seems to be fine.
>>
>> Grant
>> On 19/03/2020 17:13, Satish Patel wrote:
>>
>> How about rpc_workers?
>>
>> Currently I have rpc_workers = 1
>>
>> On Thu, Mar 19, 2020 at 1:02 PM Grant Morley <grant at civo.com> wrote:
>>
>>> Correct, you need to add:
>>>
>>> heartbeat_timeout_threshold = 0
>>> rpc_conn_pool_size = 300
>>> rpc_thread_pool_size = 2048
>>> rpc_response_timeout = 3600
>>> rpc_poll_timeout = 60
>>>
>>> to your Neutron nodes.
>>>
>>> And you can add:
>>>
>>> executor_thread_pool_size = 64
>>> rpc_response_timeout = 3600
>>>
>>> to your compute nodes (neutron.conf). However, I found that just adding
>>> the changes to the neutron servers really helped.
>>>
>>> I would recommend starting with your neutron nodes first to see if
>>> that helps. If you find your compute nodes are still having issues,
>>> then change the settings on those afterwards.
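>>>
>>> For what it's worth, here is a rough sketch of where I believe those
>>> options go in neutron.conf on Queens (double-check against your
>>> oslo.messaging version; I am not certain which sections
>>> rpc_thread_pool_size and rpc_poll_timeout belong in):
>>>
>>> [DEFAULT]
>>> rpc_conn_pool_size = 300
>>> rpc_response_timeout = 3600
>>> executor_thread_pool_size = 64
>>>
>>> [oslo_messaging_rabbit]
>>> heartbeat_timeout_threshold = 0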
>>>
>>> Regards,
>>> On 19/03/2020 16:53, Satish Patel wrote:
>>>
>>> I am running openstack-ansible (both Queens and Stein), so this is what I
>>> am going to do, with a rough OSA override sketch below. Am I doing it correctly?
>>>
>>> neutron-server (container) - I have 3 neutron nodes:
>>> heartbeat_timeout_threshold = 0
>>> rpc_conn_pool_size = 300
>>> rpc_thread_pool_size = 2048
>>> rpc_response_timeout = 3600
>>> rpc_poll_timeout = 60
>>>
>>> 330 compute nodes (agent neutron.conf) - going to add the following:
>>> executor_thread_pool_size = 64
>>> rpc_response_timeout = 3600
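>>>
>>> In openstack-ansible terms I am planning to push these in via config
>>> overrides in /etc/openstack_deploy/user_variables.yml, something like the
>>> sketch below (the override variable name is from memory, so please
>>> correct me if it is wrong):
>>>
>>> neutron_neutron_conf_overrides:
>>>   DEFAULT:
>>>     rpc_conn_pool_size: 300
>>>     rpc_response_timeout: 3600
>>>   oslo_messaging_rabbit:
>>>     heartbeat_timeout_threshold: 0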
>>>
>>>
>>>
>>> How about nova? Should I be doing that on nova as well to reduce the load
>>> on RabbitMQ?
>>>
>>>
>>> On Thu, Mar 19, 2020 at 12:35 PM Grant Morley <grant at civo.com> wrote:
>>>
>>>> Hi Satish,
>>>>
>>>> You will need to add those to the "neutron.conf" file on your network
>>>> nodes. If you are running OS-A, I would do it on your "neutron-server" nodes
>>>> and add the following to your agent containers:
>>>>
>>>> executor_thread_pool_size = 64
>>>> rpc_response_timeout = 3600
>>>>
>>>> Regards,
>>>> On 19/03/2020 16:27, Satish Patel wrote:
>>>>
>>>> Erik,
>>>>
>>>> If I want to adopt the following settings, where should I add them in
>>>> Queens OpenStack: on neutron-server or on all my compute nodes? Which
>>>> setting will go where?
>>>>
>>>> heartbeat_timeout_threshold = 0
>>>> rpc_conn_pool_size = 300
>>>> rpc_thread_pool_size = 2048
>>>> rpc_response_timeout = 3600
>>>> rpc_poll_timeout = 60
>>>>
>>>> ## Rpc all
>>>> executor_thread_pool_size = 64
>>>> rpc_response_timeout = 3600
>>>>
>>>> On Wed, Mar 11, 2020 at 9:05 PM Erik Olof Gunnar Andersson <eandersson at blizzard.com> wrote:
>>>>
>>>> We are hitting something awfully similar.
>>>>
>>>> We have basically been hitting a few pretty serious bugs with RabbitMQ.
>>>>
>>>> The main one is that when a RabbitMQ server crashes or gets split-brained (or even when just one node is restarted), it does not always recover. We sometimes end up with orphaned consumers that keep consuming messages, but those messages effectively go to /dev/null. Another issue is that sometimes bindings stop working: they are visibly there, but simply do not route traffic to the intended queues.
>>>>
>>>> e.g. https://github.com/rabbitmq/rabbitmq-server/issues/641
>>>>
>>>> I wrote two quick scripts to audit these issues:
>>>> http://paste.openstack.org/show/790569/ - check if you have orphaned consumers (may need pagination if you have a large deployment).
>>>> http://paste.openstack.org/show/790570/ - check if the bindings are bad for a specific queue.
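>>>>
>>>> In case those pastes expire, here is a rough sketch along the same lines
>>>> (not the exact scripts) that uses the RabbitMQ management API to flag
>>>> queues with no consumers and consumers whose channel details look empty.
>>>> The endpoint, credentials and vhost below are placeholders:
>>>>
>>>> import requests
>>>>
>>>> API = "http://rabbit1:15672/api"  # management plugin endpoint (placeholder)
>>>> AUTH = ("openstack", "secret")    # placeholder credentials
>>>> VHOST = "%2F"                     # URL-encoded vhost ("/" in this example)
>>>>
>>>> def get(path):
>>>>     resp = requests.get("%s/%s" % (API, path), auth=AUTH)
>>>>     resp.raise_for_status()
>>>>     return resp.json()
>>>>
>>>> # Queues that are accumulating messages but have no consumers at all.
>>>> for q in get("queues/%s" % VHOST):
>>>>     if q.get("messages", 0) > 0 and q.get("consumers", 0) == 0:
>>>>         print("no consumers: %s (%d messages)" % (q["name"], q["messages"]))
>>>>
>>>> # Consumers whose channel details are empty - one symptom of the
>>>> # "orphaned consumer" state described above (a heuristic, not definitive).
>>>> for c in get("consumers/%s" % VHOST):
>>>>     if not c.get("channel_details"):
>>>>         print("possibly orphaned consumer on queue: %s"
>>>>               % c.get("queue", {}).get("name"))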
>>>>
>>>> The main issue seems to be that the sheer number of queues + connections causes the recovery after restarting a node to leave bindings and/or queues in an "orphaned" state.
>>>>
>>>> Best Regards, Erik Olof Gunnar Andersson
>>>>
>>>> -----Original Message-----
>>>> From: Satish Patel <satish.txt at gmail.com>
>>>> Sent: Wednesday, March 11, 2020 5:14 PM
>>>> To: Grant Morley <grant at civo.com>
>>>> Cc: openstack-discuss at lists.openstack.org
>>>> Subject: Re: Neutron RabbitMQ issues
>>>>
>>>> I am also dealing with some sort of RabbitMQ performance issue, but it's not as bad as yours.
>>>>
>>>> This is my favorite video; not sure whether you have seen it before, but posting it here anyway: https://www.youtube.com/watch?v=bpmgxrPOrZw
>>>>
>>>> On Wed, Mar 11, 2020 at 10:24 AM Grant Morley <grant at civo.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> We are currently experiencing some fairly major issues with our
>>>> OpenStack cluster. It all appears to be with Neutron and RabbitMQ. We
>>>> are seeing a lot of timeout messages in responses to replies, and
>>>> because of this, instance creation and anything to do with instances and
>>>> networking is broken.
>>>>
>>>> We are running OpenStack Queens.
>>>>
>>>> We have already tuned Rabbit for Neutron by doing the following on neutron:
>>>>
>>>> heartbeat_timeout_threshold = 0
>>>> rpc_conn_pool_size = 300
>>>> rpc_thread_pool_size = 2048
>>>> rpc_response_timeout = 3600
>>>> rpc_poll_timeout = 60
>>>>
>>>> ## Rpc all
>>>> executor_thread_pool_size = 64
>>>> rpc_response_timeout = 3600
>>>>
>>>> What we are seeing in the error logs for neutron for all services
>>>> (l3-agent, dhcp, linux-bridge, etc.) are these timeouts:
>>>> https://pastebin.com/Fjh23A5a
>>>>
>>>> We have manually tried to get everything in sync by forcing a failover
>>>> of the networking, which seems to get the routers back in sync.
>>>>
>>>> We are also seeing that there are a lot of "unacknowledged" messages
>>>> in RabbitMQ for 'q-plugin' in the neutron queues.
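>>>>
>>>> A quick way to keep an eye on that backlog, assuming you can run
>>>> rabbitmqctl on the cluster, is something along the lines of:
>>>>
>>>> rabbitmqctl list_queues name consumers messages_unacknowledged | grep q-plugin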
>>>>
>>>> Sometimes restarting the services on neutron gets these acknowledged
>>>> again; however, the timeouts come back.
>>>>
>>>> The RabbitMQ servers themselves are not loaded at all. All memory,
>>>> file descriptors and Erlang processes have plenty of resources available.
>>>>
>>>> We are also seeing a lot of RPC issues:
>>>>
>>>> Timeout in RPC method release_dhcp_port. Waiting for 1523 seconds
>>>> before next attempt. If the server is not down, consider increasing
>>>> the rpc_response_timeout option as Neutron server(s) may be overloaded
>>>> and unable to respond quickly enough.: MessagingTimeout: Timed out
>>>> waiting for a reply to message ID 965fa44ab4f6462fa378a1cf7259aad4
>>>> 2020-03-10 19:02:33.548 16242 ERROR neutron.common.rpc
>>>> [req-a858afbb-5083-4e21-a309-6ee53582c4d9 - - - - -] Timeout in RPC
>>>> method release_dhcp_port. Waiting for 3347 seconds before next attempt.
>>>> If the server is not down, consider increasing the
>>>> rpc_response_timeout option as Neutron server(s) may be overloaded and
>>>> unable to respond quickly enough.: MessagingTimeout: Timed out waiting
>>>> for a reply to message ID 7937465f15634fbfa443fe1758a12a9c
>>>>
>>>> Does anyone know if there is any more tuning to be done at all?
>>>> Upgrading to a newer version isn't really an option for us at the
>>>> moment, unfortunately.
>>>>
>>>> Because of our setup, we also have roughly 800 routers enabled, and I
>>>> know that will be putting a load on the system. However, these problems
>>>> only started roughly a week ago and have steadily got worse.
>>>>
>>>> If anyone has seen a similar case or has any more recommendations, that
>>>> would be great.
>>>>
>>>> Many thanks,
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Grant Morley
>>>> Cloud Lead, Civo Ltd
>>>> www.civo.com | Sign up for an account! <https://www.civo.com/signup>
>>>>
>