Neutron RabbitMQ issues
On Wed, Mar 11, 2020 at 10:24 AM Grant Morley <grant@civo.com> wrote:

Hi all,

We are currently experiencing some fairly major issues with our OpenStack cluster. It all appears to be with Neutron and RabbitMQ. We are seeing a lot of timeouts while waiting for RPC replies, and because of this, instance creation and anything else involving instances and networking is broken.

We are running OpenStack Queens.

We have already tuned RabbitMQ for Neutron by setting the following on the Neutron side:

heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300
rpc_thread_pool_size = 2048
rpc_response_timeout = 3600
rpc_poll_timeout = 60

## Rpc all
executor_thread_pool_size = 64
rpc_response_timeout = 3600

What we are seeing in the error logs for all of the Neutron services (l3-agent, dhcp-agent, linuxbridge-agent, etc.) are these timeouts: https://pastebin.com/Fjh23A5a

We have manually tried to get everything back in sync by forcing a failover of the networking, which does seem to get the routers in sync.

We are also seeing a lot of "unacknowledged" messages in RabbitMQ for 'q-plugin' in the Neutron queues. Sometimes restarting the services on the Neutron nodes gets these acknowledged again, but the timeouts come back.

The RabbitMQ servers themselves are not loaded at all; memory, file descriptors and Erlang processes all have plenty of resources available.

We are also seeing a lot of RPC errors such as:

Timeout in RPC method release_dhcp_port. Waiting for 1523 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 965fa44ab4f6462fa378a1cf7259aad4
2020-03-10 19:02:33.548 16242 ERROR neutron.common.rpc [req-a858afbb-5083-4e21-a309-6ee53582c4d9 - - - - -] Timeout in RPC method release_dhcp_port. Waiting for 3347 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.: MessagingTimeout: Timed out waiting for a reply to message ID 7937465f15634fbfa443fe1758a12a9c

Does anyone know if there is any more tuning to be done? Upgrading to a newer version isn't really an option for us at the moment, unfortunately.

Because of our setup we also have roughly 800 routers enabled, and I know that puts load on the system. However, these problems only started roughly a week ago and have steadily got worse.

If anyone has seen similar cases or has any further recommendations, that would be great.

Many thanks,
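For anyone wanting to try the same tuning, a minimal sketch of where those options would normally sit in neutron.conf is below. The section placement follows the usual oslo.messaging layout and is an assumption rather than something confirmed in this thread, so check it against the oslo.messaging release shipped with Queens:

[DEFAULT]
# oslo.messaging RPC options; executor_thread_pool_size superseded the older
# rpc_thread_pool_size name, so setting both is normally redundant
rpc_response_timeout = 3600
executor_thread_pool_size = 64
rpc_conn_pool_size = 300
# rpc_poll_timeout appears to be a ZeroMQ-driver option and likely has no
# effect with the rabbit:// transport
rpc_poll_timeout = 60

[oslo_messaging_rabbit]
# 0 disables AMQP heartbeats entirely; a small non-zero value instead keeps
# idle connections from being silently dropped by firewalls or load balancers
heartbeat_timeout_threshold = 0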
On Wednesday, March 11, 2020 5:14 PM, Satish Patel <satish.txt@gmail.com> wrote:

I am also dealing with some sort of RabbitMQ performance issue, although it's not as bad as yours. This is my favorite video on the subject; not sure whether you have seen it before, but posting it here anyway: https://www.youtube.com/watch?v=bpmgxrPOrZw
On Wed, Mar 11, 2020 at 9:05 PM Erik Olof Gunnar Andersson <eandersson@blizzard.com> wrote:

We are hitting something awfully similar. We have basically been hitting a few pretty serious bugs with RabbitMQ.

The main one is that when a RabbitMQ server crashes or gets split-brained it does not always recover, and sometimes not even when just one node is restarted. We sometimes end up with orphaned consumers that keep consuming messages, but the messages effectively go to /dev/null. Another issue is that sometimes bindings stop working: they are visibly there, but simply do not route traffic to the intended queues, e.g. https://github.com/rabbitmq/rabbitmq-server/issues/641

I wrote two quick scripts to audit these issues:

http://paste.openstack.org/show/790569/ - check whether you have orphaned consumers (may need pagination if you have a large deployment).
http://paste.openstack.org/show/790570/ - check whether the bindings are bad for a specific queue.

The main issue seems to be that the number of queues plus connections causes bindings and/or queues to end up in an "orphaned" state during the recovery after restarting a node.

Best Regards,
Erik Olof Gunnar Andersson
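The pastes above are not reproduced here, so the snippet below is only a rough sketch of the same kind of audit: it asks the RabbitMQ management HTTP API for queues that have a message backlog but no consumers attached. The endpoint, credentials and vhost name are placeholders, and a large deployment may need the API's pagination parameters.

#!/usr/bin/env python
# Rough sketch only (not the scripts from the pastes above): flag queues that
# have messages backing up with no consumers attached, using the RabbitMQ
# management HTTP API. Host, credentials and vhost name are placeholders.
import requests

RABBIT_API = "http://rabbit1.example.com:15672/api"   # management plugin endpoint
AUTH = ("monitoring", "secret")                        # placeholder credentials


def queues_without_consumers(vhost="/neutron"):
    resp = requests.get(RABBIT_API + "/queues", auth=AUTH)
    resp.raise_for_status()
    for queue in resp.json():
        if queue.get("vhost") != vhost:
            continue
        if queue.get("consumers", 0) == 0 and queue.get("messages", 0) > 0:
            yield queue["name"], queue["messages"]


if __name__ == "__main__":
    for name, backlog in queues_without_consumers():
        print("%s: %d messages, no consumers" % (name, backlog))

A binding check along the lines of the second paste would similarly walk the bindings reported by the management API for the queue in question and compare them against the routing keys the consumers expect.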
Satish Patel <satish.txt@gmail.com> wrote:

Totally agreed with you. I had a similar issue when my cluster got split-brained and could not recover from that state; in the end I had to rebuild it from scratch to make it functional. There isn't any good guideline on RabbitMQ capacity planning, and every deployment is unique. Anyway, thanks for those scripts, I will hook them up with my monitoring system.
On 19/03/2020 16:27, Satish Patel wrote:

Erik,

If I want to adopt the following settings, where should I add them in a Queens OpenStack deployment: on neutron-server, or on all my compute nodes? Which setting goes where?

heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300
rpc_thread_pool_size = 2048
rpc_response_timeout = 3600
rpc_poll_timeout = 60

## Rpc all
executor_thread_pool_size = 64
rpc_response_timeout = 3600
On 19/03/2020 16:35, Grant Morley wrote:

Hi Satish,

You will need to add those to the "neutron.conf" file on your network nodes. If you are running OS-A, I would do it on your "neutron-server" nodes and add the following to your agent containers:

executor_thread_pool_size = 64
rpc_response_timeout = 3600

Regards,
To add to this as well, I would recommend you run the latest version of the neutron code, "17.1.17", for OS-A Queens. We found that upgrading to that release, as well as making those changes, really helped.

Grant
On 19/03/2020 16:53, Satish Patel wrote:

I am running openstack-ansible (both Queens and Stein), so this is what I am going to do; am I doing it correctly?

neutron-server (container), of which I have 3 neutron nodes:

heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300
rpc_thread_pool_size = 2048
rpc_response_timeout = 3600
rpc_poll_timeout = 60

330 compute nodes (agent neutron.conf), going to add the following:

executor_thread_pool_size = 64
rpc_response_timeout = 3600

How about nova? Should I be doing the same on nova as well to reduce the load on RabbitMQ?
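Since this is an openstack-ansible deployment, the more maintainable route is usually to set these through config overrides in /etc/openstack_deploy/user_variables.yml instead of editing neutron.conf by hand. The variable name and section layout below follow OSA's standard override mechanism but should be treated as an assumption and verified against the OSA Queens documentation:

neutron_neutron_conf_overrides:
  DEFAULT:
    rpc_response_timeout: 3600
    executor_thread_pool_size: 64
    rpc_conn_pool_size: 300
  oslo_messaging_rabbit:
    heartbeat_timeout_threshold: 0

After changing the overrides, re-running the neutron playbook (os-neutron-install.yml) re-renders neutron.conf in the affected containers and hosts.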
On Thu, Mar 19, 2020 at 1:02 PM Grant Morley <grant@civo.com> wrote:

Correct. You need to add:

heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300
rpc_thread_pool_size = 2048
rpc_response_timeout = 3600
rpc_poll_timeout = 60

to your Neutron nodes, and you can add:

executor_thread_pool_size = 64
rpc_response_timeout = 3600

to your compute nodes (neutron.conf). However, I found that just adding the changes to the Neutron servers really helped. I would recommend starting with your Neutron nodes first to see if that helps; if you find your compute nodes are still having issues, then change the settings on those afterwards.

Regards,
On 19/03/2020 17:13, Satish Patel wrote:

How about rpc_workers? Currently I have rpc_workers = 1.
On Thu, Mar 19, 2020 at 1:26 PM Grant Morley <grant@civo.com> wrote:

We left ours on the default value of 1 and that still seems to be fine.

Grant
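For reference, that option lives in neutron.conf on the neutron-server side. The snippet below is only a sketch showing the default being discussed, not a recommendation:

[DEFAULT]
# Number of RPC worker processes neutron-server forks to service agent RPC
# (state reports and the q-plugin queue). Raising it can help when many
# agents are connected, but each extra worker also opens more RabbitMQ
# connections.
rpc_workers = 1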
Satish Patel <satish.txt@gmail.com> wrote:

Great, thanks! Did you also tune your nova component for RabbitMQ?
We left ours on the default value of 1 and that still seems to be fine.
Grant On 19/03/2020 17:13, Satish Patel wrote:
how about rpc_worker ?
currently i have rpc_worker=1
On Thu, Mar 19, 2020 at 1:02 PM Grant Morley <grant@civo.com> wrote:
Correct, you need to add:
heartbeat_timeout_threshold = 0 rpc_conn_pool_size = 300 rpc_thread_pool_size = 2048 rpc_response_timeout = 3600 rpc_poll_timeout = 60
To your Neutron nodes
And you can add:
executor_thread_pool_size = 64 rpc_response_timeout = 3600
To your compute nodes (neutron.conf). However, I found that just adding the changes to the neutron servers really helped.
I would recommend just starting with your neutron nodes first to see if that helps. If you find your compute nodes are still having issues then change the settings on those after.
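For reference, a minimal sketch of how that server-side list could be laid out in neutron.conf. The section split is an assumption that simply mirrors the [DEFAULT] / [oslo_messaging_rabbit] layout shown further down in this thread, and rpc_poll_timeout is left out because it appears to belong to a different oslo.messaging driver; verify option names and groups against your release's configuration reference:

[DEFAULT]
rpc_response_timeout = 3600
# older deployments use rpc_thread_pool_size; newer oslo.messaging releases call this executor_thread_pool_size
rpc_thread_pool_size = 2048

[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300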
Regards, On 19/03/2020 16:53, Satish Patel wrote:
I am running openstack-ansible (Queens and Stein both), so this is what I am going to do; am I doing this correctly?
On the neutron-server containers (I have 3 neutron nodes):
heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300
rpc_thread_pool_size = 2048
rpc_response_timeout = 3600
rpc_poll_timeout = 60
On the 330 compute nodes (agent neutron.conf) I am going to add the following:
executor_thread_pool_size = 64
rpc_response_timeout = 3600
How about Nova? Should I be doing that on Nova as well to reduce the load on RabbitMQ?
On Thu, Mar 19, 2020 at 12:35 PM Grant Morley <grant@civo.com> wrote:
Hi Satish,
You will need to add those to the "neutron.conf" file on your network nodes. If you are running OS-A, I would do it on your "neutron-server" nodes and add the following to your agent containers:
executor_thread_pool_size = 64
rpc_response_timeout = 3600
Regards, On 19/03/2020 16:27, Satish Patel wrote:
Erik,
If I want to adopt the following settings, where should I add them in Queens OpenStack: neutron-server or all my compute nodes? Which setting goes where?
heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300
rpc_thread_pool_size = 2048
rpc_response_timeout = 3600
rpc_poll_timeout = 60
## Rpc all
executor_thread_pool_size = 64
rpc_response_timeout = 3600
On Wed, Mar 11, 2020 at 9:05 PM Erik Olof Gunnar Andersson <eandersson@blizzard.com> wrote:
We are hitting something awfully similar.
We have basically been hitting a few pretty serious bugs with RabbitMQ.
The main one is that when a RabbitMQ server crashes or gets split-brain it does not always recover, even when just one node is restarted. We sometimes end up with orphaned consumers that keep consuming messages but effectively send them to /dev/null. Another issue is that sometimes bindings stop working: they are visually there, but simply do not route traffic to the intended queues.
e.g. https://github.com/rabbitmq/rabbitmq-server/issues/641
I wrote two quick scripts to audit these issues.
http://paste.openstack.org/show/790569/ - Check if you have orphaned consumers (may need pagination if you have a large deployment).
http://paste.openstack.org/show/790570/ - Check if the bindings are bad for a specific queue.
The main issue seems to be that the number of queues + connections causes the recovery after restarting a node to leave bindings and/or queues in an "orphaned" state.
Best Regards, Erik Olof Gunnar Andersson
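The linked pastes are not reproduced here. As a rough illustration of the same kind of audit, the sketch below uses the RabbitMQ management HTTP API to flag queues that have a message backlog but no consumers attached; the endpoint URL, credentials and vhost are placeholders, and it assumes the management plugin is enabled on port 15672:

#!/usr/bin/env python3
# Rough sketch only: flag queues with a backlog but no consumers via the
# RabbitMQ management HTTP API. Endpoint, credentials and vhost below are
# placeholders for illustration.
import sys
from urllib.parse import quote

import requests

RABBIT_API = "http://rabbit1.example.com:15672/api"  # management plugin endpoint (placeholder)
AUTH = ("monitoring", "secret")                      # read-only management user (placeholder)
VHOST = "/neutron"                                   # per-service vhost, adjust to your deployment

def suspect_queues():
    # GET /api/queues/<vhost> returns one JSON object per queue, including
    # the "consumers" and "messages" counters used below.
    url = "{}/queues/{}".format(RABBIT_API, quote(VHOST, safe=""))
    resp = requests.get(url, auth=AUTH, timeout=30)
    resp.raise_for_status()
    for queue in resp.json():
        if queue.get("consumers", 0) == 0 and queue.get("messages", 0) > 0:
            yield queue["name"], queue["messages"]

if __name__ == "__main__":
    found = False
    for name, backlog in suspect_queues():
        found = True
        print("{}: {} messages and no consumers".format(name, backlog))
    sys.exit(1 if found else 0)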
I have a question about the following setting: why are you disabling the heartbeat timeout?
heartbeat_timeout_threshold = 0
One more question: where should I add these options in the neutron.conf file? I have the following two sections; should I be adding all of those options inside [oslo_messaging_rabbit] or [DEFAULT]? Also, what is the difference between executor_thread_pool_size and rpc_thread_pool_size, or are they both the same thing?
[DEFAULT]
...
executor_thread_pool_size = 64
rpc_response_timeout = 60
...
[oslo_messaging_rabbit]
ssl = True
rpc_conn_pool_size = 30
Hi,
There was a bug in Queens that meant there was an issue with the heartbeat timeouts. Setting it to 0 gets around that bug. I believe that was fixed in Rocky and above, so your Stein installation should be fine.
Setting the value to 0 for us meant we stopped getting errors in the logs for:
"Too many heartbeats missed, trying to force connect to RabbitMQ"
Regards,
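For anyone following along, a minimal sketch of where that workaround would normally sit, on the assumption that the option lives under [oslo_messaging_rabbit] as shown elsewhere in this thread:

[oslo_messaging_rabbit]
# 0 disables the oslo.messaging heartbeat entirely (the Queens-era workaround described above);
# on releases with the fix, the usual defaults (heartbeat_timeout_threshold = 60, heartbeat_rate = 2) are normally left alone
heartbeat_timeout_threshold = 0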
Grant,
But I am seeing lots of the following log messages on my compute nodes running the Stein release:
2020-03-20 10:34:46.132 53425 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed.
Find attached a screenshot of one of the RabbitMQ nodes; I have a lot of messages in "Ready: 15339". Does this look normal to you?
If you tune rabbit with:
heartbeat_timeout_threshold = 0
that should help with the error message you are getting.
That is a lot of messages queued. We had the same because we were not using ceilometer but still had the "notifications" turned on for it for services.
Are all of the ready messages for "notifications.info" for the various services (Nova, Neutron, Keystone etc.)?
If that is the case you can disable those messages in the config files for those services. Look for:
# Notifications
[oslo_messaging_notifications]
notification_topics = notifications
driver = noop
Make sure the driver option is set to "noop"; by default it will be set to "messagingv2". Then restart the service and that should stop it sending messages to the queue. You can then purge the "notifications.info" queue for Nova or Neutron etc.
We only had the "messages ready" build-up when we had the setting for ceilometer configured but were not using it. Also, only purge a queue if it is for that reason. Do not purge a queue for any other reason, as it can cause issues.
Hope that helps.
Grant
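As a concrete illustration of the purge step described above; the vhost and queue names here are assumptions based on the rest of this thread (an openstack-ansible style /neutron vhost and the notifications.info queue), so double-check them against your own broker before purging anything:

# list queue depth and consumer counts on the neutron vhost
rabbitmqctl list_queues -p /neutron name messages consumers
# purge only the unused notification queue once the driver is set to noop
rabbitmqctl purge_queue -p /neutron notifications.info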
Oh, you are right here. I have the following stuff in my neutron.conf on the server:
# Notifications
[oslo_messaging_notifications]
driver = messagingv2
topics = notifications
transport_url = rabbit://neutron:5be2a043f9a93adbd@172.28.15.192:5671,neutron:5be2a043f9a93adbd@172.28.15.248:5671,neutron:5be2a043f9a93adbd@172.28.15.22:5671//neutron?ssl=1
# Messaging
[oslo_messaging_rabbit]
rpc_conn_pool_size = 30
ssl = True
These are the changes I am going to make; let me know if anything is missing:
[DEFAULT]
executor_thread_pool_size = 2048 <--- is this correct? I didn't see "rpc_thread_pool_size" anywhere
rpc_response_timeout = 3600
[oslo_messaging_notifications]
topics = notifications
driver = noop
# Messaging
[oslo_messaging_rabbit]
rpc_conn_pool_size = 300
heartbeat_timeout_threshold = 0
ssl = True
Should I be adding this to all my compute nodes also?
Btw you might not necessarily be having RabbitMQ issues. You might also be experiencing something like this: https://bugs.launchpad.net/neutron/+bug/1853071

Best Regards, Erik Olof Gunnar Andersson

From: Satish Patel <satish.txt@gmail.com> Sent: Friday, March 20, 2020 9:23 AM To: Grant Morley <grant@civo.com> Cc: Erik Olof Gunnar Andersson <eandersson@blizzard.com>; openstack-discuss@lists.openstack.org Subject: Re: Neutron RabbitMQ issues

Oh, you are right. Here is what I currently have in my neutron.conf on the server:

# Notifications
[oslo_messaging_notifications]
driver = messagingv2
topics = notifications
transport_url = rabbit://neutron:5be2a043f9a93adbd@172.28.15.192:5671,neutron:5be2a043f9a93adbd@172.28.15.248:5671,neutron:5be2a043f9a93adbd@172.28.15.22:5671//neutron?ssl=1

# Messaging
[oslo_messaging_rabbit]
rpc_conn_pool_size = 30
ssl = True

This is the change I am going to make; let me know if anything is missing:

[DEFAULT]
executor_thread_pool_size = 2048  <--- is this correct? I didn't see "rpc_thread_pool_size" anywhere
rpc_response_timeout = 3600

[oslo_messaging_notifications]
topics = notifications
driver = noop

# Messaging
[oslo_messaging_rabbit]
rpc_conn_pool_size = 300
heartbeat_timeout_threshold = 0
ssl = True

Should I be adding this to all my compute nodes as well?

On Fri, Mar 20, 2020 at 11:40 AM Grant Morley <grant@civo.com> wrote:

If you tune rabbit with:

heartbeat_timeout_threshold = 0

that should help with the error message you are getting.

That is a lot of messages queued. We had the same because we were not using Ceilometer but still had the "notifications" turned on for the services. Are all of the ready messages on "notifications.info" for the various services (Nova, Neutron, Keystone etc.)? If so, you can disable those messages in the config files for those services. Look for:

# Notifications
[oslo_messaging_notifications]
notification_topics = notifications
driver = noop

Make sure the driver option is set to "noop" (by default it will be set to "messagingv2"), then restart the service and it should stop sending messages to the queue. You can then purge the "notifications.info" queue for Nova, Neutron etc. We only saw the "messages ready" build-up while the Ceilometer setting was enabled but Ceilometer was not in use. Also, only purge a queue for that reason; do not purge a queue for any other reason, as it can cause issues.

Hope that helps.

Grant

On 20/03/2020 14:38, Satish Patel wrote:

Grant, I am seeing lots of the following logs on my compute nodes running the Stein release:
2020-03-20 10:34:46.132 53425 WARNING oslo.messaging._drivers.impl_rabbit [-] Unexpected error during heartbeart thread processing, retrying...: ConnectionForced: Too many heartbeats missed.

Find attached a screenshot of one of the RabbitMQ nodes. I have lots of messages in "Ready: 15339" - does that look normal to you?

On Fri, Mar 20, 2020 at 4:55 AM Grant Morley <grant@civo.com> wrote:

Hi,

There was a bug in Queens that caused an issue with the heartbeat timeouts, and setting the value to 0 works around it. I believe it was fixed in Rocky and above, so your Stein installation should be fine. Setting the value to 0 meant we stopped getting errors in the logs such as: "Too many heartbeats missed, trying to force connect to RabbitMQ".

Regards,

On 19/03/2020 18:53, Satish Patel wrote:

I have a question about the following setting - why are you disabling the heartbeat timeout?

heartbeat_timeout_threshold = 0

On Thu, Mar 19, 2020 at 1:32 PM Satish Patel <satish.txt@gmail.com> wrote:

Great, thanks! Did you guys tune your Nova component for RabbitMQ?

On Thu, Mar 19, 2020 at 1:26 PM Grant Morley <grant@civo.com> wrote:

We left ours on the default value of 1 and that still seems to be fine.

Grant

On 19/03/2020 17:13, Satish Patel wrote:

How about rpc_workers? Currently I have rpc_workers = 1.

On Thu, Mar 19, 2020 at 1:02 PM Grant Morley <grant@civo.com> wrote:

Correct, you need to add:
heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300
rpc_thread_pool_size = 2048
rpc_response_timeout = 3600
rpc_poll_timeout = 60
to your Neutron nodes. And you can add:
executor_thread_pool_size = 64
rpc_response_timeout = 3600
to your compute nodes (neutron.conf). However, I found that just adding the changes to the neutron servers really helped. I would recommend starting with your neutron nodes first to see if that helps. If you find your compute nodes are still having issues, then change the settings on those afterwards.

Regards,

On 19/03/2020 16:53, Satish Patel wrote:

I am running openstack-ansible (both Queens and Stein), so this is what I am going to do - am I doing it correctly?

neutron-server (container) - I have 3 neutron nodes:
heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300
rpc_thread_pool_size = 2048
rpc_response_timeout = 3600
rpc_poll_timeout = 60
330 compute nodes (agent neutron.conf) - going to add the following:
executor_thread_pool_size = 64
rpc_response_timeout = 3600
How about Nova? Should I be doing that on Nova as well to reduce load on RabbitMQ?

On Thu, Mar 19, 2020 at 12:35 PM Grant Morley <grant@civo.com> wrote:

Hi Satish,

You will need to add those to the "neutron.conf" file on your network nodes. If you are running OS-A I would do it on your "neutron-server" nodes and add the following to your agent containers:

executor_thread_pool_size = 64
rpc_response_timeout = 3600

Regards,

On 19/03/2020 16:27, Satish Patel wrote:

Erik,

If I want to adopt the following settings, where should I add them in Queens OpenStack - neutron-server or all my compute nodes? Which setting goes where?

heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300
rpc_thread_pool_size = 2048
rpc_response_timeout = 3600
rpc_poll_timeout = 60

## Rpc all
executor_thread_pool_size = 64
rpc_response_timeout = 3600
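Pulling the placement advice in this exchange together, the end state would look roughly like the sketch below. It simply mirrors the options quoted above and the sections Satish later shows from his own neutron.conf; exact section placement differs between oslo.messaging releases, so verify against the configuration reference for your version before applying it.

# neutron.conf on the neutron-server / network nodes
[DEFAULT]
rpc_response_timeout = 3600

[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300

# neutron.conf on the compute-node agents
[DEFAULT]
executor_thread_pool_size = 64
rpc_response_timeout = 3600

Note that rpc_thread_pool_size and rpc_poll_timeout from the original list do not appear to exist for the rabbit driver in current oslo.messaging (executor_thread_pool_size seems to be the renamed successor of the former), which is probably why Satish could not find them.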
Erik,

That is a good finding. I have checked the following file:

/openstack/venvs/neutron-19.0.0.0rc3.dev6/lib/python2.7/site-packages/neutron/objects/agent.py

Do you think I should add the following option and restart neutron-server? Is this for all compute node agents or just for the server?

new_facade = True
This should just be for the server afaik. I haven't tried it out myself, but we for sure have the same issue. We just scaled out the number of workers as a workaround. In fact, we even added neutron-servers on VMs to handle the issue.

Best Regards, Erik Olof Gunnar Andersson
Do you think I should try adding the option "new_facade = True" by hand on the server and restarting the neutron-server services? I am not seeing any extra dependency for that option, so it looks very simple to add.

When you say you scaled out the number of workers, do you mean you added multiple neutron-servers on a bunch of VMs to spread the load? (I have 3x controller nodes running on physical servers.)
When I say scale up, I mean that, yes - plus of course bumping the number of workers (rpc_workers) to an appropriate value.

There are risks with modifying that. It might be worth asking in the neutron channel on IRC, at least if this is your production deployment. If possible, maybe test it in a lab or staging deployment first.

Best Regards, Erik Olof Gunnar Andersson
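For illustration only, the change being discussed is the rpc_workers option in neutron.conf on the neutron-server nodes. The value below is made up, and as noted above it is worth running it past the neutron folks and a staging environment before touching production:

[DEFAULT]
# Number of RPC worker processes forked by neutron-server to serve agent RPC traffic.
# Illustrative value only - size it to your controllers, agent count and message volume.
rpc_workers = 8

Each additional worker should mean more consumers on the q-plugin queue, which is why spreading neutron-server out (including onto VMs, as described above) can help with the unacknowledged-message build-up.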
Sure, I will do that. Thanks!
Hi,

We personally did not tune Nova for RabbitMQ; we found that it was doing a good enough job for us.

Grant

On 19/03/2020 17:32, Satish Patel wrote:
Great, thanks! Did you guys tune your Nova component for RabbitMQ?
participants (3)
- Erik Olof Gunnar Andersson
- Grant Morley
- Satish Patel