Neutron RabbitMQ issues

Grant Morley grant at civo.com
Tue Mar 10 19:18:58 UTC 2020


Hi all,

We are currently experiencing some fairly major issues with our 
OpenStack cluster. They all appear to involve Neutron and RabbitMQ. We 
are seeing a lot of timeout messages while waiting for RPC replies, and 
because of this, instance creation and anything else involving instances 
and networking is broken.

We are running OpenStack Queens.

We have already tuned RabbitMQ messaging for Neutron by setting the 
following on the Neutron side:

heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300
rpc_thread_pool_size = 2048
rpc_response_timeout = 3600
rpc_poll_timeout = 60

## Rpc all
executor_thread_pool_size = 64
rpc_response_timeout = 3600
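
A quick way to sanity-check a listing like the one above is to look for 
options that are set more than once. The sketch below is only an 
illustration: the listing shows no section headers, so the helper 
(find_repeated_options, a made-up name) treats it as one flat block. In a 
real neutron.conf the same option may legitimately appear in different 
sections, which this sketch does not distinguish.

```python
from collections import defaultdict


def find_repeated_options(text):
    """Return {option: [values, ...]} for options assigned more than once.

    Assumption: the text is a flat key = value listing. Section headers,
    comments and blank lines are skipped rather than interpreted.
    """
    seen = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments ("#" or ";") and section headers ("[...]").
        if not line or line.startswith(("#", ";", "[")):
            continue
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        seen[key].append(value)
    return {key: values for key, values in seen.items() if len(values) > 1}


# The options exactly as quoted in this message.
LISTING = """\
heartbeat_timeout_threshold = 0
rpc_conn_pool_size = 300
rpc_thread_pool_size = 2048
rpc_response_timeout = 3600
rpc_poll_timeout = 60

## Rpc all
executor_thread_pool_size = 64
rpc_response_timeout = 3600
"""

if __name__ == "__main__":
    for option, values in find_repeated_options(LISTING).items():
        print(option, "is set", len(values), "times:", values)
```

Run against the listing above, this reports that rpc_response_timeout is 
set twice, both times to 3600.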

What we are seeing in the Neutron error logs for all services 
(l3-agent, dhcp, linux-bridge, etc.) are these timeouts:

https://pastebin.com/Fjh23A5a

We have manually tried to get everything in sync by forcing a fail-over 
of the networking, which seems to get the routers in sync.

We are also seeing that there are a lot of "unacknowledged" messages in 
RabbitMQ for 'q-plugin' in the neutron queues.
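
One way to watch the unacknowledged count over time is the RabbitMQ 
management plugin's HTTP API, where GET /api/queues returns one object 
per queue with fields such as messages_unacknowledged and consumers. The 
following is a minimal sketch, assuming the management plugin is enabled; 
the URL, credentials, helper names and sample values are placeholders 
for illustration, not taken from our cluster.

```python
import base64
import json
import urllib.request


def fetch_queues(base_url, user, password):
    """Fetch all queue objects from the management API (GET /api/queues).

    base_url is assumed to look like "http://rabbit-host:15672".
    """
    request = urllib.request.Request(base_url.rstrip("/") + "/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def unacked_report(queues, prefix="q-plugin"):
    """Return (name, unacknowledged, consumers) rows for matching queues,
    sorted with the largest unacknowledged count first."""
    rows = [
        (q["name"], q.get("messages_unacknowledged", 0), q.get("consumers", 0))
        for q in queues
        if q["name"].startswith(prefix)
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample data in the shape returned by /api/queues.
SAMPLE_QUEUES = [
    {"name": "q-plugin.host1", "messages_unacknowledged": 3, "consumers": 1},
    {"name": "q-plugin", "messages_unacknowledged": 120, "consumers": 4},
    {"name": "other-queue", "messages_unacknowledged": 50, "consumers": 2},
]

if __name__ == "__main__":
    # Against a live broker: unacked_report(fetch_queues(url, user, password))
    for name, unacked, consumers in unacked_report(SAMPLE_QUEUES):
        print(f"{name}: unacked={unacked} consumers={consumers}")
```

Polling this every few seconds would show whether the unacknowledged 
count keeps growing or drains after a service restart.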

Sometimes restarting the Neutron services gets these messages 
acknowledged again, but the timeouts come back.

The RabbitMQ servers themselves are not loaded at all. Memory, file 
descriptors and Erlang processes all have plenty of headroom.
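
To put numbers on that headroom, the management API's GET /api/nodes 
returns per-node counters including mem_used/mem_limit, fd_used/fd_total 
and proc_used/proc_total. The sketch below only does the arithmetic on 
one such node object; the helper name and the sample values are 
hypothetical, and fetching the data is assumed to be done separately.

```python
def node_usage(node):
    """Return used/total ratios for memory, file descriptors and Erlang
    processes from one /api/nodes entry. Missing or zero totals are
    skipped rather than guessed."""
    pairs = {
        "memory": ("mem_used", "mem_limit"),
        "file_descriptors": ("fd_used", "fd_total"),
        "erlang_processes": ("proc_used", "proc_total"),
    }
    usage = {}
    for label, (used_key, total_key) in pairs.items():
        total = node.get(total_key)
        if total:
            usage[label] = node.get(used_key, 0) / total
    return usage


# Made-up values, only to show the shape of the input.
SAMPLE_NODE = {
    "name": "rabbit@example",
    "mem_used": 512,
    "mem_limit": 2048,
    "fd_used": 100,
    "fd_total": 1000,
    "proc_used": 500,
    "proc_total": 1000000,
}

if __name__ == "__main__":
    for resource, ratio in node_usage(SAMPLE_NODE).items():
        print(f"{resource}: {ratio:.2%} used")
```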

We are also seeing a lot of RPC issues:

Timeout in RPC method release_dhcp_port. Waiting for 1523 seconds before 
next attempt. If the server is not down, consider increasing the 
rpc_response_timeout option as Neutron server(s) may be overloaded and 
unable to respond quickly enough.: MessagingTimeout: Timed out waiting 
for a reply to message ID 965fa44ab4f6462fa378a1cf7259aad4
2020-03-10 19:02:33.548 16242 ERROR neutron.common.rpc 
[req-a858afbb-5083-4e21-a309-6ee53582c4d9 - - - - -] Timeout in RPC 
method release_dhcp_port. Waiting for 3347 seconds before next attempt. 
If the server is not down, consider increasing the rpc_response_timeout 
option as Neutron server(s) may be overloaded and unable to respond 
quickly enough.: MessagingTimeout: Timed out waiting for a reply to 
message ID 7937465f15634fbfa443fe1758a12a9c
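
To see which RPC methods time out most often and how long the agents wait 
between attempts, the log lines can be summarised with a small script. 
This is a rough sketch that assumes the message wording shown above; the 
function name is made up, and the whitespace collapsing is there only 
because long lines may be wrapped when pasted.

```python
import re
from collections import defaultdict

# Matches the "Timeout in RPC method X. Waiting for N seconds" message.
TIMEOUT_RE = re.compile(
    r"Timeout in RPC method (?P<method>\w+)\. "
    r"Waiting for (?P<wait>\d+) seconds"
)


def summarize_timeouts(log_text):
    """Return {method: [wait_seconds, ...]} in order of appearance."""
    # Collapse all whitespace so wrapped lines still match the pattern.
    flat = " ".join(log_text.split())
    waits = defaultdict(list)
    for match in TIMEOUT_RE.finditer(flat):
        waits[match.group("method")].append(int(match.group("wait")))
    return dict(waits)


# Shortened excerpt of the two entries quoted in this message.
EXCERPT = """\
Timeout in RPC method release_dhcp_port. Waiting for 1523 seconds before
next attempt.
2020-03-10 19:02:33.548 16242 ERROR neutron.common.rpc
[req-a858afbb-5083-4e21-a309-6ee53582c4d9 - - - - -] Timeout in RPC
method release_dhcp_port. Waiting for 3347 seconds before next attempt.
"""

if __name__ == "__main__":
    for method, waits in summarize_timeouts(EXCERPT).items():
        print(f"{method}: {len(waits)} timeouts, waits={waits}")
```

For the two entries above this gives release_dhcp_port with waits of 1523 
and 3347 seconds, both below the configured rpc_response_timeout of 3600.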

Does anyone know if there is any more tuning to be done at all? 
Upgrading to a newer version isn't really an option for us at the 
moment, unfortunately.

Because of our setup, we also have roughly 800 routers enabled, and I 
know that will be putting load on the system. However, these problems 
only started roughly a week ago and have steadily got worse.

If anyone has seen this before or has any more recommendations, that 
would be great.

Many thanks,



