Neutron metadata service not responding due to rabbitmq dropping messages

24 Feb 2020

Hi all,

We have recently come across an issue where our metadata service stops 
responding. If you try to curl the service from within an instance you get:

% curl http://169.254.169.254
<html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
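
To see how often this happens, and how long each request takes before 
failing, the same check can be scripted. This is only a rough sketch: the 
function name "probe", the 10 second timeout and the injectable "opener" 
argument are illustrative choices, not part of any OpenStack tooling. Only 
the address comes from the curl call above.

```python
import time
import urllib.error
import urllib.request

# Address taken from the curl call above.
METADATA_URL = "http://169.254.169.254"


def probe(url=METADATA_URL, timeout=10.0, opener=urllib.request.urlopen):
    """Issue one GET and return (http_status_or_None, seconds_elapsed).

    `opener` is a hypothetical hook so the function can be exercised
    without a real metadata service; it defaults to urllib's urlopen.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # The server answered with an error code, e.g. 504.
        status = exc.code
    except (urllib.error.URLError, OSError):
        # No HTTP answer at all (connection refused, socket timeout, ...).
        status = None
    return status, time.monotonic() - start
```

Calling probe() in a loop from inside an instance and recording the 
returned pair should show whether the 504s line up with the RPC timeouts 
on the neutron nodes.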

After doing some digging around on our neutron nodes I noticed we were 
getting loads of RabbitMQ timeout errors whilst trying to process 
message requests:

2020-02-24 07:28:09.747 26378 ERROR neutron.common.rpc [-] Timeout in 
RPC method get_ports. Waiting for 26 seconds before next attempt. If the 
server is not down, consider increasing the rpc_response_timeout option 
as Neutron server(s) may be overloaded and unable to respond quickly 
enough.: MessagingTimeout: Timed out waiting for a reply to message ID 
a14c4a1395864cd980c1ec563a5c48aa
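
To get a feel for which RPC methods are timing out, the log lines can be 
tallied per method. The sketch below assumes each log entry sits on a 
single line (it is only wrapped here by the mail client) and that the 
wording matches the entry above; the function name "count_rpc_timeouts" 
is made up for illustration.

```python
import re
from collections import Counter

# Matches the "Timeout in RPC method <name>. Waiting for <n> seconds"
# wording seen in the log entry above.
TIMEOUT_RE = re.compile(
    r"Timeout in\s+RPC method (?P<method>\w+)\.\s+"
    r"Waiting for (?P<wait>\d+) seconds"
)


def count_rpc_timeouts(lines):
    """Return a Counter mapping RPC method name to number of timeouts."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("method")] += 1
    return counts
```

Passing an open log file to count_rpc_timeouts() would give a per-method 
count, which might show whether get_ports is the only call affected.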

The servers are fairly busy, but this is not a massive installation: 
we have just over 1500 instances and roughly 850 routers.

If I restart the "neutron-metadata-agent" and "neutron-server" 
services, the issue goes away for a while, but it ultimately comes 
back.

I did increase the "rpc_timeout" on the neutron nodes to 120 seconds, 
but that seems quite long to me.

The RabbitMQ servers are likewise not overly busy. We see a fairly 
constant 40+ messages in the queue at any one time, though that can 
spike depending on workload.

Does anyone know of any tuning or tweaking we can do to the metadata 
service in either Neutron or Nova that might help?

We are running OpenStack Queens if that helps.

Many thanks,

Grant

Grant Morley