Tagging with Nova and Neutron as they are mentioned and I thought some people from those teams had opinions on this.
Can you refresh my memory on why we dropped this before? I recall talking about it in Denver, but I can't for the life of me remember what the conclusion was. Did we intend to use something else for this that has since fallen through?
On 7/27/20 4:57 AM, Arnaud Morin wrote:
Hey all,
TLDR: I propose a change to oslo_messaging to allow doing a ping over RPC; this is useful for monitoring the liveness of agents.
A few weeks ago, I proposed a patch to oslo_messaging [1] which adds a ping endpoint to the RPC dispatcher. It means that every OpenStack service which uses oslo_messaging RPC endpoints (almost all OpenStack services and agents - e.g. neutron server + agents, nova + computes, etc.) will then be able to answer a specific "ping" call over RPC.
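To make the idea concrete, here is a minimal, self-contained sketch of a dispatcher that always exposes a ping endpoint next to the service's own endpoints. It is not the code from [1] and it does not use oslo_messaging; the class names (PingEndpoint, Dispatcher, AgentEndpoint) and the method names "ping" and "router_updated" are hypothetical and only illustrate the mechanism.

```python
class PingEndpoint:
    """Hypothetical endpoint whose only job is to prove the RPC path works."""

    def ping(self, ctxt, **kwargs):
        return "pong"


class Dispatcher:
    """Toy stand-in for an RPC dispatcher; not the oslo_messaging class."""

    def __init__(self, endpoints):
        # Append the ping endpoint so that every server answers "ping"
        # without each service having to implement it.
        self.endpoints = list(endpoints) + [PingEndpoint()]

    def dispatch(self, ctxt, method, **kwargs):
        # Call the first endpoint that exposes the requested method.
        for endpoint in self.endpoints:
            func = getattr(endpoint, method, None)
            if callable(func):
                return func(ctxt, **kwargs)
        raise LookupError("no endpoint implements %r" % method)


class AgentEndpoint:
    """Hypothetical service endpoint, e.g. for an L3 agent."""

    def router_updated(self, ctxt, router_id):
        return "updated %s" % router_id


dispatcher = Dispatcher([AgentEndpoint()])
print(dispatcher.dispatch({}, "ping"))
```

Because the ping endpoint is added by the dispatcher itself, the services built on top of it would not need any change to become pingable.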
I decided to propose this patch in my company mainly for 2 reasons:
1 - we are struggling to monitor our nova compute and neutron agents correctly:
1.1 - sometimes our agents are disconnected from RPC, but the python process is still running.
1.2 - sometimes the agent is still connected, but the queue / binding on the rabbit cluster is not working anymore (after a rabbit split for example). This one is very hard to debug, because the agent is still reporting health correctly to the neutron server, but it is not able to receive messages anymore.
2 - we are trying to monitor agents running in k8s pods: when running a python agent (neutron l3-agent for example) in a k8s pod, we wanted to find a way to monitor whether it is still alive or not.
Adding an RPC ping endpoint could help us solve both of these issues. Note that we still need an external mechanism (outside of OpenStack) to do this ping. We also think it could be useful to other OpenStackers, especially large-scale ops.
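As a rough sketch of what such an external mechanism could look like, the helper below turns the result of one ping into an exit status, which is the shape a k8s exec liveness probe or a monitoring check expects. Everything here is an assumption made for illustration: probe() and the "pong" reply are hypothetical, and rpc_ping stands for whatever callable performs the real RPC call with its own timeout.

```python
def probe(rpc_ping):
    """Return 0 if rpc_ping() answers "pong", otherwise 1.

    rpc_ping is a hypothetical callable that sends the ping over RPC
    and raises an exception on timeout or connection failure.
    """
    try:
        alive = rpc_ping() == "pong"
    except Exception:
        # A timeout or any RPC error counts as "not alive".
        alive = False
    return 0 if alive else 1


def fake_rpc_ping():
    # Stand-in for a real RPC call, used only to exercise probe().
    return "pong"


print(probe(fake_rpc_ping))
```

A real check would pass a callable that goes through the same queue / binding as the agent's normal traffic, so that case 1.2 above is detected as well.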
Feel free to comment.