I have a few operational suggestions on how I think we could do this best:

1. I think exposing a healthcheck endpoint that _actually_ runs the ping and responds with a 200 OK makes a lot more sense in terms of being able to run it inside something like Kubernetes. You end up with a "who makes the ping and who responds to it" type of scenario, which can be tricky, though I'm sure we can figure that out.

2. I've found that newer releases of RabbitMQ really help with those unusable queues after a split. I haven't had any issues at all with newer releases, so that could make your life a lot easier.

3. You mentioned you're moving towards Kubernetes. We're doing the same and building an operator (https://opendev.org/vexxhost/openstack-operator). Because the operator manages the whole thing and Kubernetes does its thing too, we started moving towards 1 (single) rabbitmq per service, which really helped a lot in stabilizing things. Oslo messaging is a lot better at recovering when a single service IP is pointing towards it, because it doesn't do weird things like have threads trying to connect to other Rabbit ports. Just a thought.

4. In terms of telemetry and making sure you avoid that issue, we track the consumption rates of queues inside OpenStack. Queues should be consumed at a steady rate and never build up; anytime one grows, we instantly detect that something is fishy. However, the other issue is that when you restart any OpenStack service, it 'forgets' all its existing queues, and you are left with a set of queues building up until they automatically expire, which happens after around 30 minutes. That makes the "things are not being consumed" alarm a little noisy if you're restarting services.

Sorry for the wall of super unorganized text, all over the place here, but thought I'd chime in with my 2 cents :)

On Mon, Jul 27, 2020 at 6:04 AM Arnaud Morin <arnaud.morin@gmail.com> wrote:
Hey all,
TLDR: I propose a change to oslo_messaging to allow doing a ping over RPC. This is useful to monitor the liveness of agents.
A few weeks ago, I proposed a patch to oslo_messaging [1] which adds a ping endpoint to the RPC dispatcher. It means that every OpenStack service using oslo_messaging RPC endpoints (almost all OpenStack services and agents, e.g. neutron server + agents, nova + computes, etc.) will then be able to answer a specific "ping" call over RPC.
I decided to propose this patch in my company mainly for 2 reasons:

1 - we are struggling to monitor our nova compute and neutron agents correctly:
1.1 - sometimes our agents are disconnected from RPC, but the python process is still running.
1.2 - sometimes the agent is still connected, but the queue / binding on the rabbit cluster is not working anymore (after a rabbit split for example). This one is very hard to debug, because the agent is still reporting its health correctly to neutron server, but it's not able to receive messages anymore.

2 - we are trying to monitor agents running in k8s pods: when running a python agent (neutron l3-agent for example) in a k8s pod, we wanted to find a way to monitor whether it is still alive or not.
Adding an RPC ping endpoint could help us solve both of these issues. Note that we still need an external mechanism (outside of OpenStack) to do this ping. We also think it could be nice for other OpenStackers, and especially for large scale ops.
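To make the "external mechanism" a bit more concrete, here is a minimal sketch of a healthcheck wrapper that runs a ping on every request and maps the result to 200 or 503, which is the shape a Kubernetes HTTP probe expects. It uses only the Python standard library. The names `check_liveness` and `make_handler`, the `ping` callable, and the 5 second timeout are all assumptions made for illustration; none of them come from the patch in [1] or from oslo_messaging.

```python
import concurrent.futures
from http.server import BaseHTTPRequestHandler


def check_liveness(ping, timeout=5.0):
    """Return True only if ping() finishes without error within `timeout` seconds.

    `ping` is a caller-supplied callable (an assumption of this sketch); in a
    real deployment it would wrap the RPC ping call to the agent.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(ping)
    try:
        # Raises on timeout, or re-raises whatever ping() raised.
        future.result(timeout=timeout)
        return True
    except Exception:
        return False
    finally:
        # Do not block the healthcheck on a ping that is still hanging.
        pool.shutdown(wait=False)


def make_handler(ping, timeout=5.0):
    """Build an HTTP handler class that runs the ping on every GET."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ok = check_liveness(ping, timeout)
            # 200 means the ping was answered; 503 tells the prober it was not.
            self.send_response(200 if ok else 503)
            self.end_headers()
            self.wfile.write(b"OK\n" if ok else b"UNHEALTHY\n")

        def log_message(self, *args):
            # Keep probe traffic out of stderr.
            pass

    return HealthHandler
```

It could be served with something like `HTTPServer(("0.0.0.0", 8080), make_handler(rpc_ping)).serve_forever()`, where `rpc_ping` is a hypothetical function that performs the RPC ping and raises on failure, and port 8080 is arbitrary. The actual RPC call is left out on purpose, since the method name and signature depend on how the patch ends up.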
Feel free to comment.
[1] https://review.opendev.org/#/c/735385/
-- Arnaud Morin
-- Mohammed Naser VEXXHOST, Inc.