[largescale-sig] RPC ping

Mohammed Naser mnaser at vexxhost.com
Mon Aug 3 14:21:33 UTC 2020


I have a few operational suggestions on how I think we could do this best:

1. I think exposing a healthcheck endpoint that _actually_ runs the
ping and responds with a 200 OK makes a lot more sense in terms of
being able to run it inside something like Kubernetes. You do end up
with a "who makes the ping and who responds to it" type of scenario,
which can be tricky, though I'm sure we can figure that out.
2. I've found that newer releases of RabbitMQ really help with those
unusable queues after a split. I haven't had any issues at all with
newer releases, so that could be something to make your life a lot
easier.
3. You mentioned you're moving towards Kubernetes. We're doing the
same and building an operator:
https://opendev.org/vexxhost/openstack-operator
Because the operator manages the whole thing and Kubernetes does its
thing too, we started moving towards 1 (single) RabbitMQ per service,
which reaaaaaaally helped a lot in stabilizing things.  Oslo messaging
is a lot better at recovering when a single service IP is pointing
towards it because it doesn't do weird things like have threads trying
to connect to other Rabbit ports.  Just a thought.
4. In terms of telemetry and making sure you avoid that issue, we
track the consumption rates of queues inside OpenStack.  OpenStack
consumption rate should be constant and never growing; anytime it
grows, we instantly detect that something is fishy.  However, the
other issue is that when you restart any OpenStack service, it
'forgets' all its existing queues, and then you have a set of queues
building up until they automatically expire, which happens after around
30 minutes. That makes the "things are not being consumed" alarm
a little noisy if you're restarting services.
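
To make point 1 a bit more concrete, here is a minimal sketch of such a
healthcheck endpoint, using only the Python standard library. The
`/healthz` path, the port, and the `placeholder_ping` function are all
assumptions made for illustration; in a real deployment the ping callable
would issue the RPC ping and wait for the reply, which this sketch does
not attempt to do.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def make_handler(ping: Callable[[], bool]) -> type:
    """Build a handler whose /healthz path runs `ping` on every request."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/healthz":
                self.send_error(404)
                return
            try:
                healthy = ping()
            except Exception:
                # Any failure while pinging counts as unhealthy.
                healthy = False
            body = b"OK\n" if healthy else b"RPC ping failed\n"
            self.send_response(200 if healthy else 503)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args) -> None:
            # Keep probe traffic out of the service logs.
            pass

    return HealthHandler


def placeholder_ping() -> bool:
    # Hypothetical stand-in: a real check would send an RPC ping to the
    # agent and return True only if a reply arrives in time.
    return True


def serve(ping: Callable[[], bool], host: str = "127.0.0.1",
          port: int = 8080) -> None:
    """Serve the healthcheck until the process is stopped."""
    HTTPServer((host, port), make_handler(ping)).serve_forever()
```

Calling `serve(placeholder_ping)` would answer `GET /healthz` with 200 OK
while the ping succeeds and 503 otherwise, which is the shape a Kubernetes
HTTP liveness probe expects. It does not settle the "who makes the ping and
who responds to it" question.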

Sorry for the wall of super unorganized text, all over the place here
but thought I'd chime in with my 2 cents :)
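
For point 4, a rough sketch of the "queue is growing" check could look
like the following. It assumes you already collect periodic depth readings
per queue from your monitoring; the queue names, the sample values, and the
three-sample threshold are made up for illustration and are not what we
run in production.

```python
from typing import Dict, List, Sequence


def is_backlog_growing(depths: Sequence[int], min_samples: int = 3) -> bool:
    """Return True when the last `min_samples` readings strictly increase."""
    if len(depths) < min_samples:
        return False
    recent = depths[-min_samples:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def growing_queues(samples: Dict[str, Sequence[int]],
                   min_samples: int = 3) -> List[str]:
    """Return the names of queues whose backlog keeps growing, sorted."""
    return sorted(
        name for name, depths in samples.items()
        if is_backlog_growing(depths, min_samples)
    )


# Hypothetical readings: message counts per queue, oldest first.
example = {
    "queue-a": [0, 1, 0, 0],    # drained between samples: fine
    "queue-b": [2, 40, 95, 180],  # keeps growing: nobody is consuming
}
suspects = growing_queues(example)
```

Note that a queue orphaned by a service restart would also be flagged by
this check until it expires, so the noise described in point 4 is not
addressed here.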

On Mon, Jul 27, 2020 at 6:04 AM Arnaud Morin <arnaud.morin at gmail.com> wrote:
>
> Hey all,
>
> TLDR: I propose a change to oslo_messaging to allow doing a ping over RPC,
>       this is useful to monitor liveness of agents.
>
>
> A few weeks ago, I proposed a patch to oslo_messaging [1], which adds a
> ping endpoint to the RPC dispatcher.
> It means that every OpenStack service which uses oslo_messaging RPC
> endpoints (almost all OpenStack services and agents - e.g. neutron
> server + agents, nova + computes, etc.) will then be able to answer a
> specific "ping" call over RPC.
>
> I decided to propose this patch in my company mainly for 2 reasons:
> 1 - we are struggling to monitor our nova compute and neutron agents
>   correctly:
>
> 1.1 - sometimes our agents are disconnected from RPC, but the python process
> is still running.
> 1.2 - sometimes the agent is still connected, but the queue / binding on
> rabbit cluster is not working anymore (after a rabbit split for
> example). This one is very hard to debug, because the agent is still
> reporting health correctly on neutron server, but it's not able to
> receive messages anymore.
>
>
> 2 - we are trying to monitor agents running in k8s pods:
> when running a python agent (neutron l3-agent for example) in a k8s pod, we
> wanted to find a way to monitor whether it is still alive or not.
>
>
> Adding an RPC ping endpoint could help us solve both of these issues.
> Note that we still need an external mechanism (out of OpenStack) to do this
> ping.
> We also think it could be nice for other OpenStackers, and especially
> large scale ops.
>
> Feel free to comment.
>
>
> [1] https://review.opendev.org/#/c/735385/
>
>
> --
> Arnaud Morin
>
>


-- 
Mohammed Naser
VEXXHOST, Inc.


