[largescale-sig] RPC ping

Arnaud Morin arnaud.morin at gmail.com
Thu Aug 6 14:11:32 UTC 2020


Hi Mohammed,

1 - That's something we would also like, but it's beyond the patch I
propose.
I need this patch not only for Kubernetes, but also for monitoring my
legacy OpenStack agents running outside of k8s.
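For what it's worth, a minimal sketch of what such an RPC ping endpoint could look like (the names here are purely illustrative; the actual method and its wiring into the dispatcher are in the oslo_messaging patch [1] quoted below):

```python
# Illustrative sketch only: the real change ([1] below) adds the ping inside
# oslo_messaging's RPC dispatcher; class and method names here are made up.

class PingEndpoint(object):
    """A trivial RPC endpoint that answers a liveness ping."""

    def ping(self, ctxt, arg=None):
        # Echo the argument back so the caller can correlate request/response.
        return arg


def check_agent_alive(endpoint, token):
    """What an external monitor would do: send a ping, expect the token back."""
    return endpoint.ping(ctxt={}, arg=token) == token
```

The point is that the endpoint does no real work: a successful round trip proves the agent's RPC consumer is still attached to its queue, which is exactly what the "agent reports healthy but can't receive messages" failure mode hides.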

2 - Yes, the latest version of RabbitMQ is better on that point, but we
still see some weird issues (I will ask the community about them in
another thread).

3 - Thanks for this operator, we'll take a look!
By saying 1 rabbit per service, I understand 1 server, not 1 cluster,
right?
That sounds risky if you lose the server.

I suppose you don't do that for the database?

4 - Nice! How do you monitor those consumption rates? Using the RabbitMQ
management API?
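For context, the management API does expose per-queue message counts (GET /api/queues returns a JSON list with, among other fields, "name" and "messages"). A hedged sketch of flagging growing backlogs from two samples, with simplified payloads, might look like:

```python
import json

# Hypothetical sketch: poll the RabbitMQ management API (GET /api/queues)
# and flag queues whose backlog grew between two samples. Field names and
# thresholds should be verified against your RabbitMQ version.

def growing_queues(previous, current, threshold=0):
    """Return names of queues whose message backlog grew by more than threshold."""
    prev = {q["name"]: q.get("messages", 0) for q in previous}
    return [
        q["name"]
        for q in current
        if q.get("messages", 0) - prev.get(q["name"], 0) > threshold
    ]

# Example with two simplified samples, as the API would return them:
sample_t0 = json.loads('[{"name": "q-agent-notifier", "messages": 10}]')
sample_t1 = json.loads('[{"name": "q-agent-notifier", "messages": 250}]')
# growing_queues(sample_t0, sample_t1) -> ["q-agent-notifier"]
```

In practice the threshold would be tuned to tolerate the post-restart queue buildup Mohammed describes in point 4, so restarts don't trip the alarm.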

Cheers,

-- 
Arnaud Morin

On 03.08.20 - 10:21, Mohammed Naser wrote:
> I have a few operational suggestions on how I think we could do this best:
> 
> 1. I think exposing a healthcheck endpoint that _actually_ runs the
> ping and responds with a 200 OK makes a lot more sense in terms of
> being able to run it inside something like Kubernetes. You do end up
> with a "who makes the ping and who responds to it" type of scenario,
> which can be tricky, though I'm sure we can figure that out.
> 2. I've found that newer releases of RabbitMQ really help with those
> unusable queues after a split. I haven't had any issues at all with
> newer releases, so that could make your life a lot easier.
> 3. You mentioned you're moving towards Kubernetes, we're doing the
> same and building an operator:
> https://opendev.org/vexxhost/openstack-operator -- Because the
> operator manages the whole thing and Kubernetes does its thing too,
> we started moving towards 1 (single) rabbitmq per service, which
> reaaaaaaally helped a lot in stabilizing things.  Oslo messaging is a
> lot better at recovering when a single service IP is pointing towards
> it because it doesn't do weird things like have threads trying to
> connect to other Rabbit ports.  Just a thought.
> 4. In terms of telemetry and making sure you avoid that issue, we
> track the consumption rates of queues inside OpenStack.  OpenStack
> consumption rate should be constant and never growing, anytime it
> grows, we instantly detect that something is fishy.  However, the
> other issue comes in that when you restart any OpenStack service, it
> 'forgets' all its existing queues, and then you have a set of queues
> building up until they automatically expire, which happens around 30
> minutes-ish. That makes the "things are not being consumed" alarm a
> little noisy if you're restarting services.
> 
> Sorry for the wall of super unorganized text, all over the place here
> but thought I'd chime in with my 2 cents :)
> 
> On Mon, Jul 27, 2020 at 6:04 AM Arnaud Morin <arnaud.morin at gmail.com> wrote:
> >
> > Hey all,
> >
> > TLDR: I propose a change to oslo_messaging to allow doing a ping over RPC,
> >       this is useful to monitor liveness of agents.
> >
> >
> > Few weeks ago, I proposed a patch to oslo_messaging [1], which is adding a
> > ping endpoint to RPC dispatcher.
> > It means that every OpenStack service which is using oslo_messaging RPC
> > endpoints (almost all OpenStack services and agents - e.g. neutron
> > server + agents, nova + computes, etc.) will then be able to answer a
> > specific "ping" call over RPC.
> >
> > I decided to propose this patch in my company mainly for 2 reasons:
> > 1 - we are struggling to monitor our nova-compute and neutron agents
> >   correctly:
> >
> > 1.1 - sometimes our agents are disconnected from RPC, but the python process
> > is still running.
> > 1.2 - sometimes the agent is still connected, but the queue / binding on
> > rabbit cluster is not working anymore (after a rabbit split for
> > example). This one is very hard to debug, because the agent is still
> > reporting health correctly on neutron server, but it's not able to
> > receive messages anymore.
> >
> >
> > 2 - we are trying to monitor agents running in k8s pods:
> > when running a python agent (neutron l3-agent for example) in a k8s pod, we
> > wanted to find a way to monitor whether it is still alive or not.
> >
> >
> > Adding an RPC ping endpoint could help us solve both of these issues.
> > Note that we still need an external mechanism (out of OpenStack) to do this
> > ping.
> > We also think it could be nice for other OpenStackers, and especially
> > large scale ops.
> >
> > Feel free to comment.
> >
> >
> > [1] https://review.opendev.org/#/c/735385/
> >
> >
> > --
> > Arnaud Morin
> >
> >
> 
> 
> -- 
> Mohammed Naser
> VEXXHOST, Inc.
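As an aside on point 1 in the mail above (wrapping the ping in an HTTP healthcheck for Kubernetes), a rough stdlib-only sketch; everything here is hypothetical, and rpc_ping is a stand-in for whatever actually performs the ping over oslo.messaging:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical sketch: expose the RPC ping as an HTTP healthcheck so a
# Kubernetes livenessProbe can consume it. "rpc_ping" is a placeholder for
# the real ping call; oslo.messaging does not provide this HTTP layer.

def health_status(rpc_ping):
    """Map a ping callable to an HTTP status: 200 if it succeeds, else 503."""
    try:
        return 200 if rpc_ping() else 503
    except Exception:
        return 503


class HealthHandler(BaseHTTPRequestHandler):
    rpc_ping = staticmethod(lambda: True)  # replaced with the real ping

    def do_GET(self):
        self.send_response(health_status(self.rpc_ping))
        self.end_headers()

# A livenessProbe would then simply GET this endpoint, e.g.:
#   HTTPServer(("", 8080), HealthHandler).serve_forever()
```

This sidesteps the "who makes the ping and who responds to it" question by making the agent's pod ping itself, at the cost of not exercising the full broker path from an external caller.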
