[largescale-sig] RPC ping
Mohammed Naser
mnaser at vexxhost.com
Wed Aug 12 14:22:53 UTC 2020
On Thu, Aug 6, 2020 at 10:11 AM Arnaud Morin <arnaud.morin at gmail.com> wrote:
>
> Hi Mohammed,
>
> 1 - That's something we would also like, but it's beyond the patch I
> propose.
> I need this patch not only for Kubernetes, but also for monitoring my
> legacy OpenStack agents running outside of k8s.
>
> 2 - Yes, the latest version of RabbitMQ is better on that point, but we
> still see some weird issues (I will ask the community about them in
> another topic).
>
> 3 - Thanks for this operator, we'll take a look!
> By saying 1 rabbit per service, I understand 1 server, not 1 cluster,
> right?
> That sounds risky if you lose the server.
The controllers are pretty stable, and if a controller dies, Kubernetes
will take care of restarting the pod somewhere else; everything
reconnects and things are happy again.
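
That restart loop is exactly where the healthcheck endpoint from point 1
below would plug in: a tiny HTTP sidecar that runs the RPC ping and answers
200 OK only on success, so Kubernetes can use it as a liveness probe. A
rough sketch, not a real implementation -- `do_rpc_ping()` here is a
placeholder for whatever wraps the actual oslo.messaging ping call, and the
`/healthz` path and port are arbitrary:

```python
# Sketch of the healthcheck idea: an HTTP sidecar answering 200 OK only
# when the RPC ping succeeds, usable as a Kubernetes liveness probe.
from http.server import BaseHTTPRequestHandler, HTTPServer


def do_rpc_ping():
    """Placeholder: call the agent's RPC ping, return True on a pong."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        try:
            ok = do_rpc_ping()
        except Exception:
            ok = False  # any RPC failure/timeout means "not live"
        self.send_response(200 if ok else 503)
        self.end_headers()
        self.wfile.write(b"OK" if ok else b"RPC ping failed")

    def log_message(self, *args):
        pass  # keep probe traffic out of the logs


# To run as a sidecar:
# HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```

The probe in the pod spec then just GETs /healthz, and Kubernetes restarts
the pod when it starts returning 503.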
> I suppose you don't do that for the database?
One database cluster per service, with 'old-school' replication,
because no one really does true multi-master in Galera with OpenStack
anyway.
> 4 - Nice, how do you monitor that consumption? Using the rabbit management
> API?
Prometheus RabbitMQ exporter, now migrating to the native one shipping
in the new RabbitMQ releases.
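
The core of that check is simple enough to sketch. The field names below
match what the RabbitMQ management API returns from GET /api/queues
(`messages` and `message_stats.deliver_get_details.rate`); the backlog and
rate thresholds are made-up examples, not our production values:

```python
# Sketch of the "queues are growing but nothing consumes them" alarm,
# fed by the JSON list that GET /api/queues on the RabbitMQ management
# API returns. Thresholds are illustrative only.

def find_stuck_queues(queues, min_backlog=100, min_rate=0.01):
    """Return names of queues with a backlog but an (almost) zero
    delivery rate -- the signature of a dead binding or consumer."""
    stuck = []
    for q in queues:
        backlog = q.get("messages", 0)
        rate = (q.get("message_stats", {})
                 .get("deliver_get_details", {})
                 .get("rate", 0.0))
        if backlog >= min_backlog and rate < min_rate:
            stuck.append(q["name"])
    return stuck


# Fetching the data would look roughly like (hypothetical host/creds):
# import json, urllib.request
# req = urllib.request.Request("http://rabbit:15672/api/queues")
# ... add basic auth, then: queues = json.load(urllib.request.urlopen(req))
```

A queue with a real backlog and a near-zero delivery rate is what we alarm
on; the exporter exposes the same two numbers as Prometheus metrics.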
> Cheers,
>
> --
> Arnaud Morin
>
> On 03.08.20 - 10:21, Mohammed Naser wrote:
> > I have a few operational suggestions on how I think we could do this best:
> >
> > 1. I think exposing a healthcheck endpoint that _actually_ runs the
> > ping and responds with a 200 OK makes a lot more sense in terms of
> > being able to run it inside something like Kubernetes. You do end up
> > with a "who makes the ping and who responds to it" type of scenario,
> > which can be tricky, though I'm sure we can figure that out.
> > 2. I've found that newer releases of RabbitMQ really help with those
> > unusable queues after a split; I haven't had any issues at all with
> > newer releases, so upgrading could be something to make your life a
> > lot easier.
> > 3. You mentioned you're moving towards Kubernetes; we're doing the
> > same and building an operator:
> > https://opendev.org/vexxhost/openstack-operator -- Because the
> > operator manages the whole thing and Kubernetes does its thing too,
> > we started moving towards 1 (single) RabbitMQ per service, which
> > really helped a lot in stabilizing things. Oslo messaging is a
> > lot better at recovering when a single service IP is pointing towards
> > it, because it doesn't do weird things like have threads trying to
> > connect to other Rabbit ports. Just a thought.
> > 4. In terms of telemetry and making sure you avoid that issue, we
> > track the consumption rates of queues inside OpenStack. The queue
> > backlog should stay constant and never grow; anytime it
> > grows, we instantly detect that something is fishy. However, the
> > other issue is that when you restart any OpenStack service, it
> > 'forgets' all its existing queues, and then you have a set of queues
> > building up until they automatically expire, which happens around 30
> > minutes later, so it makes that "things are not being consumed" alarm
> > a little noisy if you're restarting services.
> >
> > Sorry for the wall of super unorganized text, all over the place here
> > but thought I'd chime in with my 2 cents :)
> >
> > On Mon, Jul 27, 2020 at 6:04 AM Arnaud Morin <arnaud.morin at gmail.com> wrote:
> > >
> > > Hey all,
> > >
> > > TLDR: I propose a change to oslo_messaging to allow doing a ping over RPC,
> > > this is useful to monitor liveness of agents.
> > >
> > >
> > > A few weeks ago, I proposed a patch to oslo_messaging [1], which adds a
> > > ping endpoint to the RPC dispatcher.
> > > It means that every OpenStack service which is using oslo_messaging RPC
> > > endpoints (almost all OpenStack services and agents - e.g. neutron
> > > server + agents, nova + computes, etc.) will then be able to answer a
> > > specific "ping" call over RPC.
> > >
> > > I decided to propose this patch in my company mainly for 2 reasons:
> > > 1 - we are struggling to monitor our nova compute and neutron agents
> > > correctly:
> > >
> > > 1.1 - sometimes our agents are disconnected from RPC, but the python process
> > > is still running.
> > > 1.2 - sometimes the agent is still connected, but the queue / binding on
> > > the rabbit cluster is not working anymore (after a rabbit split, for
> > > example). This one is very hard to debug, because the agent is still
> > > reporting health correctly to the neutron server, but it's not able to
> > > receive messages anymore.
> > >
> > >
> > > 2 - we are trying to monitor agents running in k8s pods:
> > > when running a python agent (neutron l3-agent for example) in a k8s pod, we
> > > wanted to find a way to monitor whether it is still alive or not.
> > >
> > >
> > > Adding an RPC ping endpoint could help us solve both of these issues.
> > > Note that we still need an external mechanism (out of OpenStack) to do this
> > > ping.
> > > We also think it could be nice for other OpenStackers, and especially
> > > large scale ops.
> > >
> > > Feel free to comment.
> > >
> > >
> > > [1] https://review.opendev.org/#/c/735385/
> > >
> > >
> > > --
> > > Arnaud Morin
> > >
> > >
> >
> >
> > --
> > Mohammed Naser
> > VEXXHOST, Inc.
--
Mohammed Naser
VEXXHOST, Inc.