On Thu, Aug 6, 2020 at 10:11 AM Arnaud Morin <arnaud.morin@gmail.com> wrote:
Hi Mohammed,
1 - That's something we would also like, but it's beyond the patch I propose. I need this patch not only for kubernetes, but also for monitoring my legacy openstack agents running outside of k8s.
2 - Yes, the latest version of rabbitmq is better on that point, but we still see some weird issues (I will ask the community about them in another thread).
3 - Thanks for this operator, we'll take a look! By "1 rabbit per service", you mean 1 server, not 1 cluster, right? That sounds risky if you lose the server.
The controllers are pretty stable and if a controller dies, Kubernetes will take care of restarting the pod somewhere else and everything will reconnect and things will be happy again.
I suppose you don't do that for the database?
One database cluster per service, with 'old-school' replication because no one really does true multimaster in Galera with OpenStack anyways.
4 - Nice, how do you monitor that consumption? Using the rabbit management API?
Prometheus RabbitMQ exporter, now migrating to the native one shipping in the new RabbitMQ releases.
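For illustration, a minimal sketch of that kind of queue-growth check against the RabbitMQ management API (the URL, credentials, vhost and thresholds below are assumptions, not values from either deployment):

import time
import requests

# Assumed values: management plugin on localhost:15672, default credentials,
# default "/" vhost (%2F); adjust for the real deployment.
QUEUES_URL = "http://localhost:15672/api/queues/%2F"
AUTH = ("guest", "guest")

def queue_depths():
    """Return {queue_name: ready message count} from the management API."""
    resp = requests.get(QUEUES_URL, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return {q["name"]: q.get("messages_ready", 0) for q in resp.json()}

def growing_queues(interval=60, samples=5, threshold=100):
    """Flag queues whose backlog never shrinks across consecutive samples."""
    history = [queue_depths()]
    for _ in range(samples - 1):
        time.sleep(interval)
        history.append(queue_depths())
    flagged = []
    for name in history[-1]:
        depths = [h.get(name, 0) for h in history]
        only_grows = all(b >= a for a, b in zip(depths, depths[1:]))
        if only_grows and depths[-1] - depths[0] >= threshold:
            flagged.append((name, depths[0], depths[-1]))
    return flagged

if __name__ == "__main__":
    for name, first, last in growing_queues():
        print("backlog growing on %s: %d -> %d" % (name, first, last))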
Cheers,
-- Arnaud Morin
On 03.08.20 - 10:21, Mohammed Naser wrote:
I have a few operational suggestions on how I think we could do this best:
1. I think exposing a healthcheck endpoint that _actually_ runs the ping and responds with a 200 OK makes a lot more sense in terms of being able to run it inside something like Kubernetes. You end up with a "who makes the ping and who responds to it" type of scenario, which can be tricky, though I'm sure we can figure that out (see the sketch after this list).

2. I've found that newer releases of RabbitMQ really help with those unusable queues after a split. I haven't had any issues at all with newer releases, so that could be something to make your life a lot easier.

3. You mentioned you're moving towards Kubernetes; we're doing the same and building an operator: https://opendev.org/vexxhost/openstack-operator -- Because the operator manages the whole thing and Kubernetes does its thing too, we started moving towards 1 (single) rabbitmq per service, which really helped a lot in stabilizing things. Oslo messaging is a lot better at recovering when a single service IP is pointing towards it, because it doesn't do weird things like have threads trying to connect to other Rabbit ports. Just a thought.

4. In terms of telemetry and making sure you avoid that issue, we track the consumption rates of queues inside OpenStack. The OpenStack consumption rate should be constant and never growing; any time it grows, we instantly detect that something is fishy. However, the other issue is that when you restart any OpenStack service, it 'forgets' all its existing queues, and then you have a set of queues building up until they automatically expire, which happens around 30 minutes-ish. That makes the "things are not being consumed" alarm a little noisy if you're restarting services.
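To make idea 1 a bit more concrete, here is a minimal sketch of such a healthcheck endpoint. It assumes the proposed dispatcher ping is reachable as an RPC method named "ping" (the exact name depends on the patch), and the transport URL, topic and server values are made up:

from http.server import BaseHTTPRequestHandler, HTTPServer

import oslo_messaging
from oslo_config import cfg

# All of these values are assumptions for the sake of the example.
TRANSPORT_URL = "rabbit://user:password@rabbit.example.com:5672/"
TARGET = oslo_messaging.Target(topic="l3_agent", server="network-node-1")

cfg.CONF([], project="rpc-healthcheck")  # no config file needed for this sketch


def rpc_ping(timeout=10):
    """Return True if the agent behind TARGET answers the RPC ping."""
    transport = oslo_messaging.get_rpc_transport(cfg.CONF, url=TRANSPORT_URL)
    client = oslo_messaging.RPCClient(transport, TARGET)
    try:
        client.prepare(timeout=timeout).call({}, "ping")
        return True
    except oslo_messaging.MessagingException:
        return False


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 200 when the ping round-trips, 503 otherwise, so Kubernetes can use
        # this as a liveness/readiness probe for the pod running the agent.
        self.send_response(200 if rpc_ping() else 503)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()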
Sorry for the wall of super unorganized text, all over the place here but thought I'd chime in with my 2 cents :)
On Mon, Jul 27, 2020 at 6:04 AM Arnaud Morin <arnaud.morin@gmail.com> wrote:
Hey all,
TLDR: I propose a change to oslo_messaging to allow doing a ping over RPC; this is useful to monitor the liveness of agents.
A few weeks ago, I proposed a patch to oslo_messaging [1] which adds a ping endpoint to the RPC dispatcher. It means that every openstack service which is using oslo_messaging RPC endpoints (almost all OpenStack services and agents - e.g. neutron server + agents, nova + computes, etc.) will then be able to answer a specific "ping" call over RPC.
I decided to propose this patch in my company mainly for 2 reasons:

1 - We are struggling to monitor our nova compute and neutron agents in a correct way:
1.1 - Sometimes our agents are disconnected from RPC, but the python process is still running.
1.2 - Sometimes the agent is still connected, but the queue / binding on the rabbit cluster is not working anymore (after a rabbit split, for example). This one is very hard to debug, because the agent still reports health correctly to the neutron server, but it is not able to receive messages anymore.

2 - We are trying to monitor agents running in k8s pods: when running a python agent (the neutron l3-agent, for example) in a k8s pod, we wanted a way to monitor whether it is still alive or not.
Adding an RPC ping endpoint could help us solve both of these issues. Note that we still need an external mechanism (out of OpenStack) to do this ping. We also think it could be nice for other OpenStackers, and especially large scale ops.
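To illustrate what that external mechanism could look like, here is a rough sketch of a one-shot checker suitable for cron or a monitoring plugin. It assumes the endpoint answers an RPC method named "ping" (the actual name is whatever the patch defines); the transport URL, topic and server are passed in as example arguments:

import sys

import oslo_messaging
from oslo_config import cfg


def check(transport_url, topic, server):
    """Ping one agent over RPC; return 0 if it answers, 1 otherwise."""
    cfg.CONF([], project="rpc-ping-check")
    transport = oslo_messaging.get_rpc_transport(cfg.CONF, url=transport_url)
    target = oslo_messaging.Target(topic=topic, server=server)
    client = oslo_messaging.RPCClient(transport, target)
    try:
        client.prepare(timeout=10).call({}, "ping")
        print("OK: %s@%s answered the RPC ping" % (topic, server))
        return 0
    except oslo_messaging.MessagingTimeout:
        print("CRITICAL: %s@%s did not answer the RPC ping" % (topic, server))
        return 1


if __name__ == "__main__":
    # e.g. python rpc_ping_check.py rabbit://user:pass@host:5672/ l3_agent network-node-1
    sys.exit(check(*sys.argv[1:4]))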
Feel free to comment.
[1] https://review.opendev.org/#/c/735385/
-- Arnaud Morin
-- Mohammed Naser VEXXHOST, Inc.