[largescale-sig][nova][neutron][oslo] RPC ping

Arnaud Morin arnaud.morin at gmail.com
Thu Aug 20 15:35:03 UTC 2020


Hey all,

TLDR:
- Patch in [1] updated
- Example of usage in [3]
- Agree with fixing nova/rabbit/oslo but would like to keep this ping
  endpoint also
- Totally agree that documentation is needed

Long:

Thank you all for your review and for the great information you bring to
that topic!

First thing, we are not yet using that patch in production, but in
testing/dev only for now (at OVH).
But the plan is to use it in production ASAP.

Also, we initially pushed that for the neutron agent, which is why I
missed the fact that nova already uses the "ping" endpoint, sorry for
that.

Anyway, I don't care about the naming, so in the latest patchset of [1],
you will see that I changed the name of the endpoint following Ken
Giusti's suggestions.

The bug reported in [2] looks very similar to what we saw.
Thank you, Sean, for bringing that to our attention in this thread.

To detect this error, using the above "ping" endpoint in oslo, we can
use a script like the one in [3] (sorry about it, I can write better
python :p).
As mentioned by Sean in a previous mail, I am effectively calling the
topic "compute.host123456.sbg5.cloud.ovh.net" in the "nova" exchange.
My initial plan would be to identify the topics related to a compute and
ping each of them, to make sure that all of them are answering.
I am not yet sure how often to do this, or whether this is a good plan,
btw.
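
For illustration, a stripped-down version of such an external probe
could look like the sketch below. The transport URL, the compute
hostname and the RPC method name are placeholders (the final method
name depends on what gets merged in [1]); the real script is the one
in [3].

    # Hypothetical external probe, assuming the rabbit driver and the
    # ping endpoint proposed in [1] (method name is a placeholder).
    import sys

    from oslo_config import cfg
    import oslo_messaging

    TRANSPORT_URL = "rabbit://user:password@rabbit-host:5672/"  # placeholder
    COMPUTE_HOST = "host123456.sbg5.cloud.ovh.net"              # placeholder

    def ping_compute():
        transport = oslo_messaging.get_rpc_transport(cfg.CONF,
                                                     url=TRANSPORT_URL)
        # Routes to the "compute.<host>" topic queue in the "nova" exchange.
        target = oslo_messaging.Target(exchange="nova",
                                       topic="compute",
                                       server=COMPUTE_HOST)
        client = oslo_messaging.RPCClient(transport, target, timeout=10)
        try:
            # Method name as proposed in [1]; adjust to the merged name.
            client.call({}, "oslo_rpc_server_ping")
        except oslo_messaging.MessagingTimeout:
            print("ping timed out for %s" % COMPUTE_HOST)
            return 1
        print("pong from %s" % COMPUTE_HOST)
        return 0

    if __name__ == "__main__":
        sys.exit(ping_compute())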

Anyway, the compute is reporting its status as UP, but the ping is
timing out, which is exactly what I wanted to detect!

I mostly agree with all your comments about the fact that this is a
trick that we do as operators, and using the RPC bus may not be the
best approach, but it is pragmatic and quite simple IMHO.
What I also like in this solution is the fact that it is partially
outside of OpenStack: the endpoint is inside, but doing the ping is
external.
Monitoring OpenStack is not always easy, and sometimes we struggle to
find the root cause of some issues. Having such an endpoint allows us
to monitor OpenStack from an external point of view, but still in a
deeper way.
It's like a probe in your car telling you that even though you are
still moving, your engine is off :)

Still, making sure that this bug is fixed by doing some work on
(rabbit|oslo.messaging|nova|whatever) is the best thing to do.

However, IMO, this does not prevent this rpc ping endpoint from
existing.

Last, but not least, I totally agree about documenting this, but also
about adding some documentation on how to configure rabbit and
OpenStack services in a way that fits operator needs.
There are plenty of parameters which can be tweaked on both the
OpenStack and rabbit sides. IMO, we need to explain a little bit more
what the impact of setting a specific parameter to a given value is.
For example, in another discussion ([4]), we were talking about
"durable" queues in rabbit. We managed to figure out that if we enable
HA, we should also enable durability of queues.
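
As a concrete illustration of the kind of documentation I mean (to be
double-checked against your oslo.messaging and RabbitMQ versions; this
is just a sketch and not taken from [4]):

    # oslo.messaging side (e.g. nova.conf / neutron.conf), rabbit driver:
    [oslo_messaging_rabbit]
    amqp_durable_queues = true

    # RabbitMQ side, an example classic mirrored-queue ("HA") policy:
    #   rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'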

Anyway, that's another topic, and it is also something we discuss in
the large-scale group.

Thank you all,

[1] https://review.opendev.org/#/c/735385/
[2] https://bugs.launchpad.net/nova/+bug/1854992
[3] http://paste.openstack.org/show/796990/
[4] http://lists.openstack.org/pipermail/openstack-discuss/2020-August/016362.html


-- 
Arnaud Morin

On 13.08.20 - 17:17, Ken Giusti wrote:
> On Thu, Aug 13, 2020 at 12:30 PM Ben Nemec <openstack at nemebean.com> wrote:
> 
> >
> >
> > On 8/13/20 11:07 AM, Sean Mooney wrote:
> > >>   I think it's probably
> > >> better to provide a well-defined endpoint for them to talk to rather
> > >> than have everyone implement their own slightly different RPC ping
> > >> mechanism. The docs for this feature should be very explicit that this
> > >> is the only thing external code should be calling.
> > > ya i think that is a good approach.
> > > i would still prefer if people used, say, middleware to add a service
> > > ping admin api endpoint instead of directly calling the rpc endpoint,
> > > to avoid exposing rabbitmq, but that is out of scope of this
> > > discussion.
> >
> > Completely agree. In the long run I would like to see this replaced with
> > better integrated healthchecking in OpenStack, but we've been talking
> > about that for years and have made minimal progress.
> >
> > >
> > >>
> > >>>
> > >>> so if this does actually detect something we can otherwise detect,
> > >>> and the use case involves using it within the openstack services,
> > >>> not from an external source, then i think that is fine, but we
> > >>> probably need to use another name (alive? status?) or otherwise
> > >>> modify nova so that there is no conflict.
> > >>>>
> > >>
> > >> If I understand your analysis of the bug correctly, this would have
> > >> caught that type of outage after all since the failure was asymmetric.
> > > am, i'm not sure.
> > > it might, yes. looking at https://review.opendev.org/#/c/735385/6
> > > it's not clear to me how the endpoint is invoked. is it doing a topic
> > > send or a direct send?
> > > to detect the failure you would need to invoke a ping on the compute
> > > service, and that ping would have to be sent to the nova topic
> > > exchange with a routing key of compute.<compute node hostname>
> > >
> > > if the compute topic queue was broken, either because it was no longer
> > > bound to the correct topic or due to some other rabbitmq error, then
> > > you would either get a message undeliverable error of some kind with
> > > the mandatory flag, or likely a timeout without the mandatory flag.
> > > so if the ping were routed using a topic to
> > > compute.<compute node hostname> then yes, it would find this.
> > >
> > > although we can also detect this ourselves and fix it using the
> > > mandatory flag, i think, by just recreating the queue when it exists
> > > but we get an undeliverable message. at least i think we can; rabbit
> > > is not my main area of expertise, so it would be nice if someone that
> > > knows more about it could weigh in on that.
> >
> > I pinged Ken this morning to take a look at that. He should be able to
> > tell us whether it's a good idea or crazy talk. :-)
> >
> 
> Like I can tell the difference between crazy and good ideas.  Ben, I
> thought you knew me better. ;)
> 
> As discussed, you can enable the mandatory flag on a per-RPCClient
> instance basis, for example:
> 
>     _topts = oslo_messaging.TransportOptions(at_least_once=True)
>     client = oslo_messaging.RPCClient(self.transport,
>                                       self.target,
>                                       timeout=conf.timeout,
>                                       version_cap=conf.target_version,
>                                       transport_options=_topts).prepare()
> 
> This will cause an rpc call/cast to fail if rabbitmq cannot find a queue
> for the rpc request message [note the difference between 'queuing the
> message' and 'having the message consumed' - the mandatory flag has nothing
> to do with whether or not the message is eventually consumed].
> 
> Keep in mind that there may be some cases where having no active consumers
> is ok and you do not want to get a delivery failure exception -
> specifically fanout or perhaps cast. It depends on the use case. If
> there is a fanout use case that fails or degrades when not all of the
> present services get a message, then the mandatory flag will not
> detect an error if only a subset of the bindings is lost.
> 
> My biggest concern with this type of failure (lost binding) is that
> apparently the consumer is none the wiser when it happens.  Without some
> sort of event issued by rabbitmq the RPC server cannot detect this problem
> and take corrective actions (or at least I cannot think of any ATM).
> 
> 
> -- 
> Ken Giusti  (kgiusti at gmail.com)


