Hey all,

TLDR:
- Patch in [1] updated
- Example of usage in [3]
- Agree with fixing nova/rabbit/oslo, but would like to keep this ping endpoint as well
- Totally agree that documentation is needed

Long:

Thank you all for your review and for the great information you bring to this topic!

First thing: we are not yet using that patch in production, but only in testing/dev for now (at OVH). The plan, though, is to use it in production ASAP.

Also, we initially pushed this for the neutron agent, which is why I missed the fact that nova already used the "ping" endpoint, sorry for that. Anyway, I don't care about the naming, so in the latest patchset of [1] you will see that I changed the name of the endpoint following Ken Giusti's suggestions.

The bug reported in [2] looks very similar to what we saw. Thank you, Sean, for bringing it to attention in this thread.

To detect this error, using the above "ping" endpoint in oslo, we can use a script like the one in [3] (sorry about it, I can write better python :p) - a stripped-down sketch is included after the references below. As mentioned by Sean in a previous mail, I am effectively calling the topic "compute.host123456.sbg5.cloud.ovh.net" in the "nova" exchange. My initial plan is to identify the topics related to a compute and ping all of them, to make sure that all of them are answering. I am not yet sure how often to do this, or whether it is a good plan at all, btw. Anyway, the compute is reporting its status as UP, but the ping is timing out, which is exactly what I wanted to detect!

I mostly agree with all your comments about the fact that this is a trick that we do as operators, and using the RPC bus is maybe not the best approach, but it is pragmatic and quite simple IMHO.

What I also like in this solution is that it is partially outside of OpenStack: the endpoint is inside, but doing the ping is external. Monitoring OpenStack is not always easy, and sometimes we struggle to find the root cause of some issues. Having such an endpoint allows us to monitor OpenStack from an external point of view, but still in a deeper way. It's like a probe in your car telling you that even if you are still moving, your engine is off :)

Still, making sure that this bug is fixed by doing some work on (rabbit|oslo.messaging|nova|whatever) is the best thing to do. However, IMO, that does not prevent this rpc ping endpoint from existing.

Last, but not least, I totally agree about documenting this, but also about adding some documentation on how to configure rabbit and OpenStack services in a way that fits operator needs. There are plenty of parameters which can be tweaked on both the OpenStack and rabbit sides. IMO, we need to explain a little bit more what the impact of setting a specific parameter to a given value is. For example, in another discussion ([4]), we were talking about "durable" queues in rabbit. We managed to find out that if we enable HA, we should also enable durability of queues. Anyway, that's another topic, and it is also something we discuss in the large-scale group.

Thank you all,

[1] https://review.opendev.org/#/c/735385/
[2] https://bugs.launchpad.net/nova/+bug/1854992
[3] http://paste.openstack.org/show/796990/
[4] http://lists.openstack.org/pipermail/openstack-discuss/2020-August/016362.ht...
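For illustration, here is roughly what such an external check could look like (a sketch only, not the exact script from [3]: the method name "oslo_rpc_server_ping", the transport URL handling and the topic/server values are assumptions, and the final endpoint name is whatever lands in [1]):

#!/usr/bin/env python3
# Rough sketch: ping one compute's topic queue through the RPC bus and
# alert when the service looks UP but the queue no longer answers.
import sys

from oslo_config import cfg
import oslo_messaging

def check_compute(host, transport_url):
    transport = oslo_messaging.get_rpc_transport(cfg.CONF, url=transport_url)
    # Topic send on the "nova" exchange, routed with key compute.<host>
    target = oslo_messaging.Target(exchange="nova", topic="compute", server=host)
    client = oslo_messaging.RPCClient(transport, target, timeout=10)
    try:
        # Method name is an assumption, see [1] for the final name
        client.call({}, "oslo_rpc_server_ping")
        print("%s: pong" % host)
        return 0
    except oslo_messaging.MessagingTimeout:
        print("%s: ping timed out" % host)
        return 2

if __name__ == "__main__":
    sys.exit(check_compute(sys.argv[1], sys.argv[2]))

--
Arnaud Morin

On 13.08.20 - 17:17, Ken Giusti wrote: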
On Thu, Aug 13, 2020 at 12:30 PM Ben Nemec <openstack@nemebean.com> wrote:
On 8/13/20 11:07 AM, Sean Mooney wrote:
I think it's probably better to provide a well-defined endpoint for them to talk to rather than have everyone implement their own slightly different RPC ping mechanism. The docs for this feature should be very explicit that this is the only thing external code should be calling.

ya i think that is a good approach. i would still prefer if people used, say, middleware to add a service ping admin api endpoint instead of directly calling the rpc endpoint, to avoid exposing rabbitmq, but that is out of scope of this discussion.
Completely agree. In the long run I would like to see this replaced with better integrated healthchecking in OpenStack, but we've been talking about that for years and have made minimal progress.
so if this does actually detect something we can't otherwise detect, and the use case involves using it within the openstack services, not from an external source, then i think that is fine, but we probably need to use another name (alive? status?) or otherwise modify nova so that there is no conflict.
If I understand your analysis of the bug correctly, this would have caught that type of outage after all since the failure was asymmetric.

i'm not sure, it might, yes. looking at https://review.opendev.org/#/c/735385/6 it's not clear to me how the endpoint is invoked. is it doing a topic send or a direct send? to detect the failure you would need to invoke a ping on the compute service, and that ping would have to be routed to the nova topic exchange with a routing key of compute.<compute node hostname>.
if the compute topic queue was broken, either because it was no longer bound to the correct topic or due to some other rabbitmq error, then you would either get a message-undeliverable error of some kind with the mandatory flag, or likely a timeout without the mandatory flag. so if the ping is routed using a topic of compute.<compute node hostname>, then yes, it would find this.
although we can also detect this ourselves and fix it using the mandatory flag, i think, by just recreating the queue when it exists but we get an undeliverable message. at least i think we can; rabbit is not my main area of expertise, so it would be nice if someone who knows more about it could weigh in on that.
I pinged Ken this morning to take a look at that. He should be able to tell us whether it's a good idea or crazy talk. :-)
Like I can tell the difference between crazy and good ideas. Ben I thought you knew me better. ;)
As discussed, you can enable the mandatory flag on a per-RPCClient-instance basis, for example:
_topts = oslo_messaging.TransportOptions(at_least_once=True)
client = oslo_messaging.RPCClient(self.transport,
                                  self.target,
                                  timeout=conf.timeout,
                                  version_cap=conf.target_version,
                                  transport_options=_topts).prepare()
This will cause an rpc call/cast to fail if rabbitmq cannot find a queue for the rpc request message [note the difference between 'queuing the message' and 'having the message consumed' - the mandatory flag has nothing to do with whether or not the message is eventually consumed].
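For instance, a caller that wants to treat "no queue bound for the routing key" as a hard error could do something like this (a rough sketch, not code from the patch; the 'ping' method name is a stand-in, and MessageUndeliverable is my reading of the current rabbit driver behaviour - double check against your oslo.messaging version):

import oslo_messaging
from oslo_messaging import exceptions as om_exc

def ping_compute(transport, target, ctxt):
    # With at_least_once=True the rabbit driver sets the mandatory flag,
    # so an unroutable request surfaces as an exception instead of a
    # silent drop followed by a timeout.
    topts = oslo_messaging.TransportOptions(at_least_once=True)
    client = oslo_messaging.RPCClient(transport, target,
                                      timeout=10,
                                      transport_options=topts).prepare()
    try:
        return client.call(ctxt, 'ping')
    except om_exc.MessageUndeliverable:
        # No queue is bound for this routing key: the binding is gone even
        # though the service may still be reporting itself as up.
        return None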
Keep in mind that there may be some cases where having no active consumers is OK and you do not want to get a delivery failure exception - specifically fanout, or perhaps cast; it depends on the use case. Also note that if a fanout use case fails or degrades when not every service gets the message, the mandatory flag will not detect an error as long as at least one binding is still in place - it cannot catch the loss of only a subset of the bindings.
My biggest concern with this type of failure (lost binding) is that apparently the consumer is none the wiser when it happens. Without some sort of event issued by rabbitmq the RPC server cannot detect this problem and take corrective actions (or at least I cannot think of any ATM).
-- Ken Giusti (kgiusti@gmail.com)