[largescale-sig][nova][neutron][oslo] RPC ping

Ken Giusti kgiusti at gmail.com
Thu Aug 13 21:17:51 UTC 2020


On Thu, Aug 13, 2020 at 12:30 PM Ben Nemec <openstack at nemebean.com> wrote:

>
>
> On 8/13/20 11:07 AM, Sean Mooney wrote:
> >>   I think it's probably
> >> better to provide a well-defined endpoint for them to talk to rather
> >> than have everyone implement their own slightly different RPC ping
> >> mechanism. The docs for this feature should be very explicit that this
> >> is the only thing external code should be calling.
> > Ya, I think that is a good approach.
> > I would still prefer if people used, say, middleware to add a service
> > ping admin API endpoint instead of directly calling the RPC endpoint, to
> > avoid exposing rabbitmq, but that is out of scope of this discussion.
>
> Completely agree. In the long run I would like to see this replaced with
> better integrated healthchecking in OpenStack, but we've been talking
> about that for years and have made minimal progress.
>
> >
> >>
> >>>
> >>> So if this does actually detect something we can otherwise detect, and
> >>> the use case involves using it within the OpenStack services, not from
> >>> an external source, then I think that is fine, but we probably need to
> >>> use another name (alive? status?) or otherwise modify nova so that
> >>> there is no conflict.
> >>>>
> >>
> >> If I understand your analysis of the bug correctly, this would have
> >> caught that type of outage after all since the failure was asymmetric.
> > Am, I'm not sure. It might, yes. Looking at
> > https://review.opendev.org/#/c/735385/6 it's not clear to me how the
> > endpoint is invoked: is it doing a topic send or a direct send?
> > To detect the failure you would need to invoke a ping on the compute
> > service, and that ping would have to be enqueued on the nova topic
> > exchange with a routing key of compute.<compute node hostname>.
> >
> > If the compute topic queue was broken, either because it was no longer
> > bound to the correct topic or due to some other rabbitmq error, then you
> > would either get a message-undeliverable error of some kind with the
> > mandatory flag, or likely a timeout without the mandatory flag. So if
> > the ping were routed using a topic of compute.<compute node hostname>,
> > then yes, it would find this.
> >
> > Although we can also detect this ourselves and fix it using the
> > mandatory flag, I think, by just recreating the queue when it exists but
> > we get an undeliverable message. At least I think we can; rabbit is not
> > my main area of expertise, so it would be nice if someone that knows
> > more about it can weigh in on that.
>
> I pinged Ken this morning to take a look at that. He should be able to
> tell us whether it's a good idea or crazy talk. :-)
>

Like I can tell the difference between crazy and good ideas.  Ben, I thought
you knew me better. ;)

As discussed, you can enable the mandatory flag on a per-RPCClient-instance
basis, for example:

    _topts = oslo_messaging.TransportOptions(at_least_once=True)
    client = oslo_messaging.RPCClient(self.transport,
                                      self.target,
                                      timeout=conf.timeout,
                                      version_cap=conf.target_version,
                                      transport_options=_topts).prepare()

This will cause an rpc call/cast to fail if rabbitmq cannot find a queue
for the rpc request message [note the difference between 'queuing the
message' and 'having the message consumed' - the mandatory flag has nothing
to do with whether or not the message is eventually consumed].
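
On the caller side the failure then shows up as an exception rather than a
silent drop.  A minimal sketch of what that could look like, assuming the
MessageUndeliverable exception the rabbit driver raises when the broker
returns a mandatory-flagged message (the 'ping' method and the helper name
are just placeholders for whatever endpoint we end up with):

    import oslo_messaging

    def probe_server(client, ctxt):
        # 'client' is an RPCClient prepared with at_least_once=True as above.
        try:
            client.call(ctxt, 'ping')          # placeholder RPC method
            return True
        except oslo_messaging.MessageUndeliverable:
            # Broker returned the message: no queue is bound for this target.
            return False
        except oslo_messaging.MessagingTimeout:
            # Message was queued but nothing consumed the request in time.
            return False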

Keep in mind that there may be some cases where having no active consumers
is ok and you do not want to get a delivery failure exception - specifically
fanout, or perhaps cast; it depends on the use case.  Also note that if a
fanout use case fails or degrades when not all present services get a
message, the mandatory flag will not detect the error when only a subset of
the bindings is lost.
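
Since TransportOptions are applied per client instance you can keep the
strict behavior only where it makes sense.  A rough sketch, assuming a
transport, target and context are already in hand (the method names are
placeholders):

    strict = oslo_messaging.TransportOptions(at_least_once=True)

    # Targeted calls: fail fast if the server's queue binding is missing.
    call_client = oslo_messaging.RPCClient(transport, target,
                                           transport_options=strict)

    # Fanout casts: leave the flag off so absent consumers are not an error.
    fanout_client = oslo_messaging.RPCClient(transport, target)

    call_client.call(ctxt, 'ping')                               # placeholder
    fanout_client.prepare(fanout=True).cast(ctxt, 'notify_all')  # placeholder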

My biggest concern with this type of failure (lost binding) is that
apparently the consumer is none the wiser when it happens.  Without some
sort of event issued by rabbitmq the RPC server cannot detect this problem
and take corrective actions (or at least I cannot think of any ATM).
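
So detection pretty much has to happen on the sending side.  Tying this back
to Sean's compute.<compute node hostname> example, a caller-side probe could
look something like the sketch below - purely illustrative, and both
handle_lost_binding and the 'ping' endpoint are hypothetical:

    # Route via the topic exchange with routing key compute.<hostname>.
    target = oslo_messaging.Target(topic='compute', server=hostname)
    _topts = oslo_messaging.TransportOptions(at_least_once=True)
    probe = oslo_messaging.RPCClient(transport, target,
                                     timeout=10,
                                     transport_options=_topts).prepare()
    try:
        probe.call(ctxt, 'ping')
    except oslo_messaging.MessageUndeliverable:
        # Nothing is bound to compute.<hostname>: the binding was lost and
        # someone (or something) needs to recreate it / restart the consumer.
        handle_lost_binding(hostname)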


-- 
Ken Giusti  (kgiusti at gmail.com)