On 8/13/20 11:07 AM, Sean Mooney wrote:
>> I think it's probably
>> better to provide a well-defined endpoint for them to talk to rather
>> than have everyone implement their own slightly different RPC ping
>> mechanism. The docs for this feature should be very explicit that this
>> is the only thing external code should be calling.
> Yeah, I think that is a good approach.
> I would still prefer that people use, say, middleware to add a service ping admin API endpoint
> instead of directly calling the RPC endpoint, to avoid exposing RabbitMQ, but that is out of scope for this discussion.
Completely agree. In the long run I would like to see this replaced with
better integrated healthchecking in OpenStack, but we've been talking
about that for years and have made minimal progress.
>
>>
>>>
>>> So if this does actually detect something we can't otherwise detect, and the use case involves using it within
>>> the OpenStack services rather than from an external source, then I think that is fine, but we probably need to use another
>>> name (alive? status?) or otherwise modify Nova so that there is no conflict.
>>>>
>>
>> If I understand your analysis of the bug correctly, this would have
>> caught that type of outage after all since the failure was asymmetric.
> Um, I'm not sure.
> It might, yes. Looking at https://review.opendev.org/#/c/735385/6
> it's not clear to me how the endpoint is invoked. Is it doing a topic send or a direct send?
> To detect the failure you would need to invoke a ping on the compute service, and that ping would
> have to be enqueued on the nova topic exchange with a routing key of compute.<compute node hostname>.
>
> If the compute topic queue was broken, either because it was no longer bound to the correct topic or due to some other
> RabbitMQ error, then you would either get a message-undeliverable error of some kind with the mandatory flag or, likely, a
> timeout without the mandatory flag. So if the ping were routed using a topic to compute.<compute node hostname>,
> then yes, it would find this.
>
> Although we can also detect this ourselves and fix it using the mandatory flag, I think, by just recreating the queue when
> it exists but we get an undeliverable message. At least I think we can; RabbitMQ is not my main area of expertise, so it
> would be nice if someone who knows more about it could weigh in on that.
I pinged Ken this morning to take a look at that. He should be able to
tell us whether it's a good idea or crazy talk. :-)
Like I can tell the difference between crazy and good ideas. Ben, I thought you knew me better. ;)
As discussed, you can enable the mandatory flag per RPCClient instance, for example:
    _topts = oslo_messaging.TransportOptions(at_least_once=True)
    client = oslo_messaging.RPCClient(self.transport,
                                      self.target,
                                      timeout=conf.timeout,
                                      version_cap=conf.target_version,
                                      transport_options=_topts).prepare()
This will cause an RPC call/cast to fail if RabbitMQ cannot find a queue for the RPC request message (note the difference between 'queuing the message' and 'having the message consumed': the mandatory flag has nothing to do with whether or not the message is eventually consumed).
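To make the queued-versus-consumed distinction concrete, here is a rough toy model in plain Python. It is not RabbitMQ or oslo.messaging code; ToyExchange, Undeliverable and the routing key "compute.host1" are hypothetical names invented for the illustration, and the model only assumes that a mandatory publish fails when no queue is bound for the routing key.

```python
class Undeliverable(Exception):
    """Hypothetical stand-in for a message the broker returns as unroutable."""


class ToyExchange:
    """Toy exchange (an assumption, not a real broker): one queue per routing key."""

    def __init__(self):
        self.bindings = {}  # routing key -> queue (list of pending messages)

    def bind(self, routing_key):
        self.bindings[routing_key] = []
        return self.bindings[routing_key]

    def unbind(self, routing_key):
        self.bindings.pop(routing_key, None)

    def publish(self, routing_key, message, mandatory=False):
        queue = self.bindings.get(routing_key)
        if queue is None:
            if mandatory:
                raise Undeliverable(routing_key)
            return False  # dropped silently; an RPC caller would just time out
        queue.append(message)  # queued, whether or not anything consumes it
        return True


exchange = ToyExchange()
queue = exchange.bind("compute.host1")

# The queue exists but nothing consumes from it: a mandatory publish still succeeds.
queued = exchange.publish("compute.host1", "ping", mandatory=True)

# The binding is lost: only the mandatory publish reports the problem.
exchange.unbind("compute.host1")
dropped = exchange.publish("compute.host1", "ping")
try:
    exchange.publish("compute.host1", "ping", mandatory=True)
    returned = False
except Undeliverable:
    returned = True
```

In this sketch the first publish is accepted even though the message just sits in the queue, which is why the flag says nothing about whether a service is actually alive and consuming.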
Keep in mind that there may be cases where having no active consumers is OK and you do not want a delivery failure exception, specifically fanout or perhaps cast. It depends on the use case. Also, if there are fanout use cases that fail or degrade unless every present service gets the message, the mandatory flag will not detect an error when only a subset of the bindings is lost.
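The fanout caveat can be sketched the same way. Again this is only a toy model under stated assumptions, not broker code: publish_fanout and the host names are made up, and the model assumes a mandatory fanout publish fails only when no queue at all is bound.

```python
def publish_fanout(bound_queues, message, mandatory=True):
    """Toy fanout (hypothetical helper): copy the message to every bound queue."""
    if mandatory and not bound_queues:
        raise LookupError("message returned: no queue is bound")
    for queue in bound_queues.values():
        queue.append(message)
    return len(bound_queues)


expected_hosts = {"host1", "host2", "host3"}
bound = {host: [] for host in expected_hosts}

# One service loses its binding; the other two are still bound.
del bound["host2"]

# The mandatory publish succeeds because at least one queue took the message.
delivered = publish_fanout(bound, "refresh")

# Only a comparison against the hosts we expected reveals the gap.
missed = expected_hosts - set(bound)
```

So a caller that needs every service to see a fanout message would have to track the expected recipients some other way; the flag alone only catches the case where nothing is bound.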