On Thu, Aug 13, 2020 at 12:30 PM Ben Nemec <openstack@nemebean.com> wrote:
On 8/13/20 11:07 AM, Sean Mooney wrote:
I think it's probably better to provide a well-defined endpoint for them to talk to rather than have everyone implement their own slightly different RPC ping mechanism. The docs for this feature should be very explicit that this is the only thing external code should be calling.

ya, i think that is a good approach. i would still prefer if people used, say, middleware to add a service ping admin api endpoint instead of directly calling the rpc endpoint, to avoid exposing rabbitmq, but that is out of scope of this discussion.
Completely agree. In the long run I would like to see this replaced with better integrated healthchecking in OpenStack, but we've been talking about that for years and have made minimal progress.
so if this does actually detect something we can't otherwise detect, and the use case involves using it within the openstack services, not from an external source, then i think that is fine, but we probably need to use another name (alive? status?) or otherwise modify nova so that there is no conflict.
If I understand your analysis of the bug correctly, this would have caught that type of outage after all since the failure was asymmetric.

um, i'm not sure. it might, yes. looking at https://review.opendev.org/#/c/735385/6 it's not clear to me how the endpoint is invoked. is it doing a topic send or a direct send? to detect the failure you would need to invoke a ping on the compute service, and that ping would have to be enqueued on the nova topic exchange with a routing key of compute.<compute node hostname>
if the compute topic queue was broken, either because it was no longer bound to the correct topic or due to some other rabbitmq error, then you would either get a message undeliverable error of some kind with the mandatory flag, or likely a timeout without the mandatory flag. so if the ping would be routed using a topic to compute.<compute node hostname> then yes, it would find this.
although we can also detect this ourselves and fix it using the mandatory flag, i think, by just recreating the queue when it exists but we get an undeliverable message. at least i think we can; rabbit is not my main area of expertise, so it would be nice if someone that knows more about it can weigh in on that.
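
For readers following along, the two outcomes described above (an undeliverable error with the mandatory flag, a timeout without it) can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: ToyExchange, UnroutableError and ping are hypothetical names, routing is exact-match only, and none of this is RabbitMQ or oslo.messaging code.

```python
# Toy in-memory model of routing by key; NOT RabbitMQ or oslo.messaging code.
# All names here (ToyExchange, UnroutableError, ping) are hypothetical.


class UnroutableError(Exception):
    """Raised when a mandatory message matches no bound queue."""


class ToyExchange:
    def __init__(self):
        # routing key -> list standing in for the bound queue
        self.bindings = {}

    def bind(self, routing_key):
        self.bindings[routing_key] = []
        return self.bindings[routing_key]

    def unbind(self, routing_key):
        # Simulates the lost binding discussed in the thread.
        self.bindings.pop(routing_key, None)

    def publish(self, routing_key, message, mandatory=False):
        queue = self.bindings.get(routing_key)
        if queue is None:
            if mandatory:
                # With the mandatory flag the publisher is told right away.
                raise UnroutableError(routing_key)
            # Without it the message is silently dropped; an RPC caller
            # would only notice when its reply timeout expires.
            return False
        queue.append(message)
        return True


def ping(exchange, host, mandatory):
    """Send a ping to compute.<host> and report what the caller would see."""
    key = "compute." + host
    try:
        routed = exchange.publish(key, "ping", mandatory=mandatory)
    except UnroutableError:
        return "undeliverable"
    return "queued" if routed else "timeout"
```

In this model a ping to a host whose binding has been removed returns "undeliverable" when mandatory is set and "timeout" otherwise, which matches the behaviour described above. Note that "queued" says nothing about whether a consumer ever picks the message up.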
I pinged Ken this morning to take a look at that. He should be able to tell us whether it's a good idea or crazy talk. :-)
Like I can tell the difference between crazy and good ideas. Ben I thought you knew me better. ;)

As discussed you can enable the mandatory flag on a per RPCClient instance, for example:

    _topts = oslo_messaging.TransportOptions(at_least_once=True)
    client = oslo_messaging.RPCClient(self.transport,
                                      self.target,
                                      timeout=conf.timeout,
                                      version_cap=conf.target_version,
                                      transport_options=_topts).prepare()

This will cause an rpc call/cast to fail if rabbitmq cannot find a queue for the rpc request message [note the difference between 'queuing the message' and 'having the message consumed' - the mandatory flag has nothing to do with whether or not the message is eventually consumed].

Keep in mind that there may be some cases where having no active consumers is ok and you do not want to get a delivery failure exception - specifically fanout or perhaps cast. Depends on the use case. If there are fanout use cases that fail or degrade if all present services don't get a message then the mandatory flag will not detect an error if a subset of the bindings are lost.

My biggest concern with this type of failure (lost binding) is that apparently the consumer is none the wiser when it happens. Without some sort of event issued by rabbitmq the RPC server cannot detect this problem and take corrective actions (or at least I cannot think of any ATM).

--
Ken Giusti (kgiusti@gmail.com)
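
The fanout caveat can be made concrete with a small toy model. It is a sketch only: fanout_publish and UnroutableError are hypothetical names, and the model assumes a mandatory publish fails only when no queue at all is bound. It is not RabbitMQ or oslo.messaging code.

```python
# Toy model of a fanout publish; NOT RabbitMQ or oslo.messaging code.
# fanout_publish and UnroutableError are hypothetical names.


class UnroutableError(Exception):
    """Raised when a mandatory message reaches no queue at all."""


def fanout_publish(bound_queues, message, mandatory=False):
    """Copy message to every bound queue and return how many were reached.

    Assumption: mandatory only fails when zero queues are bound.
    """
    if not bound_queues and mandatory:
        raise UnroutableError("no queue bound")
    for queue in bound_queues.values():
        queue.append(message)
    return len(bound_queues)


# Three services are expected to receive the message...
expected = {"node1", "node2", "node3"}
bound = {name: [] for name in expected}

# ...but one binding has been lost.
del bound["node2"]

# The publish still succeeds: at least one queue took the message.
reached = fanout_publish(bound, "refresh", mandatory=True)

# The publisher only learns of the gap if it knows who should be listening.
missing = expected - set(bound)
```

Here the mandatory publish raises no error even though node2 never receives the message, which is the partial-loss case the flag cannot detect.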