Hey all,

TLDR:
- Patch in [1] updated
- Example of usage in [3]
- Agree with fixing nova/rabbit/oslo, but would like to keep this ping endpoint as well
- Totally agree that documentation is needed

Long:

Thank you all for your review and for the great information you bring to this topic!

First thing: we are not yet using that patch in production, but only in testing/dev for now (at OVH). The plan, though, is to use it in production ASAP.

Also, we initially pushed this for the neutron agent, which is why I missed the fact that nova already used the "ping" endpoint, sorry for that. Anyway, I don't care about the naming, so in the latest patchset of [1] you will see that I changed the name of the endpoint following Ken Giusti's suggestions.

The bug reported in [2] looks very similar to what we saw. Thank you, Sean, for bringing it to attention in this thread.

To detect this error, using the above "ping" endpoint in oslo, we can use a script like the one in [3] (sorry about it, I can write better python :p) - a stripped-down sketch is included after the references below. As mentioned by Sean in a previous mail, I am effectively calling the topic "compute.host123456.sbg5.cloud.ovh.net" in the "nova" exchange. My initial plan is to identify the topics related to a compute and ping all of them, to make sure that all of them are answering. I am not yet sure how often to do this, or whether it is a good plan at all, btw. Anyway, the compute is reporting its status as UP, but the ping is timing out, which is exactly what I wanted to detect!

I mostly agree with all your comments about the fact that this is a trick that we do as operators, and using the RPC bus is maybe not the best approach, but it is pragmatic and quite simple IMHO.

What I also like in this solution is that it is partially outside of OpenStack: the endpoint is inside, but doing the ping is external. Monitoring OpenStack is not always easy, and sometimes we struggle to find the root cause of some issues. Having such an endpoint allows us to monitor OpenStack from an external point of view, but still in a deeper way. It's like a probe in your car telling you that even if you are still moving, your engine is off :)

Still, making sure that this bug is fixed by doing some work on (rabbit|oslo.messaging|nova|whatever) is the best thing to do. However, IMO, that does not prevent this rpc ping endpoint from existing.

Last, but not least, I totally agree about documenting this, but also about adding some documentation on how to configure rabbit and OpenStack services in a way that fits operator needs. There are plenty of parameters which can be tweaked on both the OpenStack and rabbit sides. IMO, we need to explain a little bit more what the impact of setting a specific parameter to a given value is. For example, in another discussion ([4]), we were talking about "durable" queues in rabbit. We managed to find out that if we enable HA, we should also enable durability of queues. Anyway, that's another topic, and it is also something we discuss in the large-scale group.

Thank you all,

[1] https://review.opendev.org/#/c/735385/
[2] https://bugs.launchpad.net/nova/+bug/1854992
[3] http://paste.openstack.org/show/796990/
[4] http://lists.openstack.org/pipermail/openstack-discuss/2020-August/016362.ht...
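For illustration, here is roughly what such an external check could look like (a sketch only, not the exact script from [3]: the method name "oslo_rpc_server_ping", the transport URL handling and the topic/server values are assumptions, and the final endpoint name is whatever lands in [1]):

#!/usr/bin/env python3
# Rough sketch: ping one compute's topic queue through the RPC bus and
# alert when the service looks UP but the queue no longer answers.
import sys

from oslo_config import cfg
import oslo_messaging

def check_compute(host, transport_url):
    transport = oslo_messaging.get_rpc_transport(cfg.CONF, url=transport_url)
    # Topic send on the "nova" exchange, routed with key compute.<host>
    target = oslo_messaging.Target(exchange="nova", topic="compute", server=host)
    client = oslo_messaging.RPCClient(transport, target, timeout=10)
    try:
        # Method name is an assumption, see [1] for the final name
        client.call({}, "oslo_rpc_server_ping")
        print("%s: pong" % host)
        return 0
    except oslo_messaging.MessagingTimeout:
        print("%s: ping timed out" % host)
        return 2

if __name__ == "__main__":
    sys.exit(check_compute(sys.argv[1], sys.argv[2]))

--
Arnaud Morin

On 13.08.20 - 17:17, Ken Giusti wrote: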
On Thu, Aug 13, 2020 at 12:30 PM Ben Nemec <openstack@nemebean.com> wrote:
On 8/13/20 11:07 AM, Sean Mooney wrote:
I think it's probably better to provide a well-defined endpoint for them to talk to rather than have everyone implement their own slightly different RPC ping mechanism. The docs for this feature should be very explicit that this is the only thing external code should be calling.

ya i think that is a good approach. i would still prefer if people used, say, middleware to add a service ping admin api endpoint instead of directly calling the rpc endpoint, to avoid exposing rabbitmq, but that is out of scope of this discussion.
Completely agree. In the long run I would like to see this replaced with better integrated healthchecking in OpenStack, but we've been talking about that for years and have made minimal progress.
so if this does actually detect something we can't otherwise detect, and the use case involves using it within the openstack services, not from an external source, then i think that is fine, but we probably need to use another name (alive? status?) or otherwise modify nova so that there is no conflict.
If I understand your analysis of the bug correctly, this would have caught that type of outage after all since the failure was asymmetric.

i'm not sure, it might, yes. looking at https://review.opendev.org/#/c/735385/6 it's not clear to me how the endpoint is invoked. is it doing a topic send or a direct send? to detect the failure you would need to invoke a ping on the compute service, and that ping would have to be routed to the nova topic exchange with a routing key of compute.<compute node hostname>.
if the compute topic queue was broken, either because it was no longer bound to the correct topic or due to some other rabbitmq error, then you would either get a message-undeliverable error of some kind with the mandatory flag, or likely a timeout without the mandatory flag. so if the ping is routed using a topic of compute.<compute node hostname>, then yes, it would find this.
although we can also detect this ourselves and fix it using the mandatory flag, i think, by just recreating the queue when it exists but we get an undeliverable message. at least i think we can; rabbit is not my main area of expertise, so it would be nice if someone who knows more about it could weigh in on that.
I pinged Ken this morning to take a look at that. He should be able to tell us whether it's a good idea or crazy talk. :-)
Like I can tell the difference between crazy and good ideas. Ben I thought you knew me better. ;)
As discussed, you can enable the mandatory flag on a per-RPCClient-instance basis, for example:
_topts = oslo_messaging.TransportOptions(at_least_once=True)
client = oslo_messaging.RPCClient(self.transport,
                                  self.target,
                                  timeout=conf.timeout,
                                  version_cap=conf.target_version,
                                  transport_options=_topts).prepare()
This will cause an rpc call/cast to fail if rabbitmq cannot find a queue for the rpc request message [note the difference between 'queuing the message' and 'having the message consumed' - the mandatory flag has nothing to do with whether or not the message is eventually consumed].
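For instance, a caller that wants to treat "no queue bound for the routing key" as a hard error could do something like this (a rough sketch, not code from the patch; the 'ping' method name is a stand-in, and MessageUndeliverable is my reading of the current rabbit driver behaviour - double check against your oslo.messaging version):

import oslo_messaging
from oslo_messaging import exceptions as om_exc

def ping_compute(transport, target, ctxt):
    # With at_least_once=True the rabbit driver sets the mandatory flag,
    # so an unroutable request surfaces as an exception instead of a
    # silent drop followed by a timeout.
    topts = oslo_messaging.TransportOptions(at_least_once=True)
    client = oslo_messaging.RPCClient(transport, target,
                                      timeout=10,
                                      transport_options=topts).prepare()
    try:
        return client.call(ctxt, 'ping')
    except om_exc.MessageUndeliverable:
        # No queue is bound for this routing key: the binding is gone even
        # though the service may still be reporting itself as up.
        return None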
Keep in mind that there may be some cases where having no active consumers is OK and you do not want to get a delivery failure exception - specifically fanout, or perhaps cast; it depends on the use case. Also note that if a fanout use case fails or degrades when not every service gets the message, the mandatory flag will not detect an error as long as at least one binding is still in place - it cannot catch the loss of only a subset of the bindings.
My biggest concern with this type of failure (lost binding) is that apparently the consumer is none the wiser when it happens. Without some sort of event issued by rabbitmq the RPC server cannot detect this problem and take corrective actions (or at least I cannot think of any ATM).
-- Ken Giusti (kgiusti@gmail.com)