Thanks for your patience with this! In the last Oslo meeting we had discussed possibly adding some sort of ping client to oslo.messaging to provide a common interface to use this. That would mitigate some of the concerns about everyone having to write their own ping test and potentially sending incorrect messages on the rabbit bus. Obviously that would be done as a followup to this, but I thought I'd mention it in case anyone wants to take a crack at writing something up.

On 8/20/20 10:35 AM, Arnaud Morin wrote:
Hey all,
TLDR:
- Patch in [1] updated
- Example of usage in [3]
- Agree with fixing nova/rabbit/oslo but would like to keep this ping endpoint also
- Totally agree with documentation needed
Long:
Thank you all for your review and for the great information you bring to that topic!
First thing, we are not yet using that patch in production, but in testing/dev only for now (at OVH). But the plan is to use it in production ASAP.
Also, we initially pushed that for the neutron agent, which is why I missed the fact that nova already used the "ping" endpoint, sorry for that.
Anyway, I don't care about the naming, so in the latest patchset of [1], you will see that I changed the name of the endpoint following Ken Giusti's suggestions.
The bug reported in [2] looks very similar to what we saw. Thank you Sean for bringing that to our attention in this thread.
To detect this error, using the above "ping" endpoint in oslo, we can use a script like the one in [3] (sorry about it, I can write better python :p). As mentioned by Sean in a previous mail, I am effectively calling the topic "compute.host123456.sbg5.cloud.ovh.net" in the "nova" exchange. My initial plan would be to identify the topics related to a compute and ping all of them, to make sure that every one is answering. I am not yet sure how often to do this, or whether this is a good plan, btw.
Anyway, the compute is reporting its status as UP, but the ping is timing out, which is exactly what I wanted to detect!
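To make the "ping all topics of a compute" plan a bit more concrete, here is a minimal sketch of the check itself. It is not the script from [3]: the helper names (ping_topics, looks_stuck, PingTimeout) are made up for illustration, and the actual RPC call is left as a pluggable callable. In a real setup that callable would presumably wrap an oslo.messaging RPC call to the ping endpoint and turn oslo_messaging.MessagingTimeout into PingTimeout.

```python
from typing import Callable, Dict, Iterable


class PingTimeout(Exception):
    """Hypothetical error a ping callable raises when no reply arrives in time."""


def ping_topics(topics: Iterable[str],
                ping: Callable[[str, float], None],
                timeout: float = 5.0) -> Dict[str, bool]:
    """Ping every topic and record whether it answered before the timeout."""
    answered = {}
    for topic in topics:
        try:
            ping(topic, timeout)  # assumed to return on reply, raise on timeout
            answered[topic] = True
        except PingTimeout:
            answered[topic] = False
    return answered


def looks_stuck(reported_up: bool, answered: Dict[str, bool]) -> bool:
    """True when the service claims to be UP but at least one topic is silent."""
    return reported_up and not all(answered.values())


def fake_ping(topic: str, timeout: float) -> None:
    """Stand-in for the real RPC ping: pretends the host-specific topic is dead."""
    if topic == "compute.host123456.sbg5.cloud.ovh.net":
        raise PingTimeout(topic)


if __name__ == "__main__":
    # The topic list is illustrative; only the host-specific one is from this mail.
    topics = ["compute", "compute.host123456.sbg5.cloud.ovh.net"]
    answered = ping_topics(topics, fake_ping, timeout=5.0)
    for topic, ok in answered.items():
        print(topic, "OK" if ok else "TIMEOUT")
    print("stuck:", looks_stuck(reported_up=True, answered=answered))
```

Keeping the RPC call behind a callable means the same check could be reused for neutron agents or any other service exposing the endpoint, and it can be exercised without a rabbit cluster.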
I mostly agree with all your comments about the fact that this is a trick that we do as operators, and using the RPC bus is maybe not the best approach, but this is pragmatic and quite simple IMHO. What I also like in this solution is the fact that this is partially outside of OpenStack: the endpoint is inside, but doing the ping is external. Monitoring OpenStack is not always easy, and sometimes we struggle to find the root cause of some issues. Having such an endpoint allows us to monitor OpenStack from an external point of view, but still in a deeper way. It's like a probe in your car telling you that even if you are still running, your engine is off :)
Still, making sure that this bug is fixed by doing some work on (rabbit|oslo.messaging|nova|whatever) is the best thing to do.
However, IMO, this does not prevent this rpc ping endpoint from existing.
Last, but not least, I totally agree about documenting this, but also about adding some documentation on how to configure rabbit and OpenStack services in a way that fits operator needs. There are plenty of parameters which could be tweaked on both the OpenStack and rabbit sides. IMO, we need to explain a little bit more what the impact is of setting a specific parameter to a given value. For example, in another discussion ([4]), we were talking about "durable" queues in rabbit. We managed to find that if we enable HA, we should also enable durability of queues.
Anyway, that's another topic, and this is also something we discuss in the large-scale group.
Thank you all,
[1] https://review.opendev.org/#/c/735385/
[2] https://bugs.launchpad.net/nova/+bug/1854992
[3] http://paste.openstack.org/show/796990/
[4] http://lists.openstack.org/pipermail/openstack-discuss/2020-August/016362.ht...