Sean Mooney wrote:
On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected.
To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
[...]
I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping feature, but that feels premature if it is. I think it would be better to try to address this by having the sender recreate the queue if the delivery fails and, if that is not viable, then prototype the fix in nova. If the self ping fixes this missing queue error, then we could extract the code into oslo.
I think this is missing the point... This is not about working around a specific bug, it's about adding a way to detect a certain class of failure. It's more of an operational feature than a development bugfix.
If I understood correctly, OVH is running that patch in production as a way to detect certain problems they regularly run into, something our existing monitoring mechanisms fail to detect. That sounds like a worthwhile addition?
Alternatively, if we can monitor the exact same class of failures using our existing systems (or by improving them rather than adding a new door), that works too.
-- 
Thierry Carrez (ttx)