[largescale-sig][nova][neutron][oslo] RPC ping

Thierry Carrez thierry at openstack.org
Wed Aug 12 10:32:27 UTC 2020


Sean Mooney wrote:
> On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
>> I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected.
>> 
>> To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have. 
> [...]
> I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping
> feature, but that feels premature if it is. I think it would be better to try to address this by having the sender
> recreate the queue if the delivery fails, and if that is not viable then prototype the fix in nova. If the self ping
> fixes this missing queue error then we could extract the code into oslo.

I think this is missing the point... This is not about working around a 
specific bug, it's about adding a way to detect a certain class of 
failure. It's more of an operational feature than a development bugfix.

If I understood correctly, OVH is running that patch in production as a 
way to detect certain problems they regularly run into, problems that our 
existing monitoring mechanisms fail to detect. That sounds like a 
worthwhile addition?

Alternatively, if we can monitor the exact same class of failures using 
our existing systems (or by improving them rather than adding a new 
door), that works too.

-- 
Thierry Carrez (ttx)


