[largescale-sig][nova][neutron][oslo] RPC ping

Ben Nemec openstack at nemebean.com
Wed Aug 12 15:50:21 UTC 2020

On 8/12/20 5:32 AM, Thierry Carrez wrote:
> Sean Mooney wrote:
>> On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
>>> I wonder if this does help though. It seems like a bug that a 
>>> nova-compute service would stop processing messages and still be seen 
>>> as up in the service status. Do we understand why that is happening? 
>>> If not, I'm unclear that a ping living at the oslo.messaging layer is 
>>> going to do a better job of exposing such an outage. The fact that 
>>> oslo.messaging is responding does not necessarily equate to 
>>> nova-compute functioning as expected.
>>> To be clear, this is not me nacking the ping feature. I just want to 
>>> make sure we understand what is going on here so we don't add another 
>>> unreliable healthchecking mechanism to the one we already have. 
>> [...]
>> I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug 
>> that is motivating the creation of this oslo ping feature, but if it 
>> is, that feels premature. I think it would be better to try to address 
>> this by having the sender recreate the queue if delivery fails, and if 
>> that is not viable, then prototype the fix in nova. If the self-ping 
>> fixes this missing-queue error, then we could extract the code into 
>> oslo.
> I think this is missing the point... This is not about working around a 
> specific bug, it's about adding a way to detect a certain class of 
> failure. It's more of an operational feature than a development bugfix.
> If I understood correctly, OVH is running that patch in production as a 
> way to detect certain problems they regularly run into, something our 
> existing monitor mechanisms fail to detect. That sounds like a 
> worthwhile addition?

Okay, I don't think I was aware that this was already being used. If 
someone already finds it useful and it's opt-in then I'm not inclined to 
block it. My main concern was that we were adding a feature that didn't 
actually address the problem at hand.

I _would_ feel better about it if someone could give an example of a 
type of failure this is detecting that is missed by other monitoring 
methods though. Both because having a concrete example of a use case for 
the feature is good, and because if it turns out that the problems this 
is detecting are things like the Nova bug Sean is talking about (which I 
don't think this would catch anyway, since the topic is missing and 
there's nothing to ping) then there may be other changes we can/should 
make to improve things.
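For intuition, the ping under discussion is an RPC round-trip over the service's own topic queue, so it fails precisely when the consumer stops draining that queue, which a process-level heartbeat can miss. Below is a toy, stdlib-only sketch of that pattern (this is not the oslo.messaging API; the queue-based "server" and `ping` helper are purely illustrative):

```python
import queue
import threading

def rpc_server(requests, stop):
    """Toy RPC server loop: answers 'ping' requests until told to stop.

    Stand-in for a service (e.g. nova-compute) consuming its topic queue.
    """
    while not stop.is_set():
        try:
            method, reply_q = requests.get(timeout=0.1)
        except queue.Empty:
            continue
        if method == "ping":
            reply_q.put("pong")

def ping(requests, timeout=1.0):
    """Monitor side: send a ping and wait for the reply.

    Returns True if the service answered within the timeout, False if the
    consumer has stopped processing messages -- the class of failure this
    thread is about, which liveness checks on the process alone miss.
    """
    reply_q = queue.Queue()
    requests.put(("ping", reply_q))
    try:
        return reply_q.get(timeout=timeout) == "pong"
    except queue.Empty:
        return False

requests = queue.Queue()
stop = threading.Event()
worker = threading.Thread(target=rpc_server, args=(requests, stop))
worker.start()

alive = ping(requests)       # consumer is draining the queue -> True
stop.set()
worker.join()
dead = ping(requests, 0.3)   # messages pile up unanswered -> False
```

Note this sketch also illustrates Ben's caveat: if the topic queue itself is gone (as in the Nova bug), there is nothing to put the ping on, so the check never even times out in the same way.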

> Alternatively, if we can monitor the exact same class of failures using 
> our existing systems (or by improving them rather than adding a new 
> door), that works too.
