[largescale-sig][nova][neutron][oslo] RPC ping

Thierry Carrez thierry at openstack.org
Thu Aug 13 08:24:26 UTC 2020

Ben Nemec wrote:
> On 8/12/20 5:32 AM, Thierry Carrez wrote:
>> Sean Mooney wrote:
>>> On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
>>>> I wonder if this does help though. It seems like a bug that a 
>>>> nova-compute service would stop processing messages and still be 
>>>> seen as up in the service status. Do we understand why that is 
>>>> happening? If not, I'm unclear that a ping living at the 
>>>> oslo.messaging layer is going to do a better job of exposing such an 
>>>> outage. The fact that oslo.messaging is responding does not 
>>>> necessarily equate to nova-compute functioning as expected.
>>>> To be clear, this is not me nacking the ping feature. I just want to 
>>>> make sure we understand what is going on here so we don't add 
>>>> another unreliable healthchecking mechanism to the one we already have. 
>>> [...]
>>> I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug
>>> that is motivating the creation of this oslo ping feature, but that
>>> feels premature if it is. I think it would be better to try to address
>>> this by having the sender recreate the queue if the delivery fails and,
>>> if that is not viable, then prototype the fix in nova. If the self ping
>>> fixes this missing queue error, then we could extract the code into oslo.
>> I think this is missing the point... This is not about working around 
>> a specific bug, it's about adding a way to detect a certain class of 
>> failure. It's more of an operational feature than a development bugfix.
>> If I understood correctly, OVH is running that patch in production as 
>> a way to detect certain problems they regularly run into, something 
>> our existing monitoring mechanisms fail to detect. That sounds like a 
>> worthwhile addition?
> Okay, I don't think I was aware that this was already being used. If 
> someone already finds it useful and it's opt-in then I'm not inclined to 
> block it. My main concern was that we were adding a feature that didn't 
> actually address the problem at hand.
> I _would_ feel better about it if someone could give an example of a 
> type of failure this is detecting that is missed by other monitoring 
> methods though. Both because having a concrete example of a use case for 
> the feature is good, and because if it turns out that the problems this 
> is detecting are things like the Nova bug Sean is talking about (which I 
> don't think this would catch anyway, since the topic is missing and 
> there's nothing to ping) then there may be other changes we can/should 
> make to improve things.

Right. Let's wait for Arnaud to come back from vacation and confirm that

(1) that patch is not a shot in the dark: it allows them to expose a 
class of issues in production

(2) they fail to expose that same class of issues using other existing 
mechanisms, including those just suggested in this thread

I just wanted to avoid early rejection of this health check ability on 
the grounds that the situation it exposes should just not happen, or 
that, if enabled and heavily used, it would have a performance impact.

Thierry Carrez (ttx)
