[largescale-sig][nova][neutron][oslo] RPC ping
Ben Nemec
openstack at nemebean.com
Wed Aug 12 15:50:21 UTC 2020
On 8/12/20 5:32 AM, Thierry Carrez wrote:
> Sean Mooney wrote:
>> On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
>>> I wonder if this does help though. It seems like a bug that a
>>> nova-compute service would stop processing messages and still be seen
>>> as up in the service status. Do we understand why that is happening?
>>> If not, I'm unclear that a ping living at the oslo.messaging layer is
>>> going to do a better job of exposing such an outage. The fact that
>>> oslo.messaging is responding does not necessarily equate to
>>> nova-compute functioning as expected.
>>>
>>> To be clear, this is not me nacking the ping feature. I just want to
>>> make sure we understand what is going on here so we don't add another
>>> unreliable healthchecking mechanism to the one we already have.
>> [...]
>> I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug
>> that is motivating the creation of this oslo ping feature, but that
>> feels premature if it is. I think it would be better to try to address
>> this by having the sender recreate the queue if the delivery fails,
>> and if that is not viable then prototype the fix in nova. If the
>> self-ping fixes this missing-queue error then we could extract the
>> code into oslo.
>
> I think this is missing the point... This is not about working around a
> specific bug, it's about adding a way to detect a certain class of
> failure. It's more of an operational feature than a development bugfix.
>
> If I understood correctly, OVH is running that patch in production as a
> way to detect certain problems they regularly run into, something our
> existing monitor mechanisms fail to detect. That sounds like a
> worthwhile addition?
Okay, I don't think I was aware that this was already being used. If
someone already finds it useful and it's opt-in then I'm not inclined to
block it. My main concern was that we were adding a feature that didn't
actually address the problem at hand.
I _would_ feel better about it, though, if someone could give an example
of a type of failure this detects that is missed by other monitoring
methods. Partly because having a concrete use case for the feature is
good, and partly because if the problems it is detecting turn out to be
things like the Nova bug Sean is talking about (which I don't think this
would catch anyway, since the topic queue is missing and there's nothing
to ping), then there may be other changes we can or should make to
improve things.
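For concreteness, here is roughly what I imagine an external monitor
doing with the proposed ping. This is only a sketch against the patch as
I understand it: the method name "oslo_rpc_server_ping" and the
topic/host values are assumptions on my part, not something I've
verified against the review.

# Rough sketch: probe a single compute host's topic queue via the
# proposed ping endpoint. "oslo_rpc_server_ping" is an assumed method
# name; 'compute'/'compute-0001' are placeholder topic/host values.
from oslo_config import cfg
import oslo_messaging

conf = cfg.CONF
conf([], project='nova')  # pick up transport_url from nova.conf

transport = oslo_messaging.get_rpc_transport(conf)
target = oslo_messaging.Target(topic='compute', server='compute-0001')
client = oslo_messaging.RPCClient(transport, target, timeout=10)

try:
    # A successful round trip shows the topic queue for this host exists
    # and something on the host is actually consuming from it, which is
    # the part the DB-driven service heartbeat doesn't tell you.
    client.call({}, 'oslo_rpc_server_ping')
    print('compute-0001 answered the RPC ping')
except oslo_messaging.MessagingTimeout:
    print('compute-0001 did not answer within 10s')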
>
> Alternatively, if we can monitor the exact same class of failures using
> our existing systems (or by improving them rather than adding a new
> door), that works too.
>