On Wed, 2020-08-12 at 12:32 +0200, Thierry Carrez wrote:
Sean Mooney wrote:
On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected.
To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
[...] I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping feature, but that feels premature if it is. I think it would be better to try to address this by having the sender recreate the queue if the delivery fails, and if that is not viable, then prototype the fix in nova. If the self ping fixes this missing queue error, then we could extract the code into oslo.
I think this is missing the point... This is not about working around a specific bug, it's about adding a way to detect a certain class of failure. It's more of an operational feature than a development bugfix.
Right, but we are concerned that there will be a negative performance impact to adding it, and that it won't detect the one bug of this type that we are aware of in a way that we could not also detect by using the mandatory flag. Nova already has a heartbeat that the agents send to the conductor to report they are still alive. This ping would work in the opposite direction, by reaching out to the compute node over the rpc bus. But that would only detect the failure mode if the ping uses the topic queue, and it could only fix it if recreating the queue via the conductor is a viable solution. If it is, using the mandatory flag and just recreating the queue is a better solution, since we don't need to ping constantly in the background: if we get the exception, we create the queue and retransmit. If the compute manager does not resubscribe to the topic when the queue is recreated automatically, then the new ping feature won't really help. We would need the compute service, or any other service that subscribes to the topic queue, to try to ping its own topic queue and, if that fails, recreate the subscription/queue. As far as I am aware, that is not what the feature is proposing.
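To make the "mandatory flag, then recreate and retransmit" idea concrete, here is a minimal sketch in Python. It does not use oslo.messaging or a real RabbitMQ connection: `ToyBroker`, `Unroutable`, `send_with_recreate` and the queue name `compute.host1` are all hypothetical names invented for illustration, and the broker is just an in-memory dict standing in for the real thing.

```python
class Unroutable(Exception):
    """Raised by the toy broker when a mandatory message has no queue."""


class ToyBroker:
    """In-memory stand-in for a broker (hypothetical, illustration only)."""

    def __init__(self):
        self.queues = {}

    def declare_queue(self, name):
        # Idempotent: declaring an existing queue leaves it untouched.
        self.queues.setdefault(name, [])

    def publish(self, queue, message, mandatory=False):
        if queue not in self.queues:
            if mandatory:
                # With the mandatory flag the sender is told about the failure.
                raise Unroutable(queue)
            # Without it the message is silently dropped.
            return
        self.queues[queue].append(message)


def send_with_recreate(broker, queue, message):
    """Publish with mandatory=True; on failure recreate the queue and retry.

    Returns True if the queue had to be recreated, False otherwise.
    """
    try:
        broker.publish(queue, message, mandatory=True)
        return False
    except Unroutable:
        broker.declare_queue(queue)
        broker.publish(queue, message, mandatory=True)
        return True


if __name__ == "__main__":
    broker = ToyBroker()
    print(send_with_recreate(broker, "compute.host1", "build_instance"))
```

The sketch only covers the sender side. It does not model whether the consumer resubscribes to the recreated queue, which is the open question raised above.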
If I understood correctly, OVH is running that patch in production as a way to detect certain problems they regularly run into, something our existing monitoring mechanisms fail to detect. That sounds like a worthwhile addition?
I'm not sure what failure mode it will detect. If they can define that, then it would help with understanding whether this is worthwhile or not.
Alternatively, if we can monitor the exact same class of failures using our existing systems (or by improving them rather than adding a new door), that works too.
We can monitor the existence of the queue, at least from the rabbitmq api (it's disabled by default, but just enable the rabbitmq management plugin), but I'm not sure what their current issue is that this is trying to solve.
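As a rough sketch of what checking for the queue through the management API could look like: the management plugin's HTTP API exposes a queue at `/api/queues/<vhost>/<name>` and answers 404 when it does not exist. The host, port, credentials and the queue name `compute.host1` below are placeholders and assumptions, not values taken from this thread, and the helper names are made up for the example.

```python
import base64
import urllib.error
import urllib.parse
import urllib.request


def queue_url(base_url, vhost, queue):
    """Build the management API URL for one queue.

    The vhost and queue name are path segments, so characters such as "/"
    (the default vhost) must be percent-encoded.
    """
    return "{}/api/queues/{}/{}".format(
        base_url.rstrip("/"),
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(queue, safe=""),
    )


def queue_exists(base_url, vhost, queue, user, password, timeout=5):
    """Return True if the queue exists, False on a 404 from the API."""
    token = base64.b64encode("{}:{}".format(user, password).encode()).decode()
    request = urllib.request.Request(
        queue_url(base_url, vhost, queue),
        headers={"Authorization": "Basic " + token},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            return False
        # Auth failures and server errors are not "queue missing".
        raise


if __name__ == "__main__":
    # Placeholder endpoint; 15672 is the management plugin's default port.
    print(queue_url("http://localhost:15672", "/", "compute.host1"))
```

This only shows that the queue is declared on the broker. It says nothing about whether the compute service is still consuming from it.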