On Thu, 2020-08-13 at 10:28 -0500, Ben Nemec wrote:
On 8/13/20 7:14 AM, Sean Mooney wrote:
On Thu, 2020-08-13 at 10:24 +0200, Thierry Carrez wrote:
Ben Nemec wrote:
On 8/12/20 5:32 AM, Thierry Carrez wrote:
Sean Mooney wrote:
On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote: > I wonder if this does help though. It seems like a bug that a > nova-compute service would stop processing messages and still be > seen as up in the service status. Do we understand why that is > happening? If not, I'm unclear that a ping living at the > oslo.messaging layer is going to do a better job of exposing such an > outage. The fact that oslo.messaging is responding does not > necessarily equate to nova-compute functioning as expected. > > To be clear, this is not me nacking the ping feature. I just want to > make sure we understand what is going on here so we don't add > another unreliable healthchecking mechanism to the one we already have.
[...] im not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motiviting the creation of this oslo ping feature but that feels premature if it is. i think it would be better try to adress this by the sender recreating the queue if the deliver fails and if that is not viable then protpyope thge fix in nova. if the self ping fixes this miss queue error then we could extract the cod into oslo.
I think this is missing the point... This is not about working around a specific bug, it's about adding a way to detect a certain class of failure. It's more of an operational feature than a development bugfix.
If I understood correctly, OVH is running that patch in production as a way to detect certain problems they regularly run into, something our existing monitor mechanisms fail to detect. That sounds like a worthwhile addition?
Okay, I don't think I was aware that this was already being used. If someone already finds it useful and it's opt-in then I'm not inclined to block it. My main concern was that we were adding a feature that didn't actually address the problem at hand.
I _would_ feel better about it if someone could give an example of a type of failure this is detecting that is missed by other monitoring methods though. Both because having a concrete example of a use case for the feature is good, and because if it turns out that the problems this is detecting are things like the Nova bug Sean is talking about (which I don't think this would catch anyway, since the topic is missing and there's nothing to ping) then there may be other changes we can/should make to improve things.
Right. Let's wait for Arnaud to come back from vacation and confirm that
(1) that patch is not a shot in the dark: it allows them to expose a class of issues in production
(2) they fail to expose that same class of issues using other existing mechanisms, including those just suggested in this thread
I just wanted to avoid early rejection of this health check ability on the grounds that the situation it exposes should just not happen. Or that, if enabled and heavily used, it would have a performance impact.
I think the inital push back from nova is we already have ping rpc function https://github.com/openstack/nova/blob/c6218428e9b29a2c52808ec7d27b4b21aadc0... so if a geneirc metion called ping is added it will break nova.
It occurred to me after I commented on the review that we have tempest running on oslo.messaging changes and it passed on the patch for this. I suppose it's possible that it broke some error handling in Nova that just isn't tested, but maybe the new ping could function as a cross-project replacement for the Nova ping?
proably yes its only used in one place https://opendev.org/openstack/nova/src/branch/master/nova/conductor/api.py#L... which is only used here in the nova service base class https://github.com/openstack/nova/blob/0b613729ff975f69587a17cc7818c09f7683e... os worst case i think its just going to cause the service to start before the conductor is ready however they have to tolerate the conductor restarting ectra anyway so i dont think it will break anything too badly. i dont see why we coudl not use a generic version instead.
Anyway, it's still be to deduplicate the name, but I felt kind of dumb about having asked if it was tested when the test results were right in front of me. ;-)
the reset of the push back is related to not haveing a concrete usecase, including concern over perfroamce consideration and external services potenailly acessing the rpc bus which is coniserd an internal api. e.g. we woudl not want an external monitoring solution connecting to the rpc bus and invoking arbitary RPC calls, ping is well pretty safe but form a design point of view while litening to notification is fine we dont want anything outside of the openstack services actully sending message on the rpc bus.
I'm not concerned about the performance impact here. It's an optional feature, so anyone using it is choosing to take that hit.
Having external stuff on the RPC bus is more of a gray area, but it's not like we can stop operators from doing that.
well upstream certenly we cant really stop them. downstream on the other hadn without going through the certification process to have your product certifed to work with our downstream distrobution directlly invoking RPC endpoint would invlaidate your support. so form a dwonstream perpective we do have ways to prevent that via docs and makeing it clear that it not supported. we can technically do that upstream but cant really enforce it, its opensouce software after all if you break it then you get to keep the broken pices.
I think it's probably better to provide a well-defined endpoint for them to talk to rather than have everyone implement their own slightly different RPC ping mechanism. The docs for this feature should be very explicit that this is the only thing external code should be calling. ya i think that is a good approch. i would still prefer if people used say middelware to add a service ping admin api endpoint instead of driectly calling the rpc endpoint to avoid exposing rabbitmq but that is out of scope of this discussion.
so if this does actully detect somethign we can otherwise detect and the use cases involves using it within the openstack services not form an external source then i think that is fine but we proably need to use another name (alive? status?) or otherewise modify nova so that there is no conflict.
If I understand your analysis of the bug correctly, this would have caught that type of outage after all since the failure was asymmetric.
am im not sure it might yes looking at https://review.opendev.org/#/c/735385/6 its not clear to me how the endpoint is invoked. is it doing a topic send or a direct send? to detech the failure you would need to invoke a ping on the compute service and that ping would have to been encured on the to nova topic exchante with a routing key of compute.<compute node hostname> if the compute topic queue was broken either because it was nolonger bound to the correct topic or due to some other rabbitmq error then you woudl either get a message undeilverbale error of some kind with the mandaroy flag or likely a timeout without the mandaroty flag. so if the ping would be routed usign a topic too compute.<compute node hostname> then yes it would find this. although we can also detech this ourselves and fix it using the mandatory flag i think by just recreating the queue wehn it extis but we get an undeliverable message, at least i think we can rabbit is not my main are of expertiese so it woudl be nice is someone that know more about it can weigh in on that.
The compute node was still able to send its status updates to Nova, but wasn't receiving any messages. A ping would have detected that situation.