On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
On 7/28/20 3:02 AM, Johannes Kulik wrote:
Hi,
On 7/27/20 7:08 PM, Dan Smith wrote:
The primary concern was about something other than nova sitting on our bus making calls to our internal services. I imagine that the proposal to bake it into oslo.messaging is for the same purpose, and I'd probably have the same concern. At the time I think we agreed that if we were going to support direct-to-service health checks, they should be teensy HTTP servers with oslo healthchecks middleware. Further loading down rabbit with those pings doesn't seem like the best plan to me. Especially since Nova (compute) services already check in over RPC periodically and the success of that is discoverable en masse through the API.
--Dan
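For illustration, a "teensy HTTP server" of the kind mentioned above could look roughly like the sketch below. It uses only the Python standard library and is not the oslo healthcheck middleware itself; the `/healthcheck` path, the `healthcheck_app`, `probe`, and `serve` names, and the port are all assumptions made for the example.

```python
from wsgiref.simple_server import make_server
from wsgiref.util import setup_testing_defaults


def healthcheck_app(environ, start_response):
    """Tiny WSGI app: 200 on /healthcheck (assumed path), 404 elsewhere."""
    if environ.get("PATH_INFO") == "/healthcheck":
        status, body = "200 OK", b"OK"
    else:
        status, body = "404 Not Found", b"Not Found"
    headers = [("Content-Type", "text/plain"),
               ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]


def probe(path):
    """Call the app in-process, the way a monitoring system would over HTTP."""
    environ = {}
    setup_testing_defaults(environ)  # fill in a minimal valid WSGI environ
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(healthcheck_app(environ, start_response))
    return captured["status"], body


def serve(port=8080):
    """Serve the check on localhost; the port is a hypothetical choice."""
    make_server("127.0.0.1", port, healthcheck_app).serve_forever()
```

Calling `probe("/healthcheck")` exercises the handler without opening a socket. A check like this only shows that the process is alive and answering HTTP; it does not touch the message bus at all.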
While I get this concern, we have seen the problem described by the original poster in production multiple times: nova-compute reports itself as healthy and is seen as up through the API, but doesn't process any messages anymore. A health check going through rabbitmq would really help spot those situations, while an additional HTTP server doesn't.
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status.
It kind of is a bug, this one to be precise: https://bugs.launchpad.net/nova/+bug/1854992
Do we understand why that is happening?
Assuming it is https://bugs.launchpad.net/nova/+bug/1854992, then the reason the compute status is still up is that the compute service is running fine and sending heartbeats. The issue is that under certain failure modes the topic queue used to receive RPC topic sends can disappear. One way this can happen is if the rabbitmq server restarts, in which case the resend code in oslo will reconnect to the exchange but will not necessarily recreate the topic queue.
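A toy model may make this failure mode easier to see. The sketch below is an in-memory stand-in, not RabbitMQ or oslo.messaging code: `ToyBroker`, `ToyCompute`, the `compute.<host>` queue name, and the status table are all hypothetical names invented for the example.

```python
class ToyBroker:
    """In-memory stand-in for a message broker (not a real AMQP client)."""

    def __init__(self):
        self.queues = {}  # queue name -> list of pending messages

    def declare_queue(self, name):
        self.queues.setdefault(name, [])

    def restart(self):
        # Model a restart after which nobody redeclares the queues.
        self.queues.clear()

    def publish(self, queue, message):
        # Without a mandatory flag, an unroutable message is dropped silently.
        if queue not in self.queues:
            return False
        self.queues[queue].append(message)
        return True


class ToyCompute:
    """Hypothetical compute service with one topic queue."""

    def __init__(self, broker, host):
        self.broker = broker
        self.topic = "compute." + host
        broker.declare_queue(self.topic)

    def report_state(self, status_table):
        # The heartbeat is sent *by* the service, so it does not depend on
        # the service's own receive queue still existing.
        status_table[self.topic] = "up"

    def consume(self):
        pending = self.broker.queues.get(self.topic, [])
        handled = list(pending)
        pending.clear()
        return handled


def demo_lost_queue():
    broker = ToyBroker()
    compute = ToyCompute(broker, "node1")
    status = {}
    broker.restart()              # topic queue disappears
    compute.report_state(status)  # heartbeat still succeeds
    delivered = broker.publish(compute.topic, "build_instance")
    return status[compute.topic], delivered, compute.consume()
```

In this model `demo_lost_queue()` reports the service as "up" while the message sent to it is never delivered and nothing is consumed, which is the mismatch described above.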
If not, I'm unclear that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily equate to nova-compute functioning as expected.
Maybe saying that a little more clearly: https://bugs.launchpad.net/nova/+bug/1854992 has other causes beyond the rabbitmq server crashing, but the underlying effect is the same: the queue that the compute service uses to receive RPC calls is destroyed and not recreated. A related oslo bug, https://bugs.launchpad.net/oslo.messaging/+bug/1661510, was "fixed" by adding the mandatory transport flag feature. (You can probably mark that as Fix Released, by the way.)
From a nova perspective, the intended way to fix the nova bug was to use the new mandatory flag, catch MessageUndeliverable, and have the conductor/api recreate the compute service's topic queue and resend the AMQP message. An open question is whether the compute service will detect that and start processing the queue again.
If that will not fix the problem, plan B was to add a self ping to the compute service, where the compute service, on a long timeout (once an hour, maybe once every 15 mins at the most), would try to send a message to its own receive queue. If it got the MessageUndeliverable exception, then the compute service would recreate its own queue.
Adding an interservice ping or triggering the ping externally is unlikely to help with the nova bug. Ideally we would prefer to have the conductor/api recreate the queue and resend the message if it detects the queue is missing, rather than have a self ping, as that does not add additional load to the message bus and only recreates the queue if it's needed.
I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping feature, but that feels premature if it is. I think it would be better to try to address this by having the sender recreate the queue if the delivery fails, and if that is not viable, then prototype the fix in nova. If the self ping fixes this missing queue error, then we could extract the code into oslo.
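The two recovery options described above (the sender recreating the queue, and the plan B self ping) can be sketched against a toy in-memory broker. This is only an illustration of the control flow: `ToyBroker`, `send_with_recovery`, and `self_ping` are hypothetical names, the `MessageUndeliverable` class here is a local stand-in rather than the oslo.messaging exception, and none of this is the actual nova or oslo.messaging implementation.

```python
class MessageUndeliverable(Exception):
    """Local stand-in for the exception named in the thread."""


class ToyBroker:
    """In-memory stand-in for a message broker (not a real AMQP client)."""

    def __init__(self):
        self.queues = {}  # queue name -> list of pending messages

    def declare_queue(self, name):
        self.queues.setdefault(name, [])

    def drop_queue(self, name):
        # Model the queue disappearing, e.g. after a broker restart.
        self.queues.pop(name, None)

    def publish(self, queue, message, mandatory=False):
        if queue not in self.queues:
            if mandatory:
                raise MessageUndeliverable(queue)
            return  # silently dropped without the mandatory flag
        self.queues[queue].append(message)


def send_with_recovery(broker, queue, message):
    """Sender-side option: recreate the queue and resend once on failure.

    Returns True if the queue had to be recreated.
    """
    try:
        broker.publish(queue, message, mandatory=True)
        return False
    except MessageUndeliverable:
        broker.declare_queue(queue)
        broker.publish(queue, message, mandatory=True)
        return True


def self_ping(broker, own_queue):
    """Plan B: the service pings its own queue on a long timer.

    Returns True if the queue was missing and had to be recreated.
    """
    try:
        broker.publish(own_queue, "ping", mandatory=True)
    except MessageUndeliverable:
        broker.declare_queue(own_queue)
        return True
    # The service would consume and discard its own ping.
    broker.queues[own_queue].pop()
    return False
```

The sender-side path only does extra work when a delivery actually fails, whereas `self_ping` puts one extra message on the bus every interval whether or not anything is wrong, which is the load trade-off mentioned above. Whether the real compute service would resume consuming from a queue recreated by someone else is the open question from the thread and is not modelled here.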
To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
Have a nice day, Johannes