On 7/28/20 3:02 AM, Johannes Kulik wrote:
Hi,
On 7/27/20 7:08 PM, Dan Smith wrote:
The primary concern was about something other than nova sitting on our bus making calls to our internal services. I imagine that the proposal to bake it into oslo.messaging is for the same purpose, and I'd probably have the same concern. At the time I think we agreed that if we were going to support direct-to-service health checks, they should be teensy HTTP servers with oslo healthchecks middleware. Further loading down rabbit with those pings doesn't seem like the best plan to me. Especially since Nova (compute) services already check in over RPC periodically and the success of that is discoverable en masse through the API.
--Dan
While I get this concern, we have seen the problem described by the original poster in production multiple times: nova-compute reports itself as healthy and is seen as up through the API, but no longer processes any messages. A health check going through rabbitmq would really help spot those situations, while an additional HTTP server would not.
I wonder if this does help though. It seems like a bug that a nova-compute service would stop processing messages and still be seen as up in the service status. Do we understand why that is happening? If not, it's not clear to me that a ping living at the oslo.messaging layer is going to do a better job of exposing such an outage. The fact that oslo.messaging is responding does not necessarily mean that nova-compute is functioning as expected. To be clear, this is not me nacking the ping feature. I just want to make sure we understand what is going on here so we don't add another unreliable healthchecking mechanism to the one we already have.
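To illustrate the distinction I mean, here is a minimal sketch in plain Python. It is not nova or oslo.messaging code: WorkerHealth and all of its methods are hypothetical names, and the 60-second stall threshold is an arbitrary assumption. The only point is that a ping answered at the transport level can keep succeeding while the code that handles messages is stuck, so a useful check probably has to be tied to actual handler progress.

```python
import time


class WorkerHealth:
    """Hypothetical helper that tracks progress of a message-handling loop."""

    def __init__(self, stall_after=60.0, clock=time.monotonic):
        self._stall_after = stall_after  # assumed threshold, in seconds
        self._clock = clock
        self._last_progress = clock()
        self._in_flight = 0

    def message_received(self):
        # Call when a message is taken off the queue. Starting from idle
        # resets the timer so that idle time is not counted as a stall.
        if self._in_flight == 0:
            self._last_progress = self._clock()
        self._in_flight += 1

    def message_done(self):
        # Call when a handler returns, whether it succeeded or failed.
        self._in_flight -= 1
        self._last_progress = self._clock()

    def transport_ping(self):
        # Stand-in for a ping answered at the messaging layer: it only
        # shows that the process is alive and reachable.
        return True

    def is_processing(self):
        # Healthy when idle, or when in-flight work finished recently.
        if self._in_flight == 0:
            return True
        return self._clock() - self._last_progress < self._stall_after
```

With something like this, a service whose handler has been blocked for longer than the threshold would still answer transport_ping() but would report False from is_processing(), which is the situation Johannes describes.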
Have a nice day, Johannes