Re: [largescale-sig][nova][neutron][oslo] RPC ping

11 Aug 2020

On 7/28/20 3:02 AM, Johannes Kulik wrote:
...
Hi,
On 7/27/20 7:08 PM, Dan Smith wrote:
...
The primary concern was about something other than nova sitting on our
bus making calls to our internal services. I imagine that the proposal
to bake it into oslo.messaging is for the same purpose, and I'd probably
have the same concern. At the time I think we agreed that if we were
going to support direct-to-service health checks, they should be teensy
HTTP servers with oslo healthchecks middleware. Further loading down
rabbit with those pings doesn't seem like the best plan to
me. Especially since Nova (compute) services already check in over RPC
periodically and the success of that is discoverable en masse through
the API.
--Dan
While I get this concern, we have seen the problem described by the
original poster in production multiple times: nova-compute reports as
healthy and is seen as up through the API, but no longer processes any
messages.
A health check going through rabbitmq would really help spot those
situations, while an additional HTTP server would not.
I wonder if this does help, though. It seems like a bug that a
nova-compute service would stop processing messages and still be seen as
up in the service status. Do we understand why that is happening? If
not, it is not clear to me that a ping living at the oslo.messaging layer
would do a better job of exposing such an outage. The fact that
oslo.messaging is responding does not necessarily mean that nova-compute
is functioning as expected.

To be clear, this is not me nacking the ping feature. I just want to 
make sure we understand what is going on here so we don't add another 
unreliable healthchecking mechanism to the one we already have.
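
To make the failure mode under discussion concrete, here is a toy sketch in
plain Python. It does not use oslo.messaging or nova code; ToyService, its
methods, and demo() are hypothetical names invented for illustration. It
assumes a service with two independent loops: one that periodically checks
in (the "seen as up" signal) and one that consumes messages. A ping sent
through the message path only tells us something if it is answered by the
same loop that does the real work.

```python
import queue
import threading
import time


class ToyService:
    """Hypothetical stand-in for a service with a heartbeat and a consumer."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.last_heartbeat = 0.0
        self.stalled = threading.Event()  # set to simulate a hung consumer
        self._stop = threading.Event()
        self._threads = [
            threading.Thread(target=self._heartbeat_loop, daemon=True),
            threading.Thread(target=self._consume_loop, daemon=True),
        ]

    def start(self):
        for thread in self._threads:
            thread.start()

    def stop(self):
        self._stop.set()

    def _heartbeat_loop(self):
        # Periodic check-in that runs independently of message processing.
        while not self._stop.is_set():
            self.last_heartbeat = time.monotonic()
            time.sleep(0.05)

    def _consume_loop(self):
        while not self._stop.is_set():
            if self.stalled.is_set():
                time.sleep(0.05)  # hung: messages stay in the inbox
                continue
            try:
                reply_queue = self.inbox.get(timeout=0.05)
            except queue.Empty:
                continue
            reply_queue.put("pong")

    def reports_up(self, max_age=0.5):
        # "Up" only means the heartbeat loop ran recently.
        return time.monotonic() - self.last_heartbeat < max_age

    def ping(self, timeout=0.5):
        # Round trip through the same inbox the consumer reads from.
        reply_queue = queue.Queue(maxsize=1)
        self.inbox.put(reply_queue)
        try:
            return reply_queue.get(timeout=timeout) == "pong"
        except queue.Empty:
            return False


def demo():
    service = ToyService()
    service.start()
    time.sleep(0.1)
    healthy = (service.reports_up(), service.ping())
    service.stalled.set()  # consumer hangs, heartbeat keeps running
    time.sleep(0.2)
    stalled = (service.reports_up(), service.ping())
    service.stop()
    return healthy, stalled
```

In this sketch, demo() should show the stalled service still reporting up
while the ping times out. The caveat above applies as well: if the ping were
answered by a separate thread or layer rather than by the consumer loop, it
would keep succeeding just like the heartbeat does.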
...
Have a nice day,
Johannes

Ben Nemec