[largescale-sig][nova][neutron][oslo] RPC ping

Ben Nemec openstack at nemebean.com
Tue Aug 11 20:20:43 UTC 2020



On 7/28/20 3:02 AM, Johannes Kulik wrote:
> Hi,
> 
> On 7/27/20 7:08 PM, Dan Smith wrote:
>>
>> The primary concern was about something other than nova sitting on our
>> bus making calls to our internal services. I imagine that the proposal
>> to bake it into oslo.messaging is for the same purpose, and I'd probably
>> have the same concern. At the time I think we agreed that if we were
>> going to support direct-to-service health checks, they should be teensy
>> HTTP servers with oslo healthchecks middleware. Further loading down
>> rabbit with those pings doesn't seem like the best plan to
>> me. Especially since Nova (compute) services already check in over RPC
>> periodically and the success of that is discoverable en masse through
>> the API.
>>
>> --Dan
>>
> 
> While I get this concern, we have seen the problem described by the
> original poster in production multiple times: nova-compute reports itself
> as healthy, is seen as up through the API, but no longer processes any
> messages.
> A health check going through rabbitmq would really help spot those
> situations, while an additional HTTP server wouldn't.

I wonder whether this would actually help, though. It seems like a bug
that a nova-compute service can stop processing messages and still be
seen as up in the service status. Do we understand why that is
happening? If not, it's not clear to me that a ping living at the
oslo.messaging layer is going to do a better job of exposing such an
outage. The fact that oslo.messaging is responding does not necessarily
mean that nova-compute is functioning as expected.
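
To illustrate the distinction, here is a minimal toy sketch in plain
Python. It does not use oslo.messaging or nova at all; ToyService,
transport_ping, service_ping and the "running" flag are hypothetical
names made up for this example. It only assumes that a ping answered by
the messaging layer never touches the service's own worker, which may or
may not match how the proposed ping would be implemented.

```python
import queue
import threading


class ToyService:
    """Hypothetical stand-in for a service that consumes messages."""

    def __init__(self):
        self.work = queue.Queue()        # messages waiting for the worker
        self.running = threading.Event()
        self.running.set()               # cleared to simulate a hung worker
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            reply = self.work.get()      # take the next message
            self.running.wait()          # blocks forever while "hung"
            reply.put("pong")

    def transport_ping(self):
        # Answered by the messaging layer itself (assumption of this
        # sketch): the worker thread is never involved.
        return True

    def service_ping(self, timeout=0.2):
        # Travels through the same queue as real work, so it only
        # succeeds if the worker is actually processing messages.
        reply = queue.Queue(maxsize=1)
        self.work.put(reply)
        try:
            return reply.get(timeout=timeout) == "pong"
        except queue.Empty:
            return False


svc = ToyService()
healthy = (svc.transport_ping(), svc.service_ping())
svc.running.clear()                      # the worker stops making progress
hung = (svc.transport_ping(), svc.service_ping())
print(healthy, hung)
```

In this toy, only the ping that goes through the worker's queue notices
the hang. Whether a real oslo.messaging-level ping would behave like the
first or the second kind is exactly the question I'd like answered.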

To be clear, this is not me nacking the ping feature. I just want to
make sure we understand what is going on here so we don't add another
unreliable healthchecking mechanism on top of the one we already have.

> 
> Have a nice day,
> Johannes
> 


