[largescale-sig][nova][neutron][oslo] RPC ping
Ben Nemec
openstack at nemebean.com
Tue Aug 11 20:20:43 UTC 2020
On 7/28/20 3:02 AM, Johannes Kulik wrote:
> Hi,
>
> On 7/27/20 7:08 PM, Dan Smith wrote:
>>
>> The primary concern was about something other than nova sitting on our
>> bus making calls to our internal services. I imagine that the proposal
>> to bake it into oslo.messaging is for the same purpose, and I'd probably
>> have the same concern. At the time I think we agreed that if we were
>> going to support direct-to-service health checks, they should be teensy
>> HTTP servers with oslo healthchecks middleware. Further loading down
>> rabbit with those pings doesn't seem like the best plan to
>> me. Especially since Nova (compute) services already check in over RPC
>> periodically and the success of that is discoverable en masse through
>> the API.
>>
>> --Dan
>>
>
> While I get this concern, we have seen the problem described by the
> original poster in production multiple times: nova-compute reports to be
> healthy, is seen as up through the API, but doesn't work on any messages
> anymore.
> A health-check going through rabbitmq would really help spotting those
> situations, while having an additional HTTP server doesn't.

I wonder if this would actually help, though. It seems like a bug that a
nova-compute service can stop processing messages and still be reported
as up in the service status. Do we understand why that is happening? If
not, I'm not convinced that a ping living at the oslo.messaging layer
would do a better job of exposing such an outage. The fact that
oslo.messaging is responding does not necessarily mean that nova-compute
is functioning as expected.

To be clear, this is not me nacking the ping feature. I just want to
make sure we understand what is going on here so we don't add another
unreliable healthchecking mechanism alongside the one we already have.
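
To make the concern concrete, here is a toy sketch in plain Python
(standard library only). It is not oslo.messaging or nova code, and
ToyService, transport_ping, and app_ping are made-up names used purely
for illustration. It assumes a ping answered by the messaging layer
itself, next to a worker thread that does the real message handling:

```python
import queue
import threading


class ToyService:
    """Toy model of a service; NOT real oslo.messaging or nova code."""

    def __init__(self):
        self.inbox = queue.Queue()
        # While this event is set the worker makes progress; clearing it
        # simulates a worker that is stuck and no longer handles messages.
        self.healthy = threading.Event()
        self.healthy.set()
        threading.Thread(target=self._work, daemon=True).start()

    def _work(self):
        # Application-level message handling loop.
        while True:
            reply_box = self.inbox.get()
            self.healthy.wait()  # blocks for as long as the worker is wedged
            reply_box.put("pong")

    def wedge(self):
        """Simulate the application worker getting stuck."""
        self.healthy.clear()

    def transport_ping(self):
        # Hypothetical ping answered by the messaging layer alone;
        # it never touches the application worker.
        return True

    def app_ping(self, timeout):
        # Hypothetical ping that must be handled by the application worker.
        reply_box = queue.Queue()
        self.inbox.put(reply_box)
        try:
            reply_box.get(timeout=timeout)
            return True
        except queue.Empty:
            return False
```

In this sketch, once wedge() has been called, transport_ping() keeps
returning True while app_ping() times out and returns False. Whether a
real oslo.messaging ping would behave like the first or the second case
depends on where in the stack it gets answered, which is the part I'd
like us to understand first.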
>
> Have a nice day,
> Johannes
>