[largescale-sig][nova][neutron][oslo] RPC ping

Sean Mooney smooney at redhat.com
Tue Aug 11 21:20:07 UTC 2020

On Tue, 2020-08-11 at 15:20 -0500, Ben Nemec wrote:
> On 7/28/20 3:02 AM, Johannes Kulik wrote:
> > Hi,
> > 
> > On 7/27/20 7:08 PM, Dan Smith wrote:
> > > 
> > > The primary concern was about something other than nova sitting on our
> > > bus making calls to our internal services. I imagine that the proposal
> > > to bake it into oslo.messaging is for the same purpose, and I'd probably
> > > have the same concern. At the time I think we agreed that if we were
> > > going to support direct-to-service health checks, they should be teensy
> > > HTTP servers with oslo healthchecks middleware. Further loading down
> > > rabbit with those pings doesn't seem like the best plan to
> > > me. Especially since Nova (compute) services already check in over RPC
> > > periodically and the success of that is discoverable en masse through
> > > the API.
> > > 
> > > --Dan
> > > 
> > 
> > While I get this concern, we have seen the problem described by the 
> > original poster in production multiple times: nova-compute reports to be 
> > healthy, is seen as up through the API, but doesn't work on any messages 
> > anymore.
> > A health-check going through rabbitmq would really help spotting those 
> > situations, while having an additional HTTP server doesn't.
> I wonder if this does help though. It seems like a bug that a 
> nova-compute service would stop processing messages and still be seen as 
> up in the service status.
It kind of is a bug; this one, to be precise: https://bugs.launchpad.net/nova/+bug/1854992
>  Do we understand why that is happening?
Assuming it is https://bugs.launchpad.net/nova/+bug/1854992, then the reason
the compute status is still up is that the compute service is running fine and sending heartbeats.
The issue is that under certain failure modes the topic queue used to receive RPC topic sends
can disappear. One way this can happen is if the RabbitMQ server restarts, in which case the resend
code in oslo will reconnect to the exchange but will not necessarily recreate the topic queue.
>  If 
> not, I'm unclear that a ping living at the oslo.messaging layer is going 
> to do a better job of exposing such an outage. The fact that 
> oslo.messaging is responding does not necessarily equate to nova-compute 
> functioning as expected.

Maybe to say that a little more clearly: https://bugs.launchpad.net/nova/+bug/1854992 has other
causes beyond the RabbitMQ server crashing, but the underlying effect is the same: the queue
that the compute service uses to receive RPC calls is destroyed and not recreated. A related
oslo bug, https://bugs.launchpad.net/oslo.messaging/+bug/1661510, was "fixed" by adding the mandatory
transport flag feature. (You can probably mark that as Fix Released, by the way.)

From a nova perspective, the intended way to fix the nova bug was to use the new mandatory flag,
catch the MessageUndeliverable exception, and have the conductor/API recreate the compute
service's topic queue and resend the AMQP message.
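
A minimal sketch of that sender-side flow, under loud assumptions: ToyBroker,
its publish/declare_queue methods, and send_with_queue_recovery are hypothetical
in-memory stand-ins, not oslo.messaging or nova APIs. Only the idea (publish with
the mandatory flag, catch MessageUndeliverable, recreate the queue, resend) comes
from the text above.

```python
class MessageUndeliverable(Exception):
    """Local stand-in for the exception raised when a mandatory send finds no queue."""


class ToyBroker:
    """Hypothetical in-memory broker: a mapping of queue name -> list of messages."""

    def __init__(self):
        self.queues = {}

    def declare_queue(self, name):
        # Idempotent: an existing queue keeps its messages.
        self.queues.setdefault(name, [])

    def publish(self, queue_name, message, mandatory=False):
        if queue_name not in self.queues:
            if mandatory:
                raise MessageUndeliverable(queue_name)
            return False  # without the mandatory flag the message is silently dropped
        self.queues[queue_name].append(message)
        return True


def send_with_queue_recovery(broker, queue_name, message):
    """Publish with mandatory=True; if undeliverable, recreate the queue and resend once."""
    try:
        broker.publish(queue_name, message, mandatory=True)
        return "delivered"
    except MessageUndeliverable:
        # The sender (conductor/API in the email) recreates the missing queue.
        broker.declare_queue(queue_name)
        broker.publish(queue_name, message, mandatory=True)
        return "recreated-and-delivered"
```

In this sketch the queue is only recreated when a send actually fails, so no
extra traffic is generated while the queue exists.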

An open question is whether the compute service will detect that and start processing the queue again.
If that does not fix the problem, plan B was to add a self ping to the compute service,
where the compute service, on a long timeout (once an hour, maybe once every 15 minutes at the most),
would try to send a message to its own receive queue. If it got the MessageUndeliverable exception,
then the compute service would recreate its own queue.
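
A rough sketch of the plan B self ping, again with hypothetical stand-ins:
ToyBroker, ComputeService, and self_ping are illustrative names, not nova or
oslo.messaging code, and the periodic timer that would call self_ping on the
long timeout is left out.

```python
class MessageUndeliverable(Exception):
    """Local stand-in for the exception raised when a mandatory send finds no queue."""


class ToyBroker:
    """Hypothetical in-memory broker: a mapping of queue name -> list of messages."""

    def __init__(self):
        self.queues = {}

    def declare_queue(self, name):
        self.queues.setdefault(name, [])

    def delete_queue(self, name):
        # Simulates the failure mode where the topic queue disappears.
        self.queues.pop(name, None)

    def publish(self, queue_name, message, mandatory=False):
        if queue_name not in self.queues:
            if mandatory:
                raise MessageUndeliverable(queue_name)
            return False
        self.queues[queue_name].append(message)
        return True


class ComputeService:
    """Hypothetical service that owns one receive queue on the broker."""

    def __init__(self, broker, host):
        self.broker = broker
        self.queue_name = "compute." + host
        broker.declare_queue(self.queue_name)

    def self_ping(self):
        """Send a message to our own queue; return True if the queue had to be recreated."""
        try:
            self.broker.publish(self.queue_name, {"method": "self_ping"}, mandatory=True)
            return False
        except MessageUndeliverable:
            self.broker.declare_queue(self.queue_name)
            return True
```

Each call puts one message on the bus even when the queue is healthy, which is
the extra load that makes this the fallback rather than the preferred fix.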

Adding an inter-service ping or triggering the ping externally is unlikely to help with the nova bug.
Ideally we would prefer to have the conductor/API recreate the queue and resend the message if it detects the queue
is missing, rather than have a self ping, as that does not add additional load to the message bus and only recreates the
queue if it is needed.

I'm not sure https://bugs.launchpad.net/nova/+bug/1854992 is the bug that is motivating the creation of this oslo ping
feature, but that feels premature if it is. I think it would be better to try to address this by having the sender recreate the
queue if the delivery fails and, if that is not viable, then prototype the fix in nova. If the self ping fixes this
missing-queue error, then we could extract the code into oslo.

> To be clear, this is not me nacking the ping feature. I just want to 
> make sure we understand what is going on here so we don't add another 
> unreliable healthchecking mechanism to the one we already have.
> > 
> > Have a nice day,
> > Johannes
> > 
