[nova][cinder][neutron] Heartbeat improvement: checking if main thread responds to rabbit ping
tl;dr: I want to implement a main thread ping in the heartbeat thread via rabbitmq, to make sure the main thread is alive when reporting a service status. I'd like to have opinions.

Hi,

I already wrote about this on #openstack-cinder, but nobody replied to me. Anyway, I think this topic is cross-project, so I want to open it on this list.

After a rabbitmq outage due to a full disk in our infra, we restarted our rabbitmq cluster. Everything seemed to come back normally, but we later discovered that the main threads of cinder-volume and cinder-backup were dead, while the heartbeat thread was not, and was reporting the services as up. It took us some time to realize it, as this cluster isn't a public cloud, and we could live with the errors. Still, we don't want to leave things the way they are, as we want reliable monitoring to avoid painful support tickets from customers.

Within Cinder, all the infrastructure to enhance the heartbeat thread and make it do more checks is already in place. However, currently, the Manager class isn't even instantiated and used when creating the service.

Here's my proposal, currently only for Cinder, but hopefully I'll also implement it in Nova and Neutron.

When creating cinder-volume using cinder/service.py, it will also create a manager and attach it to the service. The ManagerVolume class will implement the is_working() method of the Manager class. In it, it will ping the main thread through rabbitmq, sending a ping message to which the main thread must respond with a pong. If the pong isn't received within a timeout, is_working() will return False.

I started implementing it here: https://review.opendev.org/c/openstack/cinder/+/970796

Note that it's still a WIP, and there's more to work on, like some unit tests that are currently failing. I'm not posting the link to get the patch merged yet, as it's unfinished, but the current patch is probably enough to get the idea of what I want to implement.

So, for each service, the heartbeat will do a ping, which will of course create more messages in Rabbit. I believe this is still a reasonable thing to do: with 200 Cinder services and a heartbeat every 30 seconds per service, that's fewer than 15 additional messages per second. Considering a reasonable RabbitMQ service can eat 100k messages per second, it should be OK.

Now, this is for Cinder. Once it's done there, I intend to investigate Neutron next, but I'm not sure if it has all the facilities that Cinder offers. Maybe it'd be worth unifying this in oslo.messaging? I'm open to suggestions, but I'm unsure (yet) of what it would look like within one of the oslo libs (I'd have to look at how / where to do it: suggestions are welcome, as it would speed up my work on it).

Please let me know your opinion on this, whether it is a good idea or not, and whether the Cinder team (and potentially Neutron and Nova) would be willing to accept such a patch.

Cheers,

Thomas Goirand (zigo)
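[Editor's note: to make the proposal concrete, here is a minimal sketch of what the heartbeat-side check could look like, using only the public oslo.messaging API. The 'ping' method name and the topic/server values are assumptions for illustration; this is not the code from the patch linked above.]

    # Minimal sketch, not the actual patch: the heartbeat thread sends an RPC
    # "ping" to the service's own topic and treats a timeout as a dead main
    # thread. The 'ping' method name and topic/server values are assumptions.
    import oslo_messaging


    def main_thread_is_alive(conf, topic, server, timeout=10):
        """Return True only if the main RPC thread answers a ping in time."""
        transport = oslo_messaging.get_rpc_transport(conf)
        target = oslo_messaging.Target(topic=topic, server=server)
        client = oslo_messaging.get_rpc_client(transport, target, timeout=timeout)
        try:
            # The main thread runs the RPC server; if it is dead or stuck,
            # the ping is never consumed and this call times out.
            client.call({}, 'ping')
            return True
        except oslo_messaging.MessagingTimeout:
            return False

An is_working() built on this would then be little more than a call such as main_thread_is_alive(CONF, 'cinder-volume', self.host), with the heartbeat thread reporting the service as down when it returns False.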
Hey,

This sounds like what we introduced years ago with rpc_ping_enabled (see [1] and [2]). Have you tried it?

Note that we used to have it for years in our production clusters, but we finally disabled it for two reasons:

1- it was sending a lot of RMQ messages, because we were monitoring all our agents with this, not only the workers.

2- it was not catching all use cases: the way we implemented it, only one thread was waiting for ping requests. And most of the time, the ping thread was working correctly, even if some other threads (green threads... eventlet) were stuck / dead.

Cheers,

[1] https://review.opendev.org/c/openstack/oslo.messaging/+/735385
[2] https://docs.openstack.org/neutron/2024.1/configuration/neutron.html#DEFAULT...

On 12.12.25 - 12:14, Thomas Goirand wrote:
tl;dr: I want to implement a main thread ping in the heartbeat thread via rabbitmq, to make sure the main thread is alive when reporting a service status. I'd like to have opinions.
Hi,
I already wrote about this on #openstack-cinder, but nobody replied to me. Anyways, I think it's a topic that is cross-project, so I want to open it in this list.
After a rabbitmq outage due to a disk full in our infra, we restarted our rabbitmq cluster. Everything seemed to come back normally, but we later discovered that the main threads of cinder-volume and cinder-backup were dead, while the heartbeat thread was not, and was reporting the services as up. It took us some time to realize it, as this cluster isn't public cloud, and we could live with the errors. Though we don't want to leave things the way they are, as we want reliable monitoring to avoid painful support tickets from customers.
Within Cinder, there's already all the infrastructure in place to enhance the heartbeat thread, and make it do more checks. However, currently, the Manager class isn't even instantiated and used when creating the service.
Here's my proposal, currently only for Cinder, but hopefully, I'll also implement it in Nova and Neutron.
When creating the cinder-volume using cinder/service.py, it's going to also create a manager, and attach it to the service. The ManagerVolume class will implement the is_working() method of the class Manager. In it, it's going to ping the main thread with rabbitmq, sending a ping message, to which the main thread must respond with a pong. If the pong isn't received within a timeout, is_working() will reply False.
I started implementing it here: https://review.opendev.org/c/openstack/cinder/+/970796
Note that it's currently still a WIP, and there's more to work on, like some unit tests that are currently failing. I'm not posting the link to the patch to get a merge yet, and it's unfinished, but the current patch is probably enough to get the idea of what I want to implement.
So, for each service, the heartbeat will do a ping, which will of course make more messages in Rabbit. Though I believe this is still a reasonable thing to do. With 200 Cinder services, and a heartbeat every 30 seconds per service, that's fewer than 15 additional messages per second. Considering a reasonable RabbitMQ service can eat 100k messages per second, it should be ok.
Now, this is for Cinder. Once it's done there, I intend to investigate Neutron next, but I'm not sure if it has all the facilities that Cinder offers. Maybe it'd be worth unifying this in oslo.messaging? I'm open to suggestions, but I'm unsure (yet) of what it would look like within one of the oslo libs (I'd have to look on how / where to do it: suggestions are welcome, as it would speed-up my work on it).
Please let me know your opinion on this, if this is a good idea or not, and if the Cinder team (and potentially Neutron and Nova) will be willing to accept such a patch.
Cheers,
Thomas Goirand (zigo)
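[Editor's note: for context, the rpc_ping_enabled feature Arnaud refers to is an oslo.messaging option ([1] above): when it is set to true in the service's configuration, the RPC server gains a built-in endpoint answering a method named oslo_rpc_server_ping (option and method names as documented in oslo.messaging; verify them against your release). An external monitor can then do roughly the same round trip as in the earlier sketch; the topic and server values below are examples only.]

    # Hypothetical external probe of the built-in oslo.messaging ping
    # endpoint; assumes rpc_ping_enabled=true on the target service, and
    # that the topic/server values below match your deployment.
    import oslo_messaging
    from oslo_config import cfg

    cfg.CONF(['--config-file', '/etc/cinder/cinder.conf'])
    transport = oslo_messaging.get_rpc_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='cinder-volume', server='myhost@lvm')
    client = oslo_messaging.get_rpc_client(transport, target, timeout=10)
    try:
        client.call({}, 'oslo_rpc_server_ping')
        print('pong received')
    except oslo_messaging.MessagingTimeout:
        print('no pong within 10s')

As Arnaud notes below, this only proves that the thread serving the ping endpoint is alive, not that every other (green) thread is healthy, which is precisely the gap Thomas's proposal tries to close by answering the ping from the main thread.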
Hi Arnaud,

Thanks for your reply.

On 12/16/25 10:27 AM, Arnaud Morin wrote:
Hey,
This sounds like what we introduced years ago with rpc_ping_enabled (see [1], and [2])
Have you tried it?
Note that we used to have it for years in our production clusters, but we finally disabled it for two reasons: 1- it was sending a lot of RMQ messages, because we were monitoring all our agents with this, not only the workers.
According to my calculation, it should be OK with our workload (maybe we'll get 10 messages per second).
2- it was not catching all use cases: the way we implemented it, only one thread was waiting for ping requests. And most of the time, the ping thread was working correctly, even if some other threads (green threads... eventlet) were stuck / dead.
Indeed. As we've experienced the heartbeat thread being alive while the main thread was dead, this is exactly what I'm trying to avoid: I am trying to implement the ping reply in the *main* thread, not in the thread doing the heartbeat, nor in a thread dedicated to replying to pings.

It looks like what I wrote somehow worked: I could see the ping/pong in the cinder-volume logs of the OpenStack CI. That said, it also looks like I implemented it in the wrong class: I believe I should have just modified the is_working() of VolumeManager in cinder/volume/manager.py, instead of cinder/manager.py and cinder/cmd/volume.py.

Now, all is broken again, and I have to fix my patch again. Let's see where this leads me...

Cheers,

Thomas Goirand (zigo)
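[Editor's note: for illustration, here is a self-contained toy of the responder side under the assumptions above. In the real change, the handler would presumably live in VolumeManager (cinder/volume/manager.py), with is_working() doing the client-side round trip sketched earlier; the class, method and topic names below are illustrative, not the actual Cinder code or the patch.]

    # Toy responder: the manager object is itself an RPC endpoint, so its
    # ping() is only dispatched while the main thread's RPC server keeps
    # consuming messages. ToyVolumeManager and ping() are illustrative names.
    import oslo_messaging


    class ToyVolumeManager(object):
        """Stand-in for cinder.volume.manager.VolumeManager."""

        target = oslo_messaging.Target(version='1.0')

        def ping(self, context):
            # Runs inside the main thread's RPC server; if that thread is
            # dead, this never executes and the caller's timeout fires.
            return 'pong'


    def run_volume_service(conf, topic, host):
        transport = oslo_messaging.get_rpc_transport(conf)
        target = oslo_messaging.Target(topic=topic, server=host)
        server = oslo_messaging.get_rpc_server(
            transport, target, endpoints=[ToyVolumeManager()])
        # Start consuming and dispatching RPC calls; in a real service the
        # main thread then waits on this server.
        server.start()
        return server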