[nova][cinder][neutron] Heartbeat improvement: checking if main thread responds to rabbit ping
tl;dr: I want to implement a main thread ping in the heartbeat thread via rabbitmq, to make sure the main thread is alive when reporting a service status. I'd like to have opinions.

Hi,

I already wrote about this on #openstack-cinder, but nobody replied to me. Anyway, I think this topic is cross-project, so I want to open it on this list.

After a rabbitmq outage due to a full disk in our infra, we restarted our rabbitmq cluster. Everything seemed to come back normally, but we later discovered that the main threads of cinder-volume and cinder-backup were dead, while the heartbeat thread was not, and was reporting the services as up. It took us some time to realize it, as this cluster isn't a public cloud, and we could live with the errors. Still, we don't want to leave things the way they are, as we want reliable monitoring to avoid painful support tickets from customers.

Within Cinder, all the infrastructure to enhance the heartbeat thread and make it do more checks is already in place. However, currently, the Manager class isn't even instantiated and used when creating the service.

Here's my proposal, currently only for Cinder, but hopefully I'll also implement it in Nova and Neutron.

When creating cinder-volume using cinder/service.py, it will also create a manager and attach it to the service. The ManagerVolume class will implement the is_working() method of the Manager class. In it, it will ping the main thread through rabbitmq, sending a ping message to which the main thread must respond with a pong. If the pong isn't received within a timeout, is_working() will return False.

I started implementing it here: https://review.opendev.org/c/openstack/cinder/+/970796

Note that it's still a WIP, and there's more to work on, like some unit tests that are currently failing. I'm not posting the link to get the patch merged yet, as it's unfinished, but the current patch is probably enough to get the idea of what I want to implement.

So, for each service, the heartbeat will do a ping, which will of course create more messages in Rabbit. I believe this is still a reasonable thing to do: with 200 Cinder services and a heartbeat every 30 seconds per service, that's fewer than 15 additional messages per second. Considering a reasonable RabbitMQ service can eat 100k messages per second, it should be OK.

Now, this is for Cinder. Once it's done there, I intend to investigate Neutron next, but I'm not sure if it has all the facilities that Cinder offers. Maybe it'd be worth unifying this in oslo.messaging? I'm open to suggestions, but I'm unsure (yet) of what it would look like within one of the oslo libs (I'd have to look at how / where to do it: suggestions are welcome, as it would speed up my work on it).

Please let me know your opinion on this, whether it is a good idea or not, and whether the Cinder team (and potentially Neutron and Nova) would be willing to accept such a patch.

Cheers,

Thomas Goirand (zigo)
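[Editor's note: to make the proposal concrete, here is a minimal sketch of what the heartbeat-side check could look like, using only the public oslo.messaging API. The 'ping' method name and the topic/server values are assumptions for illustration; this is not the code from the patch linked above.]

    # Minimal sketch, not the actual patch: the heartbeat thread sends an RPC
    # "ping" to the service's own topic and treats a timeout as a dead main
    # thread. The 'ping' method name and topic/server values are assumptions.
    import oslo_messaging


    def main_thread_is_alive(conf, topic, server, timeout=10):
        """Return True only if the main RPC thread answers a ping in time."""
        transport = oslo_messaging.get_rpc_transport(conf)
        target = oslo_messaging.Target(topic=topic, server=server)
        client = oslo_messaging.get_rpc_client(transport, target, timeout=timeout)
        try:
            # The main thread runs the RPC server; if it is dead or stuck,
            # the ping is never consumed and this call times out.
            client.call({}, 'ping')
            return True
        except oslo_messaging.MessagingTimeout:
            return False

An is_working() built on this would then be little more than a call such as main_thread_is_alive(CONF, 'cinder-volume', self.host), with the heartbeat thread reporting the service as down when it returns False.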
Hey,

This sounds like what we introduced years ago with rpc_ping_enabled (see [1] and [2]). Have you tried it?

Note that we used to have it for years in our production clusters, but we finally disabled it for two reasons:

1- it was sending a lot of RMQ messages, because we were monitoring all our agents with this, not only the workers.

2- it was not catching all use cases: the way we implemented it, only one thread was waiting for ping requests. And most of the time, the ping thread was working correctly, even if some other threads (green threads... eventlet) were stuck / dead.

Cheers,

[1] https://review.opendev.org/c/openstack/oslo.messaging/+/735385
[2] https://docs.openstack.org/neutron/2024.1/configuration/neutron.html#DEFAULT...

On 12.12.25 - 12:14, Thomas Goirand wrote:
tl;dr: I want to implement a main thread ping in the heartbeat thread via rabbitmq, to make sure the main thread is alive when reporting a service status. I'd like to have opinions.
Hi,
I already wrote about this on #openstack-cinder, but nobody replied to me. Anyways, I think it's a topic that is cross-project, so I want to open it in this list.
After a rabbitmq outage due to a disk full in our infra, we restarted our rabbitmq cluster. Everything seemed to come back normally, but we later discovered that the main threads of cinder-volume and cinder-backup were dead, while the heartbeat thread was not, and was reporting the services as up. It took us some time to realize it, as this cluster isn't public cloud, and we could live with the errors. Though we don't want to leave things the way they are, as we want reliable monitoring to avoid painful support tickets from customers.
Within Cinder, there's already all the infrastructure in place to enhance the heartbeat thread, and make it do more checks. However, currently, the Manager class isn't even instantiated and used when creating the service.
Here's my proposal, currently only for Cinder, but hopefully, I'll also implement it in Nova and Neutron.
When creating the cinder-volume using cinder/service.py, it's going to also create a manager, and attach it to the service. The ManagerVolume class will implement the is_working() method of the class Manager. In it, it's going to ping the main thread with rabbitmq, sending a ping message, to which the main thread must respond with a pong. If the pong isn't received within a timeout, is_working() will reply False.
I started implementing it here: https://review.opendev.org/c/openstack/cinder/+/970796
Note that it's currently still a WIP, and there's more to work on, like some unit tests that are currently failing. I'm not posting the link to the patch to get a merge yet, and it's unfinished, but the current patch is probably enough to get the idea of what I want to implement.
So, for each service, the heartbeat will do a ping, which will of course make more messages in Rabbit. Though I believe this is still a reasonable thing to do. With 200 Cinder services, and a heartbeat every 30 seconds per service, that's fewer than 15 additional messages per second. Considering a reasonable RabbitMQ service can eat 100k messages per second, it should be ok.
Now, this is for Cinder. Once it's done there, I intend to investigate Neutron next, but I'm not sure if it has all the facilities that Cinder offers. Maybe it'd be worth unifying this in oslo.messaging? I'm open to suggestions, but I'm unsure (yet) of what it would look like within one of the oslo libs (I'd have to look on how / where to do it: suggestions are welcome, as it would speed-up my work on it).
Please let me know your opinion on this, if this is a good idea or not, and if the Cinder team (and potentially Neutron and Nova) will be willing to accept such a patch.
Cheers,
Thomas Goirand (zigo)
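[Editor's note: for context, the rpc_ping_enabled feature Arnaud refers to is an oslo.messaging option ([1] above): when it is set to true in the service's configuration, the RPC server gains a built-in endpoint answering a method named oslo_rpc_server_ping (option and method names as documented in oslo.messaging; verify them against your release). An external monitor can then do roughly the same round trip as in the earlier sketch; the topic and server values below are examples only.]

    # Hypothetical external probe of the built-in oslo.messaging ping
    # endpoint; assumes rpc_ping_enabled=true on the target service, and
    # that the topic/server values below match your deployment.
    import oslo_messaging
    from oslo_config import cfg

    cfg.CONF(['--config-file', '/etc/cinder/cinder.conf'])
    transport = oslo_messaging.get_rpc_transport(cfg.CONF)
    target = oslo_messaging.Target(topic='cinder-volume', server='myhost@lvm')
    client = oslo_messaging.get_rpc_client(transport, target, timeout=10)
    try:
        client.call({}, 'oslo_rpc_server_ping')
        print('pong received')
    except oslo_messaging.MessagingTimeout:
        print('no pong within 10s')

As Arnaud notes below, this only proves that the thread serving the ping endpoint is alive, not that every other (green) thread is healthy, which is precisely the gap Thomas's proposal tries to close by answering the ping from the main thread.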
Hi Arnaud,

Thanks for your reply.

On 12/16/25 10:27 AM, Arnaud Morin wrote:
Hey,
This sounds like what we introduced years ago with rpc_ping_enabled (see [1], and [2])
Have you tried it?
Note that we used to have it for years in our production clusters, but we finally disabled it for two reasons: 1- it was sending a lot of RMQ messages, because we were monitoring all our agents with this, not only the workers.
According to my calculation, it should be OK with our workload (maybe we'll get 10 messages per second).
2- it was not catching all use cases: the way we implemented it, only one thread was waiting for ping requests. And most of the time, the ping thread was working correctly, even if some other threads (green threads... eventlet) were stuck / dead.
Indeed. As we've experienced the heartbeat thread being alive while the main thread was dead, this is exactly what I'm trying to avoid: I am trying to implement the ping reply in the *main* thread, not in the thread doing the heartbeat, nor in a thread dedicated to replying to pings.

It looks like what I wrote somehow worked: I could see the ping/pong in the cinder-volume logs of the OpenStack CI. That said, it also looks like I implemented it in the wrong class: I believe I should have just modified the is_working() of VolumeManager in cinder/volume/manager.py, instead of cinder/manager.py and cinder/cmd/volume.py.

Now, all is broken again, and I have to fix my patch again. Let's see where this leads me...

Cheers,

Thomas Goirand (zigo)
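[Editor's note: for illustration, here is a self-contained toy of the responder side under the assumptions above. In the real change, the handler would presumably live in VolumeManager (cinder/volume/manager.py), with is_working() doing the client-side round trip sketched earlier; the class, method and topic names below are illustrative, not the actual Cinder code or the patch.]

    # Toy responder: the manager object is itself an RPC endpoint, so its
    # ping() is only dispatched while the main thread's RPC server keeps
    # consuming messages. ToyVolumeManager and ping() are illustrative names.
    import oslo_messaging


    class ToyVolumeManager(object):
        """Stand-in for cinder.volume.manager.VolumeManager."""

        target = oslo_messaging.Target(version='1.0')

        def ping(self, context):
            # Runs inside the main thread's RPC server; if that thread is
            # dead, this never executes and the caller's timeout fires.
            return 'pong'


    def run_volume_service(conf, topic, host):
        transport = oslo_messaging.get_rpc_transport(conf)
        target = oslo_messaging.Target(topic=topic, server=host)
        server = oslo_messaging.get_rpc_server(
            transport, target, endpoints=[ToyVolumeManager()])
        # Start consuming and dispatching RPC calls; in a real service the
        # main thread then waits on this server.
        server.start()
        return server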