Hi Arnaud,

Thanks for your reply.

On 12/16/25 10:27 AM, Arnaud Morin wrote:
> Hey,
>
> This sounds like what we introduced years ago with rpc_ping_enabled
> (see [1] and [2]).
> Have you tried it?
>
> Note that we had it enabled for years in our production clusters, but
> we finally disabled it for two reasons:
> 1- it was sending a lot of RMQ messages, because we were monitoring all
> our agents with this, not only the workers.

According to my calculation, it should be OK with our workload (maybe we'll get 10 messages per second).

> 2- it was not catching all cases: the way we implemented it, only one
> thread was waiting for ping requests. And most of the time, the ping
> thread was working correctly, even if some other threads (green
> threads... eventlet) were stuck / dead.
Indeed. As we've experienced the heartbeat thread being alive while the main thread was dead, this is exactly what I'm trying to avoid: I am trying to implement the ping reply in the *main* thread, not in the thread doing the heartbeat, nor in a thread dedicated to replying to pings.

It looks like what I wrote somehow worked: I could see the ping/pong in the cinder-volume logs of the OpenStack CI. However, it also looks like I implemented it in the wrong class. I believe I should have just modified the is_working() of VolumeManager in cinder/volume/manager.py, instead of touching cinder/manager.py and cinder/cmd/volume.py.

Now, all is broken again, and I have to fix my patch again. Let's see where this leads me...

Cheers,

Thomas Goirand (zigo)
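
PS: To illustrate the difference discussed above, here is a minimal sketch of why a dedicated ping responder can report "alive" while the main loop is hung, and why answering from the main loop does not. It is not cinder or oslo.messaging code: it assumes plain OS threads instead of eventlet green threads, and probe(), main_loop() and ping_loop() are hypothetical names made up for this example.

```python
import queue
import threading
import time


def probe(reply_in_main_loop, main_stuck, timeout=0.5):
    """Return True if a ping gets a pong within `timeout` seconds.

    All names here are hypothetical; nothing is taken from cinder
    or oslo.messaging.
    """
    pings = queue.Queue()      # each item is a per-request reply queue
    stop = threading.Event()

    def answer_one():
        # Answer at most one pending ping, waiting briefly for one to arrive.
        try:
            pings.get(timeout=0.05).put("pong")
        except queue.Empty:
            pass

    def main_loop():
        while not stop.is_set():
            if main_stuck:
                time.sleep(0.05)   # simulated hang: no work and no pongs
            elif reply_in_main_loop:
                answer_one()       # pong is sent from the main loop itself
            else:
                time.sleep(0.05)   # stand-in for real work

    def ping_loop():
        # Dedicated responder: keeps answering whatever main_loop does.
        while not stop.is_set():
            answer_one()

    threads = [threading.Thread(target=main_loop, daemon=True)]
    if not reply_in_main_loop:
        threads.append(threading.Thread(target=ping_loop, daemon=True))
    for t in threads:
        t.start()

    # Send one ping and wait for the pong.
    reply = queue.Queue()
    pings.put(reply)
    try:
        answered = reply.get(timeout=timeout) == "pong"
    except queue.Empty:
        answered = False

    stop.set()
    for t in threads:
        t.join()
    return answered


if __name__ == "__main__":
    print("dedicated thread, main stuck:", probe(False, True))
    print("main-loop reply, main stuck: ", probe(True, True))
    print("main-loop reply, main alive: ", probe(True, False))
```

With a dedicated responder, the ping is answered even though the main loop is hung, so the monitoring sees a healthy service. When the pong can only come from the main loop, the same hang makes the ping time out, which is the behaviour I'm after.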