Hi,
I am not an expert in these matters, but we recently suffered from client disconnections in RabbitMQ due to the heartbeat timeout, and I can confirm it was a disaster for cloud usage, with many things not working properly (we are running Antelope, except for Barbican/Heat/Magnum, where we run Caracal).
The cause is still not clear to me. It was fixed by increasing the heartbeat timeout, but at the same time my colleague who implemented the change also set heartbeat_in_pthread=true for all the services, something normally unnecessary as we configure uwsgi or Apache to use only one thread (and several processes). Initially we didn't see any bad impact from this setting, but a few days ago users started to report that Magnum cluster creation was failing due to a "response timeout" in Heat during the master software deployment.
Reading this thread this morning, I wondered whether this setting could be the source of the problem (as the service was running properly a couple of weeks ago, before the change). We reverted the change and set heartbeat_in_pthread=false, and it restored the normal behaviour of Heat. We have not seen a negative impact on other services so far.
So I confirm that setting this parameter to false by default seems a good idea and that setting it to true can break some services like Heat.
Cheers,
Michel
On 01/10/2024 at 16:31, Arnaud Morin wrote:
Hey,
I totally agree about the fact that heartbeat_in_pthread and the oslo.log PipeMutex are technical debt that we need to get rid of, as well as eventlet.
However, although it seems purely cosmetic on your side, we believe it's not. I can't prove or reproduce the issue on a small infra, but at large scale, having those TCP connections dropped by RabbitMQ and recreated in a loop by agents definitely affects the cluster.
I know all the pain that these settings introduced in the past, but now I feel we are in a stable situation regarding this, which is why I am surprised that heartbeat_in_pthread is being deprecated now.
Can we, at least, make sure we keep all of this until we switch off eventlet? In other words, can we get rid of eventlet first, then remove these params, and not the opposite?
Regards,
Arnaud
On 01.10.24 - 11:38, smooney@redhat.com wrote:
I'm glad you managed to make it work, but from a nova perspective we do not recommend using heartbeat_in_pthread=true with nova-compute, to the point that I would consider that config unsupported.
We also don't recommend using it with nova-api, even when running via a WSGI server such as mod_wsgi or uwsgi.
The only thing this has ever done is remove a cosmetic warning in the rabbit/nova logs due to the heartbeat timing out. It has never fixed any functional bug that we were aware of, but it has resulted in several real bugs.
The most recent one we hit was https://launchpad.net/bugs/1983863 which was mitigated by https://review.opendev.org/c/openstack/oslo.log/+/852443 however that fix uses an unsafe debug option in eventlet, eventlet.debug.hub_prevent_multiple_readers(False)
While you may be able to make heartbeat_in_pthread work with a lot of effort, as Takashi noted this will eventually go away when we remove eventlet. To enable that removal we need to replace the PipeMutex that currently fixes logging in a native thread, so heartbeat_in_pthread is part of the technical debt we need to remove to eventually allow us to move away from eventlet entirely.
On Tue, 2024-10-01 at 09:13 +0000, Arnaud Morin wrote:
Yes, I agree that it used to be broken, but since the bug was reported, we merged the following fixes:
https://review.opendev.org/c/openstack/oslo.messaging/+/894731 https://review.opendev.org/c/openstack/oslo.messaging/+/875615 https://review.opendev.org/c/openstack/oslo.messaging/+/876318
That's why I believe everything should be fine now :)
On 01.10.24 - 17:20, Takashi Kajinami wrote:
I was too quick to push the Send button.
It's still interesting to see that you enabled the feature for eventlet services, such as nova-compute. In the past we got a few bugs caused by that feature, which eventually made us revert the default value to False. https://bugs.launchpad.net/oslo.messaging/+bug/1934937 https://bugs.launchpad.net/oslo.messaging/+bug/1949964
You might need to check if the reported problem is reproduced in your env.
On 10/1/24 17:15, Takashi Kajinami wrote:
Enabling heartbeat_in_pthread is known to break services using eventlet, so it SHOULD NOT be enabled by default. We tried to enable it by default in the past but eventually reverted it after seeing multiple problems.
You can selectively enable it for services not using eventlet (API services run by httpd + mod_wsgi or uwsgi) but should keep it False for the other services.
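To make the per-service split above concrete, here is a minimal sketch in Python, not anything shipped by oslo.messaging. It assumes the option lives in the [oslo_messaging_rabbit] section of each service's config file. The EVENTLET_SERVICES set and the helper name heartbeat_in_pthread_is_safe are hypothetical and only reflect the examples named in this thread (nova-compute, neutron agents). It uses the stdlib configparser rather than oslo.config, so it will not handle everything a real oslo.config file may contain.

```python
import configparser

# Hypothetical classification based on the examples in this thread;
# adjust it to match your own deployment.
EVENTLET_SERVICES = {"nova-compute", "neutron-openvswitch-agent"}

# Example config for an API service run under httpd + mod_wsgi or uwsgi.
WSGI_CONF = """
[oslo_messaging_rabbit]
heartbeat_in_pthread = true
"""

# Example config for an eventlet-based agent: keep the option False.
AGENT_CONF = """
[oslo_messaging_rabbit]
heartbeat_in_pthread = false
"""


def heartbeat_in_pthread_is_safe(service: str, conf_text: str) -> bool:
    """Return False when an eventlet service enables heartbeat_in_pthread."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # A missing section or option is treated as False here.
    enabled = parser.getboolean(
        "oslo_messaging_rabbit", "heartbeat_in_pthread", fallback=False
    )
    return not (enabled and service in EVENTLET_SERVICES)


if __name__ == "__main__":
    for name, text in [("nova-api", WSGI_CONF), ("nova-compute", WSGI_CONF)]:
        verdict = "ok" if heartbeat_in_pthread_is_safe(name, text) else "risky"
        print(f"{name}: {verdict}")
```

A check like this could run against each rendered config before deployment, so that a blanket "true for all services" change is caught before it reaches the eventlet agents.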
Once we get rid of eventlet, the heartbeat will no longer run in an eventlet thread, so we will no longer need that option (the behavior would be equivalent to heartbeat_in_pthread=True). But until that point we can't change the default, unless someone is willing to dig into the past problems to make the feature work completely with eventlet (which I don't think is worth the effort at this stage).
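For readers less familiar with why the heartbeat behaves differently in a cooperative thread and in a native thread, here is a toy illustration. It is only a sketch under stated assumptions: asyncio stands in for eventlet's cooperative scheduling, the intervals are arbitrary, and none of this is oslo.messaging code. It also does not model the eventlet monkey-patching conflicts that caused the bugs discussed in this thread; it only shows that a cooperative heartbeat stalls while the service blocks, whereas a native-thread heartbeat keeps ticking.

```python
import asyncio
import threading
import time

INTERVAL = 0.05  # heartbeat period in seconds, deliberately tiny for the demo
BLOCK = 0.5      # how long the "service" blocks without yielding


def max_gap(beats: list) -> float:
    """Largest delay between two consecutive heartbeats."""
    return max(b - a for a, b in zip(beats, beats[1:]))


async def cooperative_heartbeat(beats: list) -> None:
    # Runs only when the event loop gets control back.
    while True:
        beats.append(time.monotonic())
        await asyncio.sleep(INTERVAL)


async def run_cooperative() -> float:
    beats = []
    task = asyncio.create_task(cooperative_heartbeat(beats))
    await asyncio.sleep(INTERVAL * 2)  # let a few beats through
    time.sleep(BLOCK)                  # blocking call: the loop is stuck
    await asyncio.sleep(INTERVAL * 2)  # let the heartbeat resume
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max_gap(beats)


def run_native_thread() -> float:
    beats = []
    stop = threading.Event()

    def heartbeat() -> None:
        while not stop.is_set():
            beats.append(time.monotonic())
            stop.wait(INTERVAL)

    worker = threading.Thread(target=heartbeat, daemon=True)
    worker.start()
    time.sleep(INTERVAL * 2)
    time.sleep(BLOCK)  # main thread blocks; the heartbeat thread continues
    time.sleep(INTERVAL * 2)
    stop.set()
    worker.join()
    return max_gap(beats)


if __name__ == "__main__":
    print("cooperative max gap:", asyncio.run(run_cooperative()))
    print("native thread max gap:", run_native_thread())
```

In the cooperative case the largest gap is at least as long as the blocking call, which is the kind of missed heartbeat that leads RabbitMQ to drop the connection. In the native-thread case the gap should stay close to the heartbeat interval.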
On 10/1/24 16:34, Arnaud Morin wrote:
Hello,
I completely missed the deprecation of heartbeat_in_pthread in oslo.messaging [1].
We heavily rely on this parameter downstream and our opinion is that it should be set to True by default. We use it for both wsgi services and agents (nova-compute, neutron agents, etc.).
I understand that eventlet will be dropped in the future, but should we set heartbeat_in_pthread to True by default until then?
Regards,
Arnaud.
[1] https://review.opendev.org/c/openstack/oslo.messaging/+/925778