Re: [oslo.messaging] Heartbeat in pthread

1 Oct 2024

Hi,

I am not an expert in these matters, but we recently suffered from
RabbitMQ client disconnections caused by the heartbeat timeout, and I
can confirm it was a disaster for cloud usage, with many things not
working properly (we are running Antelope, except for Barbican/Heat/Magnum,
where we run Caracal). The cause is still not clear to me. It was
fixed by increasing the heartbeat timeout, but at the same time my
colleague who implemented the change also set
heartbeat_in_pthread=true for all the services, something normally
unnecessary as we configure uwsgi or Apache to use only one thread (and
several processes). Initially we didn't see any bad impact from this
setting, but a few days ago users started to report that Magnum cluster
creation was failing due to a "response timeout" in Heat during the master
software deployment.

Reading this thread this morning, it occurred to me that this setting
could be the source of the problem (as the service was running properly
a couple of weeks ago, before the change). We reverted the change and set
heartbeat_in_pthread=false, and it restored the normal behaviour of Heat.
We have not seen a negative impact on other services so far. So I
can confirm that setting this parameter to false by default seems a good
idea and that setting it to true can break some services like Heat.
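
For reference, both settings mentioned above are options of the [oslo_messaging_rabbit] section of each service's configuration file (heartbeat_timeout_threshold and heartbeat_in_pthread). The following is a minimal Python sketch that reads them from an INI-style fragment using only the standard library. It is an approximation: the services themselves parse their files with oslo.config, the helper name read_heartbeat_settings is hypothetical, and the threshold value of 120 is a placeholder, not the value used on this cloud.

```python
import configparser

# Illustrative fragment of a service config file (for example heat.conf).
# The threshold value is a placeholder chosen for this sketch only.
SAMPLE_CONF = """
[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 120
heartbeat_in_pthread = false
"""


def read_heartbeat_settings(text):
    """Return (timeout_threshold, in_pthread) from an INI-style config string."""
    # interpolation=None avoids choking on '%' characters in other options;
    # strict=False tolerates repeated keys, which some config files contain.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(text)
    section = parser["oslo_messaging_rabbit"]
    threshold = section.getint("heartbeat_timeout_threshold")
    in_pthread = section.getboolean("heartbeat_in_pthread")
    return threshold, in_pthread


threshold, in_pthread = read_heartbeat_settings(SAMPLE_CONF)
print(f"heartbeat_timeout_threshold={threshold} heartbeat_in_pthread={in_pthread}")
```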

Cheers,

Michel

On 01/10/2024 at 16:31, Arnaud Morin wrote:
...
Hey,
I totally agree about the fact that heartbeat_in_pthread and the
oslo.log PipeMutex are technical debt that we need to get rid of,
as well as eventlet.
However, despite the fact that it seems purely cosmetic on your side,
we believe it's not.
I can't prove / reproduce the issue on a small infra, but definitely,
at large scale, having those TCP connections dropped by RabbitMQ
and recreated in a loop by agents is affecting the cluster.
I know all the pain that these settings introduced in the past, but now
I feel we are in a stable situation regarding this, that's why I am
surprised about deprecating heartbeat_in_pthread now.
Can we, at least, make sure we keep all of this until we switch off
eventlet?
In other words, can we get rid of eventlet first, then remove these params,
and not the opposite?
Regards,
Arnaud
On 01.10.24 - 11:38, smooney@redhat.com wrote:
...
I'm glad you managed to make it work, but from a nova perspective we
do not recommend using heartbeat_in_pthread=true with nova-compute, to the
point that I would consider that config unsupported.
We also don't recommend using it with nova-api, even when running via a WSGI server such as mod_wsgi
or uwsgi.
The only thing this has ever done is remove a cosmetic warning in the rabbit/nova logs
due to the heartbeat timing out. This has never fixed any functional bug that
we were aware of, but it has resulted in several real bugs.
The most recent we hit was https://launchpad.net/bugs/1983863 which was mitigated by
https://review.opendev.org/c/openstack/oslo.log/+/852443 however that uses an unsafe debug
option in eventlet: eventlet.debug.hub_prevent_multiple_readers(False)
While you may be able to make heartbeat_in_pthread work with a lot of work,
as Takashi noted this will eventually go away when we remove eventlet, and to enable that removal
we need to replace the PipeMutex that currently fixes logging in a native thread. So
heartbeat_in_pthread is part of the technical debt we need to remove to eventually allow
us to move away from eventlet entirely.
On Tue, 2024-10-01 at 09:13 +0000, Arnaud Morin wrote:
...
Yes, I agree that it used to be broken, but since the bug was reported,
we merged the following fixes:
https://review.opendev.org/c/openstack/oslo.messaging/+/894731
https://review.opendev.org/c/openstack/oslo.messaging/+/875615
https://review.opendev.org/c/openstack/oslo.messaging/+/876318
That's why I believe everything should be fine now :)
On 01.10.24 - 17:20, Takashi Kajinami wrote:
...
I was too fast to push the Send button.
It's still interesting to see that you enabled the feature for eventlet services,
such as nova-compute. In the past we got a few bugs caused by that feature,
which made us eventually revert the default value to False.
  https://bugs.launchpad.net/oslo.messaging/+bug/1934937
  https://bugs.launchpad.net/oslo.messaging/+bug/1949964
You might need to check if the reported problem is reproduced in your env.
On 10/1/24 17:15, Takashi Kajinami wrote:
...
Enabling heartbeat_in_pthread is known to break services using eventlet,
so it SHOULD NOT be enabled by default. We tried to enable it by default
in the past but eventually reverted it after seeing multiple problems.
You can selectively enable it for services not using eventlet (API
services run by httpd + mod_wsgi or uwsgi) but should keep it False for
the other services.
Once we get rid of eventlet, we will no longer use an eventlet thread for
the heartbeat, so we will no longer need that option (because the behavior would
be equivalent to that with heartbeat_in_pthread=True). But until that point
we can't change the default, unless someone is willing to dig into
the past problems to make the feature work completely with eventlet (which
I don't think is worth the effort at this stage).
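
The rule of thumb above (keep the option False for eventlet services, and only consider it for services run under a WSGI server) can be sketched as a small audit. This is an illustrative Python sketch, not part of oslo.messaging: the function names, the inventory, and the "nova-api-wsgi" label are hypothetical, and the files are read with the standard library's configparser rather than oslo.config.

```python
import configparser


def heartbeat_in_pthread_enabled(conf_text):
    """Return True if the INI text sets heartbeat_in_pthread to a true value."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    # fallback=False treats a missing section or option as "not enabled".
    return parser.getboolean(
        "oslo_messaging_rabbit", "heartbeat_in_pthread", fallback=False
    )


def risky_services(services):
    """Return the names of eventlet services that enable heartbeat_in_pthread.

    `services` maps a service name to a (uses_eventlet, conf_text) pair.
    """
    return sorted(
        name
        for name, (uses_eventlet, conf_text) in services.items()
        if uses_eventlet and heartbeat_in_pthread_enabled(conf_text)
    )


ENABLED = "[oslo_messaging_rabbit]\nheartbeat_in_pthread = true\n"
DISABLED = "[oslo_messaging_rabbit]\nheartbeat_in_pthread = false\n"

# Hypothetical inventory; the eventlet/WSGI classification must be checked
# against the actual deployment.
inventory = {
    "nova-compute": (True, ENABLED),    # eventlet service, option on: flagged
    "neutron-agent": (True, DISABLED),  # eventlet service, option off: fine
    "nova-api-wsgi": (False, ENABLED),  # WSGI service: not flagged by this rule
}

print(risky_services(inventory))
```

Note that this only encodes the eventlet rule; as mentioned elsewhere in this thread, some projects do not recommend the option even under a WSGI server.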
On 10/1/24 16:34, Arnaud Morin wrote:
...
Hello,
I completely missed the deprecation of heartbeat_in_pthread in
oslo.messaging [1].
We heavily rely on this parameter downstream and our opinion is that it
should be set to True by default. We use it for both wsgi services and
agents (nova-compute, neutron agents, etc.).
I understand that eventlet will be dropped in the future, but should we
set heartbeat_in_pthread to True by default until then?
Regards,
Arnaud.
[1] https://review.opendev.org/c/openstack/oslo.messaging/+/925778

Michel Jouvin