Re: [oslo.messaging] Heartbeat in pthread

2 Oct 2024

      On Tue, 2024-10-01 at 22:32 +0200, Michel Jouvin wrote:
...
Hi,
I am not an expert in these matters but we recently suffered the problem 
of client disconnection in RabbitMQ due to the heartbeat timeout and I 
confirm it was a disaster for the cloud usage with many things not 
working properly (we are running Antelope, except Barbican/Heat/Magnum 
where we run Caracal). The reason is still not clear to me; it was 
fixed by increasing the heartbeat timeout but at the same time, my 
colleague who implemented the change also defined 
heartbeat_in_pthread=true for all the services, something normally 
unnecessary as we configure uwsgi or Apache to use only one thread (and 
several processes). Initially we didn't see any bad impact of this 
setting but a few days ago users started to report that Magnum cluster 
creation was failing due to a "response timeout" in Heat during the master 
software deployment.
Reading this thread this morning, I had the idea that it could be the source 
of the problem (as the service was running properly a couple of weeks 
ago, before the change). We reverted the change and defined 
heartbeat_in_pthread=false and it restored the normal behaviour of Heat. 
We have not seen a negative impact on other services so far. So I 
confirm that setting this parameter to false by default seems a good 
idea and that setting it to true can break some services like Heat.
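A minimal sketch of how a deployment could check whether a service config explicitly enables the option discussed above. It assumes the option lives in the [oslo_messaging_rabbit] section; the sample text, the 120 second threshold, and the helper name are hypothetical and only illustrate the idea.

```python
import configparser

# Hypothetical sample of a service config file; values are illustrative only.
SAMPLE = """
[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 120
heartbeat_in_pthread = true
"""


def heartbeat_in_pthread_enabled(text):
    """Return True if the config text explicitly enables heartbeat_in_pthread."""
    # strict=False tolerates repeated keys, which service configs may contain;
    # interpolation=None keeps '%' characters in values from being interpreted.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(text)
    # A missing section or option falls back to False, the default
    # mentioned in this thread.
    return parser.getboolean(
        "oslo_messaging_rabbit", "heartbeat_in_pthread", fallback=False
    )


if __name__ == "__main__":
    print(heartbeat_in_pthread_enabled(SAMPLE))
```

Running such a check against each service's config file would flag the services where the option was switched on and may need to be reverted.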
Thank you for the data point. I'm sure you will monitor the situation in your
cloud, but please let us know in a week or two if the Heat/Magnum issues you
observed return or if the cloud continues to function normally.
I expect it to, but again it would be a good data point.
...
Cheers,
Michel
Le 01/10/2024 à 16:31, Arnaud Morin a écrit :
...
Hey,
I totally agree about the fact that heartbeat_in_pthread and the
oslo.log PipeMutex are technical debt that we need to get rid of,
as well as eventlet.
However, despite the fact that it seems purely cosmetic on your side,
we believe it's not.
I can't prove / reproduce the issue on a small infra, but definitely,
at large scale, having those tcp connections to be dropped by rabbitmq
and recreated in a loop by agents is affecting the cluster.
I know all the pain that these settings introduced in the past, but now
I feel we are in a stable situation regarding this, that's why I am
surprised about deprecating heartbeat_in_pthread now.
Deprecating a config option requires the deprecation to be advertised in a SLURP
release before it can be removed in a following release.
Given the deprecation was done in Dalmatian (2024.2), which is not a SLURP release, the removal
cannot take effect in 2025.1; 2025.2 is the earliest release in which we could remove this option.
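
The release arithmetic above can be sketched as follows. This is only one reading of the rule as stated in this message, assuming SLURP releases are the YYYY.1 releases; the function names are hypothetical.

```python
def is_slurp(release):
    """Assumption: SLURP releases are the first release of each year (YYYY.1)."""
    _year, half = release
    return half == 1


def next_release(release):
    """Return the release that follows (year, half) in the six-month cadence."""
    year, half = release
    return (year, 2) if half == 1 else (year + 1, 1)


def earliest_removal(deprecated_in):
    """Earliest release in which a deprecated option could be removed.

    The deprecation has to ship in a SLURP release first, so walk forward
    to the first SLURP release at or after the deprecation, then take the
    release that follows it.
    """
    release = deprecated_in
    while not is_slurp(release):
        release = next_release(release)
    return next_release(release)


if __name__ == "__main__":
    # Deprecated in 2024.2 (Dalmatian), which is not a SLURP release.
    print(earliest_removal((2024, 2)))  # (2025, 2)
```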

As a result, I think maintaining the deprecation is correct here.
We may decide not to remove this until 2026.1 or later, but I think it's correct to send the
message that people should avoid changing this option from its default of False.
We could even tag this option as advanced to make that more clear:
https://docs.openstack.org/oslo.config/latest/reference/defining.html#advanc...
...
...
Can we, at least, make sure we keep all of this until we switch off
eventlet?
In other words, can we get rid of eventlet, then remove these params,
and not the opposite?
Regards,
Arnaud

smooney＠redhat.com