Here is the updated version with the **retry backoff** solution and the additional links:

Hey folks,

The main concern behind this thread is that RabbitMQ connections drop due to the absence of a reliable heartbeat.
While `heartbeat_in_pthread=True` aimed to fix this, it introduced other bugs.

Indeed, the greenlet documentation is pretty clear: the limitations between Python threads and greenlets lead to issues.
Since eventlet is itself based on greenlet, this leads to recurring issues in our stacks.
The `heartbeat_in_pthread` bugs are living examples of this kind of issue.

For this reason, we support keeping `heartbeat_in_pthread` disabled by default.

As a workaround, adjusting the RabbitMQ server's heartbeat timeout and oslo.messaging's `heartbeat_timeout_threshold` can mitigate connection drops.
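For illustration, these two knobs live on the RabbitMQ server side and in the service's oslo.messaging configuration respectively. A minimal sketch (the values below are placeholders for illustration, not recommendations):

```ini
# rabbitmq.conf (server side): heartbeat interval negotiated with clients, in seconds
heartbeat = 60
```

```ini
# service configuration (oslo.messaging side)
[oslo_messaging_rabbit]
# seconds of missed heartbeats after which the connection is considered dead
heartbeat_timeout_threshold = 60
```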

Additionally, oslo.messaging offers the `connection_retry_interval` and `connection_retry_backoff` parameters,
which implement a retry backoff strategy to better handle connection drops.
This ensures that the system can manage reconnections more efficiently.
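As an illustrative sketch only (my reading of those options: each unsuccessful attempt increases the wait by `connection_retry_backoff` seconds, starting from `connection_retry_interval` and capped at a maximum; the cap and attempt count below are hypothetical):

```python
def retry_delays(retry_interval=1, retry_backoff=2, interval_max=30, attempts=6):
    """Successive delays (in seconds) between reconnection attempts,
    mimicking an additive retry backoff strategy."""
    delays = []
    interval = retry_interval
    for _ in range(attempts):
        delays.append(min(interval, interval_max))  # never wait longer than the cap
        interval += retry_backoff  # additive backoff after each failed attempt
    return delays

# delays grow linearly until the cap: 1, 3, 5, 7, 9, 11
print(retry_delays())
```

Spacing out reconnections this way avoids the tight reconnect loop that hammers RabbitMQ when many agents lose their connection at once.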

We encourage investigating these paths to mitigate the connection problems.

For more details please read:
- https://greenlet.readthedocs.io/en/latest/python_threads.html
- https://docs.openstack.org/oslo.messaging/xena/configuration/opts.html#oslo_messaging_rabbit.heartbeat_timeout_threshold
- https://www.rabbitmq.com/docs/heartbeats
- https://docs.openstack.org/oslo.messaging/xena/configuration/opts.html#oslo_messaging_amqp.connection_retry_backoff
- https://docs.openstack.org/oslo.messaging/xena/configuration/opts.html#oslo_messaging_amqp.connection_retry_interval


On Thu, Oct 3, 2024 at 22:46, Michel Jouvin <michel.jouvin@ijclab.in2p3.fr> wrote:
Hi Sean,

Not sure why we misunderstood each other, but we agree! I understood your
sentence as "people should avoid changing this option from its default
(False) to True." but I now understand you mean the opposite, and I
totally agree based on our experience. Heat seems to be another service
that will be in trouble if it is changed.

Michel

On 02/10/2024 at 14:17, smooney@redhat.com wrote:
> On Wed, 2024-10-02 at 13:06 +0200, Michel Jouvin wrote:
>> Hi Sean,
>>
>> As for the situation in our cloud after reverting to
>> heartbeat_in_pthread=false, it was a "black and white situation":
>> creating a cluster was impossible (because of the Heat issue mentioned)
>> since we changed to heartbeat_in_pthread=true (but we didn't realize
>> immediately as we don't create clusters everyday) and restarted to work
>> properly immediately after reverting to heartbeat_in_pthread=false.
>> There is a clear link between this parameter and Heat behaviour (Caracal
>> version in our case, so Oslo client 24.0).
>>
>> As for your last sentence "people should avoid changing this option from
>> its default of False.", I think/hope you wanted to say the opposite:
>> "people should avoid changing this option from its default to True."...
> no, at least for nova we default to false in our downstream product
>
> https://github.com/openstack-k8s-operators/nova-operator/blob/main/templates/nova.conf#L62-L69
>
> we had significant ci issues when we had it set to true originally because of the oslo.log issue
> but we did not revert to enabling this for nova-api after that was backported because
> we have not seen any side effects from setting it to false.
>
> we have never set this to true in OSP to my knowledge for nova-compute
>
> puppet-nova considered it experimental
> https://opendev.org/openstack/puppet-nova/src/commit/17bd61e042591305e461e5c9c29ecf250d7b9936/manifests/init.pp#L60-L68
>
> in tripleo we disabled it in many services including heat and nova
> https://github.com/openstack-archive/tripleo-heat-templates/commit/cf4d4f881a1bf7011a3eae604eb83c8900f1b1a4
>
> in kolla it also defaults to false for nova-compute and other eventlet services
> https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f7131b08033d2ff/ansible/roles/nova-cell/templates/nova.conf.j2#L193
> although it is enabled for nova-api and some heat components
> https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f7131b08033d2ff/ansible/roles/nova/templates/nova.conf.j2#L142
> https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f7131b08033d2ff/ansible/roles/heat/templates/heat.conf.j2#L75
>
> at least for nova i would not recommend using
>
> heartbeat_in_pthread = True for any service.
> nova-api running under uwsgi or mod_wsgi is the only possible exception, and even then i would discourage it.
>
> i cant really speak to other services but i think `heartbeat_in_pthread = false` is generally the correct default.
>
>
>> Michel
>>
>> On 02/10/2024 at 11:44, smooney@redhat.com wrote:
>>> On Tue, 2024-10-01 at 22:32 +0200, Michel Jouvin wrote:
>>>> Hi,
>>>>
>>>> I am not an expert in these matters but we recently suffered the problem
>>>> of client disconnection in RabbitMQ due to the heartbeat timeout and I
>>>> confirm it was a disaster for the cloud usage with many things not
>>>> working properly (we are running Antelope, except Barbican/Heat/Magnum
>>>> where we run Caracal). The reason is still not clear for me, it was
>>>> fixed by increasing the heartbeat timeout but at the same time, my
>>>> colleague who implemented the change also defined
>>>> heartbeat_in_pthread=true for all the services, something normally
>>>> unnecessary as we configure uwsgi or Apache to use only one thread (and
>>>> several processes). Initially we didn't see any bad impact of this
>>>> setting but a few days ago users started to report that Magnum cluster
>>>> creation was failing due to a "response timeout" in Heat during the master
>>>> software deployment.
>>>>
>>>> Reading this thread this morning, I had the idea that it could be the source
>>>> of the problem (as the service was running properly a couple of weeks
>>>> ago, before the change). We reverted the change and defined
>>>> heartbeat_in_pthread=false and it restored the normal behaviour of Heat.
>>>> We have not seen a negative impact on other services so far. So I
>>>> confirm that setting this parameter to false by default seems a good
>>>> idea and that setting it to true can break some services like Heat.
>>> thank you for the data point. im sure you will monitor the situation in your
>>> cloud, but please let us know in a week or two if the heat/magnum issues you
>>> observed return or if the cloud continues to function normally.
>>> i expect it to, but again it would be a good data point.
>>>> Cheers,
>>>>
>>>> Michel
>>>>
>>>> On 01/10/2024 at 16:31, Arnaud Morin wrote:
>>>>> Hey,
>>>>>
>>>>> I totally agree about the fact that heartbeat_in_pthread and the
>>>>> oslo.log PipeMutex are technical debt that we need to get rid of,
>>>>> as well as eventlet.
>>>>>
>>>>> However, despite the fact that it seems purely cosmetic on your side,
>>>>> we believe it's not.
>>>>> I can't prove / reproduce the issue on a small infra, but definitely,
>>>>> at large scale, having those tcp connections to be dropped by rabbitmq
>>>>> and recreated in a loop by agents is affecting the cluster.
>>>>>
>>>>> I know all the pain that these settings introduced in the past, but now
>>>>> I feel we are in a stable situation regarding this, that's why I am
>>>>> surprised about deprecating heartbeat_in_pthread now.
>>> deprecating a config option requires the deprecation to be advertised in a SLURP release
>>> before it can then be removed in a following release.
>>> given the deprecation was done in dalmatian 2024.2, which is not a SLURP release, the removal
>>> cannot take effect in 2025.1; 2025.2 is the earliest release we could remove this option.
>>>
>>> as a result i think maintaining the deprecation is correct here.
>>> we may decide not to remove this until 2026.1 or later, but i think it's correct to send the
>>> message that people should avoid changing this option from its default of False.
>>> we could even tag this option as advanced to make that more clear
>>> https://docs.openstack.org/oslo.config/latest/reference/defining.html#advanced-option
>>>
>>>>> Can we, at least, make sure we keep all of this until we switch off
>>>>> eventlet?
>>>>> In other words, can we get rid of eventlet, then remove these params?
>>>>> and not the opposite?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Arnaud



--
Hervé Beraud
Senior Software Engineer at Red Hat
irc: hberaud