Hi Sean,
Not sure why we misunderstood each other, but we agree! I understood your
sentence as "people should avoid changing this option from its default
(False) to True", but I understand now that you mean the opposite, and I
totally agree based on our experience. Heat seems to be another service
that will be in trouble if it is changed.
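
For anyone who wants to check their own config files, here is a minimal
sketch of how the setting could be detected. It assumes the option lives in
the [oslo_messaging_rabbit] section and that an unset option means False.
The sample config text, the timeout value of 120 and the function name are
made up for illustration, and Python's configparser only approximates how
oslo.config reads these files.

```python
import configparser

# Hypothetical sample; in practice, read the service's real file (e.g. heat.conf).
SAMPLE_CONF = """
[DEFAULT]
debug = false

[oslo_messaging_rabbit]
heartbeat_timeout_threshold = 120
heartbeat_in_pthread = true
"""


def heartbeat_in_pthread_enabled(conf_text):
    """Return True if [oslo_messaging_rabbit]/heartbeat_in_pthread is true."""
    parser = configparser.ConfigParser(
        strict=False,              # tolerate repeated keys and sections
        interpolation=None,        # leave '%' characters in values untouched
        default_section="__none__",  # stop [DEFAULT] leaking into other sections
    )
    parser.read_string(conf_text)
    # Assumption: the option is False when it is not set at all.
    return parser.getboolean(
        "oslo_messaging_rabbit", "heartbeat_in_pthread", fallback=False
    )


if __name__ == "__main__":
    print(heartbeat_in_pthread_enabled(SAMPLE_CONF))
```

Running the same check against each service's config file would list the
services where the option still needs to be reverted.
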
Michel
On 02/10/2024 at 14:17, smooney@redhat.com wrote:
> On Wed, 2024-10-02 at 13:06 +0200, Michel Jouvin wrote:
>> Hi Sean,
>>
>> As for the situation in our cloud after reverting to
>> heartbeat_in_pthread=false, it was a "black and white situation":
>> creating a cluster was impossible (because of the Heat issue mentioned)
>> since we changed to heartbeat_in_pthread=true (but we didn't realize
>> immediately as we don't create clusters every day) and started working
>> properly again immediately after reverting to heartbeat_in_pthread=false.
>> There is a clear link between this parameter and Heat behaviour (Caracal
>> version in our case, so Oslo client 24.0).
>>
>> As for your last sentence "people should avoid changing this option from
>> its default of False.", I think/hope you wanted to say the opposite:
>> "people should avoid changing this option from its default to True."...
> no, at least for nova we default to false in our downstream product
>
> https://github.com/openstack-k8s-operators/nova-operator/blob/main/templates/nova.conf#L62-L69
>
> we had significant ci issues when we had it set to true originally because of the oslo.log issue
> but we did not revert to enabling this for nova-api after the fix was backported because
> we have not seen any side effect from setting it to false.
>
> we have never set this to true in OSP to my knowledge for nova-compute
>
> puppet-nova considered it experimental
> https://opendev.org/openstack/puppet-nova/src/commit/17bd61e042591305e461e5c9c29ecf250d7b9936/manifests/init.pp#L60-L68
>
> in tripleo we disabled it in many services including heat and nova
> https://github.com/openstack-archive/tripleo-heat-templates/commit/cf4d4f881a1bf7011a3eae604eb83c8900f1b1a4
>
> in kolla it also defaults to false for nova-compute and other eventlet services
> https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f7131b08033d2ff/ansible/roles/nova-cell/templates/nova.conf.j2#L193
> although it is enabled for nova-api and some heat components
> https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f7131b08033d2ff/ansible/roles/nova/templates/nova.conf.j2#L142
> https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f7131b08033d2ff/ansible/roles/heat/templates/heat.conf.j2#L75
>
> at least for nova i would not recommend using
>
> heartbeat_in_pthread = True for any service
> nova-api running under uwsgi or mod_wsgi is the only possible exception and even then i would discourage it.
>
> i can't really speak to other services but i think `heartbeat_in_pthread = false` is generally the correct default.
>
>
>> Michel
>>
>> On 02/10/2024 at 11:44, smooney@redhat.com wrote:
>>> On Tue, 2024-10-01 at 22:32 +0200, Michel Jouvin wrote:
>>>> Hi,
>>>>
>>>> I am not an expert in these matters but we recently suffered the problem
>>>> of client disconnection in RabbitMQ due to the heartbeat timeout and I
>>>> confirm it was a disaster for the cloud usage with many things not
>>>> working properly (we are running Antelope, except Barbican/Heat/Magnum
>>>> where we run Caracal). The reason is still not clear to me; it was
>>>> fixed by increasing the heartbeat timeout but at the same time, my
>>>> colleague who implemented the change also defined
>>>> heartbeat_in_pthread=true for all the services, something normally
>>>> unnecessary as we configure uwsgi or Apache to use only one thread (and
>>>> several processes). Initially we didn't see any bad impact of this
>>>> setting but a few days ago users started to report that Magnum cluster
>>>> creation was failing due to a "response timeout" in Heat during the master
>>>> software deployment.
>>>>
>>>> Reading this thread this morning, I had the idea it could be the source
>>>> of the problem (as the service was running properly a couple of weeks
>>>> ago, before the change). We reverted the change and defined
>>>> heartbeat_in_pthread=false and it restored the normal behaviour of Heat.
>>>> We have not seen a negative impact on other services so far. So I
>>>> confirm that setting this parameter to false by default seems a good
>>>> idea and that setting it to true can break some services like Heat.
>>> thank you for the data point, i'm sure you will monitor the situation in your
>>> cloud but please let us know in a week or two if the heat/magnum issues you
>>> observed return or if the cloud continues to function normally,
>>> i expect it to but again it would be a good data point.
>>>> Cheers,
>>>>
>>>> Michel
>>>>
>>>> On 01/10/2024 at 16:31, Arnaud Morin wrote:
>>>>> Hey,
>>>>>
>>>>> I totally agree about the fact that heartbeat_in_pthread and the
>>>>> oslo.log PipeMutex are technical debt that we need to get rid of,
>>>>> as well as eventlet.
>>>>>
>>>>> However, despite the fact that it seems purely cosmetic on your side,
>>>>> we believe it's not.
>>>>> I can't prove / reproduce the issue on a small infra, but definitely,
>>>>> at large scale, having those tcp connections dropped by rabbitmq
>>>>> and recreated in a loop by agents is affecting the cluster.
>>>>>
>>>>> I know all the pain that these settings introduced in the past, but now
>>>>> I feel we are in a stable situation regarding this, that's why I am
>>>>> surprised about deprecating heartbeat_in_pthread now.
>>> deprecating a config option requires the deprecation to be advertised in a slurp
>>> before it can then be removed in a following release.
>>> Given the deprecation was done in Dalmatian 2024.2 which is not a slurp release the removal
>>> cannot take effect in 2025.1, 2025.2 is the earliest release we could remove this option.
>>>
>>> as a result i think maintaining the deprecation is correct here.
>>> we may decide not to remove this until 2026.1 or later but i think it's correct to send the
>>> message that people should avoid changing this option from its default of False.
>>> we could even tag this option as advanced to make that more clear
>>> https://docs.openstack.org/oslo.config/latest/reference/defining.html#advanced-option
>>>
>>>>> Can we, at least, make sure we keep all of this until we switch off
>>>>> eventlet?
>>>>> In other words, can we get rid of eventlet, then remove this param?
>>>>> and not the opposite?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Arnaud