[oslo.messaging] Heartbeat in pthread
Hello,
I completely missed the deprecation of heartbeat_in_pthread in oslo.messaging [1].
We rely heavily on this parameter downstream, and our opinion is that it should be set to True by default. We use it for both WSGI services and agents (nova-compute, neutron agents, etc.).
I understand that eventlet will be dropped in the future, but should we set heartbeat_in_pthread to True by default until then?
Regards,
Arnaud.
[1] https://review.opendev.org/c/openstack/oslo.messaging/+/925778
In reply to Arnaud Morin (10/1/24 16:34):
Setting heartbeat_in_pthread to True is known to break services that use eventlet, so it SHOULD NOT be enabled by default. We tried enabling it by default in the past but eventually reverted that change after seeing multiple problems.
You can selectively enable it for services that do not use eventlet (API services run by httpd + mod_wsgi or uwsgi), but you should keep it False for the other services.
Once we get rid of eventlet, we will no longer use an eventlet thread for the heartbeat, so we will no longer need the option (the behavior would be equivalent to heartbeat_in_pthread=True). Until then we can't change the default, unless someone is willing to dig into the past problems and make the feature work completely with eventlet (which I don't think is worth the effort at this stage).
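As a rough illustration of the advice above (keep the option False for eventlet services), here is a minimal Python sketch that checks a service's configuration text. The option name comes from this thread, and `[oslo_messaging_rabbit]` is the section that holds the RabbitMQ driver options. Everything else is an assumption made for the example: the set of eventlet service names, the helper names, and the sample configuration are hypothetical and not part of oslo.messaging.

```python
import configparser

# Assumption for illustration only: services taken to run under eventlet.
EVENTLET_SERVICES = {"nova-compute", "neutron-openvswitch-agent"}


def heartbeat_in_pthread_enabled(config_text):
    """Return True if the config text sets heartbeat_in_pthread to a true value."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(config_text)
    # A missing section or option counts as the default, False.
    return parser.getboolean(
        "oslo_messaging_rabbit", "heartbeat_in_pthread", fallback=False
    )


def check_heartbeat_setting(service, config_text):
    """Return a warning string if an eventlet service enables the option, else None."""
    if service in EVENTLET_SERVICES and heartbeat_in_pthread_enabled(config_text):
        return f"{service}: heartbeat_in_pthread=True is discouraged with eventlet"
    return None


# Hypothetical sample configuration.
SAMPLE = """
[oslo_messaging_rabbit]
heartbeat_in_pthread = true
"""

print(check_heartbeat_setting("nova-compute", SAMPLE))
print(check_heartbeat_setting("nova-api", SAMPLE))
```

The sketch only flags services listed as eventlet-based. Whether enabling the option is a good idea for a WSGI-hosted API service is a separate judgment that the check does not make.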
In reply to Takashi Kajinami (10/1/24 17:15):
I was too quick to hit Send.
It's still interesting to see that you enabled the feature for eventlet services such as nova-compute. In the past we got a few bugs caused by that feature, which eventually made us revert the default value to False:
https://bugs.launchpad.net/oslo.messaging/+bug/1934937
https://bugs.launchpad.net/oslo.messaging/+bug/1949964
You might want to check whether the reported problems reproduce in your environment.
In reply to Takashi Kajinami (01.10.24 - 17:20):
Yes, I agree that it used to be broken, but since the bugs were reported, we have merged the following fixes:
https://review.opendev.org/c/openstack/oslo.messaging/+/894731
https://review.opendev.org/c/openstack/oslo.messaging/+/875615
https://review.opendev.org/c/openstack/oslo.messaging/+/876318
That's why I believe everything should be fine now :)
In reply to Arnaud Morin (Tue, 2024-10-01 at 09:13 +0000):
I'm glad you managed to make it work, but from a nova perspective we do not recommend using heartbeat_in_pthread=true with nova-compute, to the point that I would consider that config unsupported.
We also don't recommend using it with nova-api, even when running via a WSGI server such as mod_wsgi or uwsgi.
The only thing this has ever done is remove a cosmetic warning in the rabbit/nova logs caused by the heartbeat timing out. It has never fixed any functional bug that we were aware of, but it has resulted in several real bugs.
The most recent one we hit was https://launchpad.net/bugs/1983863, which was mitigated by https://review.opendev.org/c/openstack/oslo.log/+/852443. However, that mitigation uses an unsafe debug option in eventlet, eventlet.debug.hub_prevent_multiple_readers(False).
While you may be able to make heartbeat_in_pthread work with a lot of effort, as Takashi noted this will eventually go away when we remove eventlet. To enable that removal we need to replace the PipeMutex that currently fixes logging in a native thread, so heartbeat_in_pthread is part of the technical debt we need to remove to eventually allow us to move away from eventlet entirely.
In reply to smooney@redhat.com (01.10.24 - 11:38):
Hey,
I totally agree that heartbeat_in_pthread and the oslo.log PipeMutex are technical debt that we need to get rid of, as is eventlet.
However, although it seems purely cosmetic on your side, we believe it's not. I can't prove or reproduce the issue on a small infrastructure, but at large scale, having those TCP connections dropped by RabbitMQ and recreated in a loop by agents definitely affects the cluster.
I know all the pain that these settings introduced in the past, but I feel we are now in a stable situation, which is why I am surprised that heartbeat_in_pthread is being deprecated now.
Can we at least make sure we keep all of this until we switch off eventlet? In other words, can we get rid of eventlet first and then remove this parameter, and not the opposite?
Regards,
Arnaud
In reply to Arnaud Morin (01/10/2024 at 16:31):
Hi,
I am not an expert in these matters, but we recently suffered from client disconnections in RabbitMQ due to the heartbeat timeout, and I confirm it was a disaster for cloud usage, with many things not working properly (we are running Antelope, except for Barbican/Heat/Magnum, where we run Caracal). The cause is still not clear to me. It was fixed by increasing the heartbeat timeout, but at the same time my colleague who implemented the change also set heartbeat_in_pthread=true for all the services, something normally unnecessary as we configure uwsgi or Apache to use only one thread (and several processes). Initially we didn't see any bad impact from this setting, but a few days ago users started to report that Magnum cluster creation was failing due to a "response timeout" in Heat during the master software deployment.
Reading this thread this morning, it occurred to me that this could be the source of the problem (the service was running properly a couple of weeks ago, before the change). We reverted the change, setting heartbeat_in_pthread=false, and it restored the normal behaviour of Heat. We have not seen a negative impact on other services so far. So I confirm that setting this parameter to false by default seems a good idea and that setting it to true can break some services like Heat.
Cheers,
Michel
In reply to Michel Jouvin (Tue, 2024-10-01 at 22:32 +0200), on reverting to heartbeat_in_pthread=false restoring the normal behaviour of Heat:
Thank you for the data point. I'm sure you will monitor the situation in your cloud, but please let us know in a week or two whether the Heat/Magnum issues you observed return or whether the cloud continues to function normally. I expect it will, but again it would be a good data point.
In reply to Arnaud Morin (01/10/2024 at 16:31), on being surprised that heartbeat_in_pthread is deprecated now:
Deprecating a config option requires the deprecation to be advertised in a SLURP release before the option can be removed in a following release. Given that the deprecation was done in Dalmatian (2024.2), which is not a SLURP release, the removal cannot take effect in 2025.1; 2025.2 is the earliest release in which we could remove this option.
As a result, I think maintaining the deprecation is correct here. We may decide not to remove the option until 2026.1 or later, but I think it is correct to send the message that people should avoid changing this option from its default of False. We could even tag this option as advanced to make that clearer:
https://docs.openstack.org/oslo.config/latest/reference/defining.html#advanc...
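The release arithmetic above can be checked with a small Python sketch. It encodes only the rule as stated in this thread: a deprecation must ship in a SLURP release before the option can be removed in a following release. It also assumes, for illustration, that the first release of each year (YYYY.1) is the SLURP release. The function names are hypothetical, and this is not an official statement of the deprecation policy.

```python
def next_release(release):
    """Return the release after `release`, given as a (year, number) tuple."""
    year, number = release
    return (year, 2) if number == 1 else (year + 1, 1)


def is_slurp(release):
    """Assumption: the first release of each year (YYYY.1) is the SLURP release."""
    return release[1] == 1


def earliest_removal(deprecated_in):
    """Earliest release that may drop an option deprecated in `deprecated_in`.

    The deprecation has to ship in a SLURP release first; removal can then
    happen in the release that follows that SLURP release.
    """
    release = deprecated_in
    while not is_slurp(release):
        release = next_release(release)
    return next_release(release)


def fmt(release):
    """Format a (year, number) tuple as 'YYYY.N'."""
    return f"{release[0]}.{release[1]}"


# Deprecated in 2024.2 (not SLURP): first advertised in SLURP 2025.1,
# so the earliest possible removal is 2025.2.
print(fmt(earliest_removal((2024, 2))))
```

This gives only a lower bound. As noted above, the actual removal may be postponed to 2026.1 or later.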
In reply to smooney@redhat.com (02/10/2024 at 11:44):
Hi Sean,
As for the situation in our cloud after reverting to heartbeat_in_pthread=false, it was black and white: creating a cluster had been impossible (because of the Heat issue mentioned) ever since we changed to heartbeat_in_pthread=true (we didn't realize it immediately, as we don't create clusters every day), and it started working properly again immediately after reverting to heartbeat_in_pthread=false. There is a clear link between this parameter and Heat's behaviour (Caracal version in our case, so Oslo client 24.0).
As for your last sentence, "people should avoid cahnging this option form tis default fo False.", I think/hope you wanted to say the opposite: "people should avoid changing this option from its default to True."...
Michel
On Wed, 2024-10-02 at 13:06 +0200, Michel Jouvin wrote:
Hi Sean,
As for the situation in our cloud after reverting to heartbeat_in_pthread=false, it was a "black and white situation": creating a cluster was impossible (because of the Heat issue mentioned) since we changed to heartbeat_in_pthread=true (but we didn't realize immediately as we don't create clusters everyday) and restarted to work properly immediately after reverting to heartbeat_in_pthread=false. There is a clear link between this parameter and Heat behaviour (Caracal version in our case, so Oslo client 24.0).
As for your last sentence "people should avoid cahnging this option form tis default fo False.", I think/hope you wanted to say the opposite : "people should avoid changing this option form its default to True."...
no at least for nova we default to false in our downstream product https://github.com/openstack-k8s-operators/nova-operator/blob/main/templates... we had signifcant ci usee when we had it set to true orgianly because of the oslo.log issue but we didn not revert to enabling this for nova-api after that was backported because we have not seen any sideefct from seting it to false. we have never set this to true in OSP to my knolage for nova-comptue puppet-nova considered it experimental https://opendev.org/openstack/puppet-nova/src/commit/17bd61e042591305e461e5c... in tripleo we disabled it in many services including heat and nova https://github.com/openstack-archive/tripleo-heat-templates/commit/cf4d4f881... in kolla it also default to false for nova-compute and other eventlet services https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f71... although it is enabel for nova-api and some heat compoents https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f71... https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f71... at least for nova i would not recommend using heartbeat_in_pthread = True for any serrvice nova-api runnign under uwsgi or mod_wsgi is the only possibel excpetion and even then i woudl discurage it. i cant really speak to other services but i think `heartbeat_in_pthread = false` is generally the correct default.
Michel
Le 02/10/2024 à 11:44, smooney@redhat.com a écrit :
Hi,
I am not an expert in these matters but we recently suffered the problem of client deconnection in RabbitMQ due to the heartbeat timeout and I confirm it was a disaster for the cloud usage with many things not working properly (we are running Antelope, except Barbican/Heat/Magnum where we run Caracal). The reason is still not clear for me, it was fixed by increasing the heartbeat timeout but at the same time, my colleague who implemented the change also defined heartbeat_in_pthread=true for all the services, something normally unnecessary as we configure uwsgi or Apache to use only one thread (and several processes). Initially we didn't see any bad impact of this setting but a few days ago users started to report that Magnum cluster creation was failing due a "response timeout" in Heat during the master software deployment.
Reading this thread this morning I had the idea if could be the source of the problem (as the service was running properly a couple of weeks ago, before the change). We reverted the change and defined heartbeat_in_pthread=false and it restored the normal behaviour of Heat. We have not seen a negative impact on other services so far. So I confirm that setting this parameter to false by default seems a good idea and that setting it to true can break some services like Heat.
On Tue, 2024-10-01 at 22:32 +0200, Michel Jouvin wrote: thank you for the data point, im sure you will monitor the situration in your clodu but please let use know in a week or two if the heat/magnum issues you obsevered retrun or if the could continue to fucntion normally, i expect it to but again it would be a good data point.
Cheers,
Michel
Le 01/10/2024 à 16:31, Arnaud Morin a écrit :
Hey,
I totally agree about the fact that heartbeat_in_pthread and the oslo.log PipeMutex are technical debt that we need to get rid of, as well as eventlet.
However, despite the fact that it seems purely cosmetic on your side, we believe it's not. I can't prove / reproduce the issue on a small infra, but definetely, at large scale, having those tcp connections to be dropped by rabbitmq and recreated in a loop by agents is affecting the cluster.
I know all the pain that these settings introduced in the past, but now I feel we are in a stable situation regarding this, that's why I am surprised about deprecating heartbeat_in_pthread now. deprecateign a config option requires the deprecation to be advertised in a slrup before it can then be removed in a follwoing release. Given the deprecation was done in dalmaition 2024.2 which is not a slurp release the removal cannot take effect in 2025.1, 2025.2 is the earliest release we could remove this option.
as a result i think maintaining the deprecation is correct here. we may decied not to remove this until 2026.1 or later but i think its correct to send the message that people should avoid cahnging this option form tis default fo False. we coudl even tag this option as advanced to make that more clear https://docs.openstack.org/oslo.config/latest/reference/defining.html#advanc...
Can we, as least, make sure we keep all of this until we switch off eventlet? In other words, can we get rid of eventlet, then remove this params? and not the opposite?
Regards,
Arnaud
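The release arithmetic in the SLURP reply quoted above can be sketched in a few lines of Python. This is only an illustration, assuming that the first release of each year (YYYY.1) is the SLURP release and that removal is allowed in the release right after the first SLURP release that carries the deprecation. The function names are made up for this sketch and are not part of any OpenStack tooling.

```python
def next_release(year, half):
    """Return the release that follows (year, half); two releases ship per year."""
    return (year, 2) if half == 1 else (year + 1, 1)


def is_slurp(year, half):
    # Assumption: the first release of each year (YYYY.1) is the SLURP release.
    return half == 1


def earliest_removal(year, half):
    """Earliest release that may drop an option deprecated in (year, half)."""
    # Walk forward to the first SLURP release that ships the deprecation...
    while not is_slurp(year, half):
        year, half = next_release(year, half)
    # ...and allow removal in the release right after it.
    return next_release(year, half)


# Deprecated in Dalmatian (2024.2): advertised in 2025.1, removable from 2025.2.
print("%d.%d" % earliest_removal(2024, 2))
```

Under these assumptions the sketch agrees with the reasoning above: a deprecation landing in 2024.2 cannot lead to a removal before 2025.2.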
Hi Sean,
Not sure why we misunderstood each other but we agree! I understood your sentence as "people should avoid changing this option from its default (False) to True." but I understand now you mean the opposite and I totally agree based on our experience. Heat seems to be another service that will be in trouble if it is changed.
Michel
On 02/10/2024 at 14:17, smooney@redhat.com wrote:
On Wed, 2024-10-02 at 13:06 +0200, Michel Jouvin wrote:
Hi Sean,
As for the situation in our cloud after reverting to heartbeat_in_pthread=false, it was a "black and white situation": creating a cluster was impossible (because of the Heat issue mentioned) since we changed to heartbeat_in_pthread=true (but we didn't realize immediately as we don't create clusters every day) and it started working properly again immediately after reverting to heartbeat_in_pthread=false. There is a clear link between this parameter and Heat behaviour (Caracal version in our case, so Oslo client 24.0).
As for your last sentence "people should avoid changing this option from its default of False.", I think/hope you wanted to say the opposite: "people should avoid changing this option from its default to True."...
No, at least for Nova we default to false in our downstream product
https://github.com/openstack-k8s-operators/nova-operator/blob/main/templates...
We had significant CI issues when we originally had it set to true, because of the oslo.log issue, but we did not go back to enabling this for nova-api after the fix was backported, because we have not seen any side effects from setting it to false.
We have never set this to true in OSP for nova-compute, to my knowledge.
puppet-nova considered it experimental https://opendev.org/openstack/puppet-nova/src/commit/17bd61e042591305e461e5c...
In TripleO we disabled it in many services, including Heat and Nova https://github.com/openstack-archive/tripleo-heat-templates/commit/cf4d4f881...
In Kolla it also defaults to false for nova-compute and other eventlet services https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f71... although it is enabled for nova-api and some Heat components https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f71... https://github.com/openstack/kolla-ansible/blob/2218b7852fda94d0f498d5140f71...
At least for Nova, I would not recommend using heartbeat_in_pthread = True for any service. nova-api running under uwsgi or mod_wsgi is the only possible exception, and even then I would discourage it.
I can't really speak to other services, but I think `heartbeat_in_pthread = false` is generally the correct default.
Michel
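For operators who want to check what a given service file actually sets, a small script along these lines might help. It is a minimal sketch using only the Python standard library. It assumes the option lives in the `[oslo_messaging_rabbit]` section of the service's configuration file and treats an absent option as the default `False`. The helper name `heartbeat_in_pthread_enabled` and the sample text are made up for illustration.

```python
import configparser

SAMPLE_CONF = """
[DEFAULT]
debug = false

[oslo_messaging_rabbit]
heartbeat_in_pthread = false
"""


def heartbeat_in_pthread_enabled(conf_text):
    """Return True only if the config text explicitly enables the option."""
    # strict=False tolerates repeated keys; interpolation=None leaves '%' alone.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    # A missing section or option falls back to False, the default discussed here.
    return parser.getboolean(
        "oslo_messaging_rabbit", "heartbeat_in_pthread", fallback=False
    )


print(heartbeat_in_pthread_enabled(SAMPLE_CONF))
```

Note that a real service may merge several configuration files and directories, so a check like this only reflects the single file it is given.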
Hey folks,
The major concern behind this thread is that RabbitMQ connections drop due to the absence of a reliable heartbeat. While `heartbeat_in_pthread=True` aimed to fix this, it introduced other bugs.
Indeed, the greenlet documentation is pretty clear: the limitations between Python threads and greenlets lead to issues. As eventlet is itself based on greenlet, this leads to recurring issues in our stacks. The heartbeat_in_pthread bugs are living examples of this kind of issue.
For this reason, we support keeping `heartbeat_in_pthread` disabled by default.
As a workaround, adjusting the RabbitMQ `heartbeat_timeout` and `rabbit_heartbeat_timeout_threshold` can mitigate connection drops. Additionally, oslo.messaging offers the `connection_retry_interval` and `connection_retry_backoff` parameters, which implement retry backoff strategies to better handle connection drops, so that the system can manage reconnections more efficiently.
We encourage investigating these paths to mitigate the connection problems.
For more details please read:
- https://greenlet.readthedocs.io/en/latest/python_threads.html
- https://docs.openstack.org/oslo.messaging/xena/configuration/opts.html#oslo_...
- https://www.rabbitmq.com/docs/heartbeats
- https://docs.openstack.org/oslo.messaging/xena/configuration/opts.html#oslo_...
- https://docs.openstack.org/oslo.messaging/xena/configuration/opts.html#oslo_...
On Thu, 3 Oct 2024 at 22:46, Michel Jouvin <michel.jouvin@ijclab.in2p3.fr> wrote:
Hi Sean,
Not sure why we misunderstood each other but we agree! I understood your sentence as "people should avoid changing this option from its default (False) to True." but I understand now you mean the opposite and I totally agree based on our experience. Heat seems to be another service that will be in trouble if it is changed.
Michel
--
Hervé Beraud
Senior Software Engineer at Red Hat
irc: hberaud
https://github.com/4383/
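To make the retry backoff idea from the message above concrete, here is a rough sketch of how a reconnect delay schedule could grow. It is not oslo.messaging's implementation: it assumes an additive scheme in which the delay starts at an interval, grows by a fixed backoff after each failed attempt, and is capped at a maximum. The function `retry_delays` and its parameter names are hypothetical, loosely modelled on the `connection_retry_interval` and `connection_retry_backoff` options mentioned above.

```python
def retry_delays(interval, backoff, interval_max, attempts):
    """Return the delay (in seconds) to wait before each reconnect attempt."""
    delays = []
    delay = interval
    for _ in range(attempts):
        # Never wait longer than the configured ceiling.
        delays.append(min(delay, interval_max))
        # Grow the delay after every failed attempt (additive backoff).
        delay += backoff
    return delays


# Start at 1s, add 2s per failure, never wait more than 10s.
print(retry_delays(interval=1, backoff=2, interval_max=10, attempts=7))
```

The point of such a schedule is that agents which lose their connection do not all hammer the broker at a fixed short interval, which is the reconnect loop described earlier in the thread.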
On 10/1/24 23:31, Arnaud Morin wrote:
Hey,
I totally agree about the fact that heartbeat_in_pthread and the oslo.log PipeMutex are technical debt that we need to get rid of, as well as eventlet.
However, despite the fact that it seems purely cosmetic on your side, we believe it's not. I can't prove / reproduce the issue on a small infra, but definitely, at large scale, having those TCP connections dropped by RabbitMQ and recreated in a loop by agents is affecting the cluster.
I know all the pain that these settings introduced in the past, but now I feel we are in a stable situation regarding this, that's why I am surprised about deprecating heartbeat_in_pthread now.
Can we, at least, make sure we keep all of this until we switch off eventlet? In other words, can we get rid of eventlet, then remove this param, and not the opposite?
That's the plan. We deprecated the parameter because it is no longer useful *ONCE* we get rid of eventlet completely. The parameter will be removed ONLY AFTER the eventlet removal is done.
Regards,
Arnaud
On 01.10.24 - 11:38, smooney@redhat.com wrote:
I'm glad you managed to make it work, but from a Nova perspective we do not recommend using heartbeat_in_pthread=true with nova-compute, to the point that I would consider that config unsupported.
We also don't recommend using it with nova-api, even when running via a WSGI server such as mod_wsgi or uwsgi.
The only thing this has ever done is remove a cosmetic warning in the rabbit/nova logs due to the heartbeat timing out. This has never fixed any functional bug that we were aware of, but it has resulted in several real bugs.
The most recent one we hit was https://launchpad.net/bugs/1983863 which was mitigated by https://review.opendev.org/c/openstack/oslo.log/+/852443 but that mitigation uses an unsafe debug option in eventlet, eventlet.debug.hub_prevent_multiple_readers(False).
While you may be able to make heartbeat_in_pthread work with a lot of work, as Takashi noted this will eventually go away when we remove eventlet. To enable that removal we need to replace the PipeMutex that currently fixes logging in a native thread, so heartbeat_in_pthread is part of the technical debt we need to remove to eventually allow us to move away from eventlet entirely.
On Tue, 2024-10-01 at 09:13 +0000, Arnaud Morin wrote:
Yes, I agree that it used to be broken, but since the bug was reported, we merged the following fixes:
https://review.opendev.org/c/openstack/oslo.messaging/+/894731 https://review.opendev.org/c/openstack/oslo.messaging/+/875615 https://review.opendev.org/c/openstack/oslo.messaging/+/876318
That's why I believe everything should be fine now :)
participants (5)
- Arnaud Morin
- Herve Beraud
- Michel Jouvin
- smooney@redhat.com
- Takashi Kajinami