nova-api missing heartbeats to rabbitmq
Folks,

Recently I have been seeing lots of error messages in the rabbitmq logs reporting missed heartbeats from nova-api nodes. I am not seeing any issue at the functionality level, as everything is working fine, but I noticed those errors and am trying to find their root cause.

172.28.15.125 nova-api server
172.28.15.192 rabbitmq server

on rabbit.log

2020-03-24 12:21:41.389 [error] <0.29772.4418> closing AMQP connection <0.29772.4418> (172.28.15.125:42656 -> 172.28.15.192:5671 - uwsgi:32419:9b8a323b-653d-4585-9916-d52b3fd81d59): missed heartbeats from client, timeout: 60s

on nova-api.log

2020-03-24 12:19:06.554 32435 ERROR oslo.messaging._drivers.impl_rabbit [-] [4b8adff0-ff9f-4863-a939-537d391e5d9e] AMQP server on 172.28.15.192:5671 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer
On Tue, 2020-03-24 at 15:04 -0400, Satish Patel wrote:
> Folks,
> Recently I have been seeing lots of error messages in the rabbitmq logs reporting missed heartbeats from nova-api nodes. I am not seeing any issue at the functionality level, as everything is working fine, but I noticed those errors and am trying to find their root cause.

This is expected and a known issue. A release or two ago we introduced the use of eventlet monkey patching in the nova API to implement multi-cell scatter-gather requests, whereby we concurrently dispatch requests to all cells and then wait for the results instead of doing it serially.

A side effect of that change is that, now that the nova API is monkey patched explicitly, if you run it via uwsgi or mod_wsgi the heartbeat thread that was previously a full OS thread is now just a green thread. The WSGI server manages the lifetime of the API process and can put that thread to sleep or kill it.

At present there is nothing for the operator to do in regard to this message, and you should just ignore it, bar one caveat: if you are configuring your API, you should not scale it using threads but should instead scale it using processes. Deploying the API as a WSGI application with multiple threads per Python process can cause issues, so threads should always be set to 1 or left unset.

We have no real agreement on the long-term fix. In some environments, disabling the heartbeat and relying on the OS TCP keepalive config is one option. You can also revert to running the API using the built-in Python WSGI server instead of uwsgi. If you do this there is a performance penalty, so we don't really advise people to do that.

There have been mail threads on this topic in the past, but I do not have them to hand.
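The starvation described in the reply above can be pictured with a toy model in Python. This is only an illustrative sketch, not how oslo.messaging, eventlet, or uwsgi are actually implemented: the function names, the 30-second send interval, and the rule that a green thread only gets a turn when a request arrives are all assumptions made for the example. Only the 60-second timeout is taken from the rabbit.log line.

```python
# Toy model (hypothetical names throughout): a heartbeat sender that runs on
# its own schedule versus one that only runs when the host process is active.

HEARTBEAT_INTERVAL = 30  # assumed: seconds between heartbeats the client wants to send
BROKER_TIMEOUT = 60      # from rabbit.log: "missed heartbeats from client, timeout: 60s"


def heartbeat_times_os_thread(duration):
    """An OS thread wakes up by itself, whether or not requests arrive."""
    return list(range(HEARTBEAT_INTERVAL, duration + 1, HEARTBEAT_INTERVAL))


def heartbeat_times_green_thread(duration, request_times):
    """Assumption: the green thread only gets scheduled while the process is
    handling a request, so heartbeats can only go out at those moments."""
    sent = []
    last = 0
    for now in sorted(request_times):
        if now > duration:
            break
        if now - last >= HEARTBEAT_INTERVAL:
            sent.append(now)
            last = now
    return sent


def broker_closes_connection(sent, duration):
    """Return the time the broker would drop the connection, or None."""
    last_seen = 0
    for t in list(sent) + [duration]:
        if t - last_seen > BROKER_TIMEOUT:
            return last_seen + BROKER_TIMEOUT
        last_seen = t
    return None


if __name__ == "__main__":
    window = 300  # seconds simulated
    idle_api = [10, 20, 200]  # requests, then a long quiet period
    print("os thread:", broker_closes_connection(heartbeat_times_os_thread(window), window))
    print("green thread:", broker_closes_connection(
        heartbeat_times_green_thread(window, idle_api), window))
```

In this model the OS-thread sender never trips the broker timeout, while the green-thread sender does as soon as the API sits idle for longer than the timeout. That is consistent with the messages being noisy but harmless: the client simply reconnects when it next has work to do.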
On Tue, Mar 24, 2020 at 1:11 PM Satish Patel <satish.txt@gmail.com> wrote:
> Folks,
> Recently I have been seeing lots of error messages in the rabbitmq logs reporting missed heartbeats from nova-api nodes. I am not seeing any issue at the functionality level, as everything is working fine, but I noticed those errors and am trying to find their root cause.
ooo I know this one! See this archive thread:

http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005305.htm...

https://bugs.launchpad.net/nova/+bug/1825584

Thanks,
-Alex
On 3/24/20 13:04, Alex Schultz wrote:
> ooo I know this one! See this archive thread: http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005305.htm...
One more link for the pile:

https://docs.openstack.org/releasenotes/nova/stein.html#known-issues

This ^ is linked from the launchpad issue, but I link it here directly to make it easier.

Cheers,
-melanie
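For readers who want to try the operator-side options mentioned earlier in this thread (scaling with processes rather than threads, and disabling the AMQP heartbeat), the relevant settings might look roughly like the fragment below. Treat it as a sketch under assumptions: the worker count is an arbitrary placeholder, the file names and locations depend on your deployment tooling, and whether disabling heartbeats is acceptable depends on your environment and OS TCP keepalive settings.

```ini
# uwsgi configuration for nova-api (file name and path vary by deployment)
[uwsgi]
# Scale with worker processes, not threads. The value 4 is a placeholder.
processes = 4
# Keep one thread per Python process, as advised in the thread.
threads = 1

# nova.conf (optional, only if you choose to rely on OS TCP keepalive instead)
[oslo_messaging_rabbit]
# 0 disables the oslo.messaging AMQP heartbeat.
heartbeat_timeout_threshold = 0
```

The two sections belong to different files and are shown together only for brevity.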
participants (4)
- Alex Schultz
- melanie witt
- Satish Patel
- Sean Mooney