[Nova] Nova-compute service flapping on Antelope
Hello,
Recently I encountered an interesting issue: on Antelope, the nova-compute service started to temporarily lose its RabbitMQ connection and regain it a few seconds later. The issue was reproduced on a test environment, although there are no specific steps to reproduce it; it just starts to happen after some time.
I pasted a log example in the bug I opened [1].
First we can see RabbitMQ logging that it closed the AMQP connection, followed by nova-compute reporting that the RabbitMQ server is unreachable, with the next message being a successful reconnection. Restarting the nova-compute services helps temporarily, but the issue starts to manifest again after a few days. Initially the disconnects occur once every few hours, then at progressively shorter intervals, to the point where they happen every minute or so.
Did anyone encounter something similar?
[1] https://bugs.launchpad.net/nova/+bug/2092297
Pozdrawiam / Best regards,
Jakub Darmach
Hi,
We hit the same issue in our deployment, which runs the Dalmatian release. In our case, nova-compute's libvirt access does not trigger a thread context switch for more than 120 seconds, and this long-running task causes some heartbeat tasks to fail.
I pushed a fix [1] to Gerrit.
[1] https://review.opendev.org/c/openstack/nova/+/938215
Best regards,
Masahito
-----Original Message-----
From: "Jakub Darmach" <jakub.darmach@gmail.com>
To: <openstack-discuss@lists.openstack.org>
Sent: 2024/12/23 (Mon) 22:34 (GMT+09:00)
Subject: [Nova] Nova-compute service flapping on Antelope
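The mechanism described in the reply above (a long call that never yields, so heartbeat tasks cannot run) can be illustrated with a small sketch. This is only an analogy under stated assumptions: nova-compute uses eventlet green threads, while this example uses the standard library's asyncio as a stand-in cooperative scheduler, and the names heartbeat, long_blocking_call, run and measure are hypothetical. It does not represent the patch in [1] or any nova code, and the durations are scaled down from the 120 seconds mentioned above.

```python
import asyncio
import time


async def heartbeat(interval, gaps):
    """Record the real time elapsed between consecutive heartbeat ticks."""
    last = time.monotonic()
    while True:
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


def long_blocking_call(duration):
    """Hypothetical stand-in for a long call that never yields."""
    time.sleep(duration)


async def run(offload, interval=0.02, duration=0.3):
    """Return the largest heartbeat gap seen around the long call."""
    gaps = []
    task = asyncio.create_task(heartbeat(interval, gaps))
    await asyncio.sleep(interval * 3)  # let a few heartbeats tick first
    if offload:
        # Run the blocking call in a worker thread; the loop stays free.
        await asyncio.to_thread(long_blocking_call, duration)
    else:
        # Run it inline; nothing else on the loop can execute meanwhile.
        long_blocking_call(duration)
    await asyncio.sleep(interval * 3)  # let the heartbeat record the gap
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(gaps)


def measure(offload):
    """Run one experiment in a fresh event loop."""
    return asyncio.run(run(offload))


if __name__ == "__main__":
    print("largest gap, inline call:   ", measure(offload=False))
    print("largest gap, offloaded call:", measure(offload=True))
```

When the call runs inline, the largest gap between heartbeats is at least roughly as long as the call itself; when it is offloaded, the heartbeat keeps ticking near its configured interval. If a real heartbeat gap exceeds what the broker tolerates, the broker closing the connection and the client reconnecting would match the symptoms reported in this thread.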
Hello,
For those interested who haven't followed the discussion on Gerrit and Launchpad, another patch addressing this issue was merged to master and has now been backported to all stable branches:
https://review.opendev.org/q/I60d6f04d374e9ede5895a43b7a75e955b0fea3c5
Best regards,
Pierre Riteau (priteau)
On Tue, 24 Dec 2024 at 10:07, Masahito Muroi <masahito.muroi@lycorp.co.jp> wrote:
Hi,
We hit the same issue in our deployment, which runs the Dalmatian release. In our case, nova-compute's libvirt access does not trigger a thread context switch for more than 120 seconds, and this long-running task causes some heartbeat tasks to fail.
I pushed a fix [1] to Gerrit.
[1] https://review.opendev.org/c/openstack/nova/+/938215
best regards, Masahito
participants (3)
- Jakub Darmach
- Masahito Muroi
- Pierre Riteau