Hello,

For those interested who haven't followed the discussion on Gerrit and Launchpad, another patch addressing this issue was merged to master and has now been backported to all stable branches: https://review.opendev.org/q/I60d6f04d374e9ede5895a43b7a75e955b0fea3c5

Best regards,
Pierre Riteau (priteau)

On Tue, 24 Dec 2024 at 10:07, Masahito Muroi <masahito.muroi@lycorp.co.jp> wrote:
Hi,

We hit the same issue in our deployment, which runs the Dalmatian release. In our case, nova-compute's libvirt access does not trigger a thread context switch for more than 120 seconds, and this long-running task causes some heartbeat tasks to fail.
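
To illustrate the failure mode, here is a rough sketch (not nova or oslo.messaging code). It uses asyncio from the Python standard library as a stand-in for cooperative green threads, and all names and timings in it are hypothetical, scaled down from the 120 seconds mentioned above. A call that never yields to the scheduler keeps a periodic heartbeat task from running for as long as the call lasts:

```python
import asyncio
import time

HEARTBEAT_INTERVAL = 0.05  # seconds; hypothetical, scaled down for the demo
BLOCKING_CALL = 0.5        # stands in for the long libvirt call that never yields


async def heartbeat(stamps):
    """Record a timestamp every interval, like a periodic heartbeat sender."""
    while True:
        stamps.append(time.monotonic())
        await asyncio.sleep(HEARTBEAT_INTERVAL)


async def blocking_task():
    """Block the whole event loop: time.sleep() never yields to other tasks."""
    time.sleep(BLOCKING_CALL)


async def run_demo():
    stamps = []
    hb = asyncio.create_task(heartbeat(stamps))
    await asyncio.sleep(HEARTBEAT_INTERVAL * 3)  # heartbeats run normally
    await blocking_task()                        # heartbeat is starved here
    await asyncio.sleep(HEARTBEAT_INTERVAL * 3)  # heartbeats resume
    hb.cancel()
    try:
        await hb
    except asyncio.CancelledError:
        pass
    # Longest time between two consecutive heartbeats.
    return max(b - a for a, b in zip(stamps, stamps[1:]))


def longest_heartbeat_gap():
    return asyncio.run(run_demo())
```

Calling longest_heartbeat_gap() should return a gap close to BLOCKING_CALL rather than HEARTBEAT_INTERVAL. If such a gap exceeds the peer's heartbeat timeout, the peer may consider the connection dead and close it.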

I pushed a fix [1] to Gerrit.


best regards,
Masahito

-----Original Message-----
From: "Jakub Darmach"<jakub.darmach@gmail.com>
To: <openstack-discuss@lists.openstack.org>;
Cc:
Sent: 2024/12/23 (Mon) 22:34 (GMT+09:00)
Subject: [Nova] Nova-compute service flapping on Antelope

Hello,

Recently I encountered an interesting issue: on Antelope, the nova-compute service started to temporarily lose RabbitMQ connectivity, only to regain it after a few seconds. The issue was reproduced in a test environment, though without specific steps to reproduce it; it just starts to happen after some time.

I pasted a log example in the bug I opened [1].

First we can see RabbitMQ logging that it is closing the AMQP connection, followed by nova-compute reporting that the RabbitMQ server is unreachable, with the next message being a successful reconnection. Restarting the nova-compute services helps temporarily, but the issue starts to manifest again after a few days.
Initially the disconnects occur once every few hours, then at progressively shorter intervals, to the point where they happen every minute or so.

Did anyone encounter something similar?


Pozdrawiam / Best regards,
Jakub Darmach