Have edited the subject for this subthread so as not to confuse it with the OP's query - hope that's helpful... On Thu, Nov 27, 2025 at 2:08 PM Sean Mooney <smooney@redhat.com> wrote:
On 27/11/2025 13:26, Nell Jerram wrote:
Could enable_qemu_monitor_announce_self blocking be responsible for 12 _minutes_ of delay? That sounds huge!

I don't see any other way that that config option could have an effect and be responsible for the reported issue.
It does not make sense that changing that value would actually affect this at all.
If that was a blocking call and it did not return, that might explain the delay; otherwise my actual opinion is that this is a coincidence.
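For anyone following along, this is a per-compute-node option in nova.conf; a minimal sketch of where it lives (section name from memory - please verify against the config reference for your release):

```ini
# nova.conf on the compute host (section placement is my recollection,
# not verified against the docs for every release)
[workarounds]
# When enabled, nova asks the QEMU monitor to have the migrated guest
# announce itself (send announcement frames) after live migration.
enable_qemu_monitor_announce_self = False
```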
Also, can I ask if this is _only_ a problem with the OpenStack status reporting (i.e. "openstack server migration list")? Or does it also affect the actual liveness of the migrated instance?
If it's related to enable_qemu_monitor_announce_self, it can't affect the liveness of the instance, and it would only be a reporting issue.
I think this is much more likely to be related to this feature request https://bugs.launchpad.net/nova/+bug/2128665
https://blueprints.launchpad.net/nova/+spec/refine-network-setup-procedure-i... and the comment thread we discussed
https://review.opendev.org/c/openstack/nova/+/966106/1/nova/virt/libvirt/hos...
The tl;dr is that there is a kernel bug
https://lore.kernel.org/all/20240626191830.3819324-1-yang@os.amperecomputing... that is only fixed in 6.13, which can cause the source VM to take minutes to stop as it waits for the kernel to deallocate the memory. We do not actually mark the live migration as complete until after that has finished.
So I think that's why it's taking minutes for the status to go to complete.
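A quick first check on whether a given compute host could be affected is to compare its running kernel against 6.13, the release Sean mentions as carrying the fix:

```shell
# Print the running kernel release on the source compute host; versions
# below 6.13 may still have the slow memory-deallocation behaviour
# referenced in the lore thread above.
uname -r
```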
(Coincidentally, I am also currently investigating live migration. I'm seeing a problem where data transfer on an existing connection to the instance is held up for about 12 seconds after the migration has completed.)
I'm not sure, but maybe that is related to the kernel bug? libvirt does have to do more than just transfer the data before it can complete the migration or unpause the VM on the dest, but I don't know the details well enough to say what that entails.
Thanks Sean. To clarify/record a few details of my case:

- I'm using the Calico Neutron driver, so any OVN details won't be relevant here. Calico currently "handles" live migration by deleting the route for the instance IP via the old node and creating a route to the instance IP via the new node, at the point where Neutron changes the port's "binding:host_id" to the new node.
- Empirically, there's a window of about 1.5s between the old route disappearing and the new route appearing, on the relevant intermediate routers. During this window packets on the connection get retransmitted; the window doesn't cause the connection to drop.
- Immediately after the window I see packets routed through to the instance (now on the new node) - but it then takes another 12 seconds before the instance starts responding to those.

I think my next step is to research what the Neutron binding:host_id transition point corresponds to in Nova and libvirt terms, and then review if the situation correlates with the bug that you mentioned.