I have edited the subject for this subthread so as not to confuse it with the OP's query - hope that's helpful...
On 27/11/2025 13:26, Nell Jerram wrote:
> Could enable_qemu_monitor_announce_self blocking be responsible for 12
> _minutes_ of delay? That sounds huge!
I don't see any other way that that config option could have an effect and
be responsible for the reported issue.
It does not make sense that changing that value would actually affect
this at all.
If that were a blocking call and it did not return, that might explain the
delay; otherwise my actual opinion is that this is a coincidence.
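For what it's worth, that option boils down to Nova sending the "announce-self"
QMP command to QEMU through the libvirt monitor. Below is a minimal sketch of
how you could time that call yourself with the libvirt python bindings to see
whether it blocks - this is not Nova's actual code, and the domain name and
announce timings are placeholders:

    # Rough timing check for the announce-self QMP command; not what Nova
    # does verbatim. Domain name and announce schedule are placeholders.
    import json
    import time

    import libvirt
    import libvirt_qemu

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("instance-00000001")  # placeholder domain name

    cmd = json.dumps({
        "execute": "announce-self",
        # illustrative GARP/RARP announce schedule (milliseconds / rounds)
        "arguments": {"initial": 50, "max": 550, "rounds": 5, "step": 100},
    })

    start = time.monotonic()
    # 0 == default flags, i.e. treat the command as QMP rather than HMP
    reply = libvirt_qemu.qemuMonitorCommand(dom, cmd, 0)
    print("announce-self returned after %.3fs: %s"
          % (time.monotonic() - start, reply))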
>
> Also, can I ask if this is _only_ a problem with the OpenStack status
> reporting (i.e. "openstack server migration list")? Or does it also
> affect the actual liveness of the migrated instance?
If it's related to enable_qemu_monitor_announce_self it can't affect the
liveness of the instance; it would only be a reporting issue.
I think this is much more likely to be related to this feature request
https://bugs.launchpad.net/nova/+bug/2128665
https://blueprints.launchpad.net/nova/+spec/refine-network-setup-procedure-in-live-migrations
and the comment thread we discussed
https://review.opendev.org/c/openstack/nova/+/966106/1/nova/virt/libvirt/host.py#297
The TL;DR is that there is a kernel bug
https://lore.kernel.org/all/20240626191830.3819324-1-yang@os.amperecomputing.com/
that is only fixed in 6.13, which can cause the source VM to take minutes
to stop as it waits for the kernel to deallocate the memory. We do not
actually mark the live migration as complete until after that has finished,
so I think that is why it is taking minutes for the status to go to complete.
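If you want to confirm where the time is going on the source host, one rough
way (a sketch, not Nova code) is to timestamp libvirt lifecycle events during
the migration and see how long the final teardown of the source qemu process
takes:

    # Sketch: print timestamped lifecycle events for all domains on the
    # source hypervisor, so the gap between "migration data done" and the
    # domain actually stopping is visible.
    import datetime

    import libvirt

    def lifecycle_cb(conn, dom, event, detail, opaque):
        # event/detail are integers from the VIR_DOMAIN_EVENT_* enums
        print("%s domain=%s event=%d detail=%d" % (
            datetime.datetime.now().isoformat(), dom.name(), event, detail))

    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open("qemu:///system")
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle_cb, None)

    while True:
        libvirt.virEventRunDefaultImpl()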
>
> (Coincidentally, I am also currently investigating live migration.
> I'm seeing a problem where data transfer on an existing connection to
> the instance is held up for about 12 seconds after the migration has
> completed.)
I'm not sure, but maybe that is related to the kernel bug? libvirt does
have to do more than just transfer the data before it can complete the
migration or unpause the VM on the destination,
but I don't know the details well enough to say exactly what that entails.
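Along the same lines, a sketch that polls the migration job statistics on the
source domain would show when the data transfer itself finishes versus when
libvirt reports the job as done (the domain name is a placeholder):

    # Sketch: watch the live-migration job stats on the source domain.
    import time

    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("instance-00000001")  # placeholder

    while True:
        stats = dom.jobStats()  # virDomainGetJobStats()
        print("elapsed=%sms data_remaining=%s data_total=%s" % (
            stats.get("time_elapsed"),
            stats.get("data_remaining"),
            stats.get("data_total")))
        time.sleep(1)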
Thanks Sean. To clarify/record a few details of my case:
- I'm using the Calico Neutron driver, so any OVN details won't be relevant here. Calico currently "handles" live migration by deleting the route for the instance IP via the old node and creating a route to the instance IP via the new node, at the point where Neutron changes the port's "binding:host_id" to the new node.
- Empirically, there's a window of about 1.5s between the old route disappearing and the new route appearing, on the relevant intermediate routers. During this window packets on the connection get retransmitted; the window doesn't cause the connection to drop.
- Immediately after the window I see packets routed through to the instance (now on the new node) - but it then takes another 12 seconds before the instance starts responding to those.
I think my next step is to research what the Neutron binding:host_id transition point corresponds to in Nova and libvirt terms, and then check whether that timing correlates with the bug that you mentioned.
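For the record, here is a minimal sketch of how I might timestamp that
transition with openstacksdk, so it can be lined up against the routing window
and the point where the instance starts responding again (the cloud name and
port UUID are placeholders):

    # Sketch: poll the Neutron port and log when binding:host_id changes.
    import datetime
    import time

    import openstack

    conn = openstack.connect(cloud="mycloud")          # placeholder cloud name
    PORT_ID = "11111111-2222-3333-4444-555555555555"   # placeholder port UUID

    last_host = None
    while True:
        port = conn.network.get_port(PORT_ID)
        if port.binding_host_id != last_host:
            print("%s binding:host_id -> %s" % (
                datetime.datetime.now().isoformat(), port.binding_host_id))
            last_host = port.binding_host_id
        time.sleep(0.5)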