On 27/11/2025 15:06, Nell Jerram wrote:
> Have edited the subject for this subthread so as not to confuse with
> OP's query - hope that's helpful...
>
> On Thu, Nov 27, 2025 at 2:08 PM Sean Mooney <smooney@redhat.com> wrote:
>
>
>
> On 27/11/2025 13:26, Nell Jerram wrote:
> > Could enable_qemu_monitor_announce_self blocking be responsible for 12
> > _minutes_ of delay? That sounds huge!
> I don't see any other way that that config option could have an effect
> and be responsible for the reported issue.
>
> It does not make sense that changing that value would actually affect
> this at all.
>
> If that was a blocking call and it did not return, it might explain the
> delay; otherwise my actual opinion is this is a coincidence.
>
> >
> > Also, can I ask if this is _only_ a problem with the OpenStack status
> > reporting (i.e. "openstack server migration list")? Or does it also
> > affect the actual liveness of the migrated instance?
> If it's related to enable_qemu_monitor_announce_self it can't affect the
> liveness of the instance, and it would only be a reporting issue.
>
> I think this is much more likely to be related to this feature request
> https://bugs.launchpad.net/nova/+bug/2128665
> https://blueprints.launchpad.net/nova/+spec/refine-network-setup-procedure-in-live-migrations
> and the comment thread we discussed
> https://review.opendev.org/c/openstack/nova/+/966106/1/nova/virt/libvirt/host.py#297
>
> The tl;dr is there is a kernel bug
> https://lore.kernel.org/all/20240626191830.3819324-1-yang@os.amperecomputing.com/
> that is only fixed in 6.13, which can cause the source VM to take minutes
> to stop as it waits for the kernel to deallocate the memory. We do not
> actually mark the live migration as complete until after that is done.
>
> So I think that is why it's taking minutes for the status to go to
> complete.
>
> >
> > (Coincidentally, I am also currently investigating live migration.
> > I'm seeing a problem where data transfer on an existing
> connection to
> > the instance is held up for about 12 seconds after the migration
> has
> > completed.)
> I'm not sure, but maybe that is related to the kernel bug? Libvirt does
> have to do more than just transfer the data before it can complete the
> migration or unpause the VM on the dest, but I don't know the details
> well enough to say exactly what that entails.
>
>
> Thanks Sean. To clarify/record a few details of my case:
> - I'm using the Calico Neutron driver, so any OVN details won't be
> relevant here. Calico currently "handles" live migration by deleting
> the route for the instance IP via the old node and creating a route to
> the instance IP via the new node, at the point where Neutron changes
> the port's "binding:host_id" to the new node.
> - Empirically, there's a window of about 1.5s between the old route
> disappearing and the new route appearing, on the relevant intermediate
> routers. During this window packets on the connection get
> retransmitted; the window doesn't cause the connection to drop.
> - Immediately after the window I see packets routed through to the
> instance (now on the new node) - but it then takes another 12 seconds
> before the instance starts responding to those.
>
> I think my next step is to research what the Neutron binding:host_id
> transition point corresponds to in Nova and libvirt terms, and then
> review if the situation correlates with the bug that you mentioned.
So I think we discussed this a bit when we were fixing the Calico
integration for Nova. The expected behavior is that in pre-live-migration
(while the VM is still running on the source) Nova will create a second
port binding for the destination host. For most backends like OVS, this
is when the OVS port would be created on the destination host. For
Calico, this is when we should be creating the tap device on the dest. We
generally refer to the creation of the L1 port on the network backend
(logically, or actually creating it in the case of a tap) as port
plugging. Port plugging happens after the inactive port binding is
created for the destination host and is bound by the Neutron ML2 driver.
The expected behavior is that the Neutron backend will wire up the
logical port on the destination such that the VM can send packets when it
is created with that logical port by QEMU.
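
(As an illustrative aside only: one way to watch those two bindings
during a migration is to poll the Neutron "port bindings extended" API,
sketched below with openstacksdk. The cloud name and port UUID are
placeholders, and the raw GET against the bindings sub-resource is an
assumption to verify against your deployment.)

    # Illustrative sketch only: poll the Neutron "bindings" sub-resource
    # for the instance port during a live migration, to watch the INACTIVE
    # binding for the destination host appear in pre-live-migration.
    # Assumes openstacksdk and the binding-extended API extension.
    import time

    import openstack

    conn = openstack.connect(cloud="mycloud")   # hypothetical cloud name
    port_id = "PORT_UUID_OF_THE_INSTANCE_VIF"   # placeholder

    for _ in range(120):
        # The network proxy is a keystoneauth adapter, so a raw GET works.
        resp = conn.network.get("/ports/%s/bindings" % port_id)
        bindings = resp.json().get("bindings", [])
        print(time.strftime("%H:%M:%S"),
              [(b.get("host"), b.get("status")) for b in bindings])
        time.sleep(1)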
When Libvirt creates the VM on the dest for live migration, it does so in
the paused state so that it can do the memory/disk copy. Then, when the
migration is complete, right before the VM is unpaused on the dest, QEMU
sends 3 RARP packets to update the network with MAC-learning frames. Now
for Calico these broadcast frames are not required for packet flow to
work, but they are still sent.
The port binding on the destination host is only activated in
post-live-migration. https://bugs.launchpad.net/nova/+bug/2128665
describes how
https://lore.kernel.org/all/20240626191830.3819324-1-yang@os.amperecomputing.com/
can result in post-live-migration being delayed by tens of seconds while
Nova blocks on getting the result of the migration-complete job from
Libvirt, due to Libvirt waiting for the QEMU process to be terminated by
the kernel. If you are using post-copy live migration, you can sidestep
the kernel issue, as we will trigger post-live-migration earlier and
activate the port binding for the destination sooner.
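
(For reference, post-copy is opt-in on the compute nodes; assuming your
libvirt/QEMU and kernel support it, the relevant nova.conf fragment is
roughly:)

    [libvirt]
    # Allow Nova to switch an ongoing live migration to post-copy, so the
    # VM is switched to the destination before all memory has been copied.
    live_migration_permit_post_copy = True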
The 1.5-second route propagation is likely a combination of the time it
takes from when we activate the Neutron port binding to when that is seen
by Calico, Calico recomputing the routes, and the routes propagating. I'm
not sure how much we can do to optimize that, as only a small subset of
that time is actually in Nova/Neutron.

The most important optimization to this workflow that we have done in the
last few years that would apply to Calico is
https://opendev.org/openstack/nova/commit/26fbc9e8e7d353e66739f910865d0b6498811bb0?style=split&whitespace=show-all&show-outdated=.
Prior to that, we would not activate the destination port binding until
after we had cleaned up the Cinder block devices on the source host. That
could cause the VM to be running on the destination for a number of
seconds before Nova would activate the destination port binding. This is
especially true if there is a bug in your storage vendor's SAN that
causes that to take tens of seconds to respond...

Regarding the 12 seconds for the VM to respond, that sounds like an issue
in the guest, not at the infra level. If you counted it from the time the
guest was unpaused, part of that could be the time it took to activate
the port binding, but if it is just from the point the route was updated,
that points to an issue in the guest. I would recommend creating 2 VMs
and having them both ping each other, then live migrating one of them and
seeing if that changes the behavior or not. If you see a reduction in the
downtime when you have pings flowing in both directions, that generally
implies there is some cached state in the guest routing table or ARP
table, or in your core network, that is a factor. The fact that you're
seeing the packets reach the tap on the destination, however, implies it
is guest side.

regards
sean
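
(For reference, the two-VM experiment Sean suggests could be driven with
something like the openstacksdk sketch below; the cloud and server names
are placeholders, and live_migrate_server() plus the admin-only
compute_host field are assumptions to double-check against your SDK
version.)

    # Illustrative sketch: live migrate one of the two pinging VMs and
    # record when its status and host change, to line up with downtime.
    import time

    import openstack

    conn = openstack.connect(cloud="mycloud")     # hypothetical cloud name
    server = conn.compute.find_server("vm-a")     # one of the pinging VMs

    before = conn.compute.get_server(server.id)
    print("starting host:", before.compute_host)  # admin-only field

    conn.compute.live_migrate_server(server, host=None,
                                     block_migration="auto")

    while True:
        s = conn.compute.get_server(server.id)
        print(time.strftime("%H:%M:%S"), s.status, s.compute_host)
        if s.status == "ACTIVE" and s.compute_host != before.compute_host:
            break
        time.sleep(1)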
Many thanks Sean. Right now I'm most worried about the 12s gap _after_ activation of port bindings on the destination. As you say, this may be a guest problem; I'm continuing to investigate. But it's useful to be aware also of possible concerns before that activation happens, and of work to move that earlier. We are also still planning to leverage the pre-live-migration point in Neutron in order to start setting up networking for the destination VM earlier in the process - does Nova create the TAP interface on the destination at that point? - but currently the gap _after_ activation is a bigger concern for us.
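
For concreteness, one way to timestamp that binding:host_id transition is
a small openstacksdk poller along these lines; the cloud name and server
UUID are placeholders, and the binding_host_id attribute mapping is from
memory, so worth verifying:

    # Illustrative sketch: print a timestamp whenever binding:host_id
    # changes on the instance's port, to correlate with route changes and
    # with when the guest starts answering again.
    import time

    import openstack

    conn = openstack.connect(cloud="mycloud")    # hypothetical cloud name
    server_id = "SERVER_UUID"                    # placeholder

    port = next(iter(conn.network.ports(device_id=server_id)))
    last_host = None
    while True:
        p = conn.network.get_port(port.id)
        if p.binding_host_id != last_host:
            print(time.strftime("%H:%M:%S"),
                  "binding:host_id ->", p.binding_host_id)
            last_host = p.binding_host_id
        time.sleep(0.5)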