On Thu, Nov 27, 2025 at 3:50 PM Sean Mooney <smooney@redhat.com> wrote:
On 27/11/2025 15:06, Nell Jerram wrote:
> Have edited the subject for this subthread so as not to confuse with
> OP's query - hope that's helpful...
>
> On Thu, Nov 27, 2025 at 2:08 PM Sean Mooney <smooney@redhat.com> wrote:
> > On 27/11/2025 13:26, Nell Jerram wrote:
> > > Could enable_qemu_monitor_announce_self blocking be responsible
> > > for 12 _minutes_ of delay? That sounds huge!
> >
> > I don't see any other way that that config option could have an
> > effect and be responsible for the reported issue.
> >
> > It does not make sense that changing that value would actually
> > affect this at all.
> >
> > If that was a blocking call and it did not return, it may explain
> > the delay; otherwise my actual opinion is this is a coincidence.
> >
> > > Also, can I ask if this is _only_ a problem with the OpenStack
> > > status reporting (i.e. "openstack server migration list")? Or
> > > does it also affect the actual liveness of the migrated instance?
> >
> > If it's related to enable_qemu_monitor_announce_self it can't
> > affect the liveness of the instance, and it would only be a
> > reporting issue.
> >
> > I think this is much more likely to be related to this feature
> > request
> > https://bugs.launchpad.net/nova/+bug/2128665
> > https://blueprints.launchpad.net/nova/+spec/refine-network-setup-procedure-i...
> > and the comment thread we discussed
> > https://review.opendev.org/c/openstack/nova/+/966106/1/nova/virt/libvirt/hos...
> >
> > The tl;dr is there is a kernel bug
> > https://lore.kernel.org/all/20240626191830.3819324-1-yang@os.amperecomputing...
> > that is only fixed in 6.13, which can cause the source VM to take
> > minutes to stop as it is waiting for the kernel to deallocate the
> > memory. We do not actually mark the live migration as complete
> > until after that is complete.
> >
> > So I think that is why it is taking minutes for the status to go
> > to complete.
> >
> > > (Coincidentally, I am also currently investigating live
> > > migration. I'm seeing a problem where data transfer on an
> > > existing connection to the instance is held up for about 12
> > > seconds after the migration has completed.)
> >
> > I'm not sure, but maybe that is related to the kernel bug? libvirt
> > does have to do more than just transfer the data before it can
> > complete the migration or unpause the VM on the dest, but I don't
> > know the details well enough to say what that entails.
>
> Thanks Sean. To clarify/record a few details of my case:
>
> - I'm using the Calico Neutron driver, so any OVN details won't be
>   relevant here. Calico currently "handles" live migration by
>   deleting the route for the instance IP via the old node and
>   creating a route to the instance IP via the new node, at the point
>   where Neutron changes the port's "binding:host_id" to the new node.
> - Empirically, there's a window of about 1.5s between the old route
>   disappearing and the new route appearing, on the relevant
>   intermediate routers. During this window packets on the connection
>   get retransmitted; the window doesn't cause the connection to drop.
> - Immediately after the window I see packets routed through to the
>   instance (now on the new node) - but it then takes another 12
>   seconds before the instance starts responding to those.
>
> I think my next step is to research what the Neutron binding:host_id
> transition point corresponds to in Nova and libvirt terms, and then
> review if the situation correlates with the bug that you mentioned.
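(Aside on measuring that 1.5s window: a minimal way to timestamp the
route flip on an intermediate router is to poll the kernel's routing
decision for the instance IP and log any change. A rough sketch,
assuming a Linux router with iproute2; the instance IP below is a
placeholder:)

    #!/usr/bin/env python3
    # Rough sketch: poll the routing decision for the instance IP on an
    # intermediate router and timestamp any change, to measure the window
    # between the old route disappearing and the new route appearing
    # during live migration. Assumes Linux with iproute2.
    import subprocess
    import time

    INSTANCE_IP = "10.65.0.5"  # placeholder: the migrating instance's IP

    last = None
    while True:
        try:
            out = subprocess.run(
                ["ip", "route", "get", INSTANCE_IP],
                capture_output=True, text=True, check=True,
            ).stdout.strip()
        except subprocess.CalledProcessError as exc:
            # "ip route get" fails while no route exists - that is the
            # gap itself.
            out = "NO ROUTE: " + exc.stderr.strip()
        if out != last:
            print(f"{time.time():.3f} {out}")
            last = out
        time.sleep(0.05)

Running this on each relevant router during a migration gives
wall-clock timestamps for when the old route vanishes and the new one
appears.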
So I think we discussed this a bit when we were fixing the Calico
integration issues with Nova. The expected behavior is that in
pre-live-migration (while the VM is running on the source) Nova will
create a second port binding for the destination host. For most
backends like OVS, this is when the OVS port would be created on the
destination host. For Calico, this is when we should be creating the
tap device on the dest.

We generally refer to the creation of the L1 port on the network
backend (logically, or actually creating it in the case of a tap) as
port plugging. Port plugging happens after the inactive port binding is
created for the destination host and is bound by the Neutron ML2
driver. The expected behavior is that the Neutron backend will wire up
the logical port on the destination such that the VM can send packets
when it is created with that logical port by QEMU.

When libvirt creates the VM on the dest for live migration, it does so
in the paused state so that it can do the memory/disk copy. When the
migration is complete, right before the VM is unpaused on the dest,
QEMU sends 3 RARP packets to update the network with MAC learning
frames. Now for Calico, these broadcast frames are not required for
packet flow to work, but they are still sent. The port binding on the
destination host is only activated in post-live-migration.
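(Aside: the two per-host bindings described above can be observed
directly via Neutron's port bindings API while a migration is in
flight. A rough sketch, assuming the port bindings extension is
available; the endpoint, token and port id are placeholders:)

    # Rough sketch: list the per-host bindings Nova creates during live
    # migration via GET /v2.0/ports/{port_id}/bindings. Error handling
    # and auth are simplified; values below are placeholders.
    import requests

    NEUTRON_URL = "http://controller:9696"   # placeholder Neutron endpoint
    TOKEN = "<keystone-token>"                # placeholder auth token
    PORT_ID = "<instance-port-uuid>"          # placeholder port id

    resp = requests.get(
        f"{NEUTRON_URL}/v2.0/ports/{PORT_ID}/bindings",
        headers={"X-Auth-Token": TOKEN},
    )
    resp.raise_for_status()
    for binding in resp.json()["bindings"]:
        # During the migration you should see an INACTIVE binding for the
        # destination host alongside the ACTIVE one for the source; the
        # destination binding only flips to ACTIVE in post-live-migration.
        print(binding["host"], binding["status"], binding["vif_type"])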
https://bugs.launchpad.net/nova/+bug/2128665 describes how
https://lore.kernel.org/all/20240626191830.3819324-1-yang@os.amperecomputing...
can result in post-live-migration being delayed by tens of seconds
while Nova blocks on getting the result of the migration-complete job
from libvirt, due to libvirt waiting for the QEMU process to be
terminated by the kernel. If you are using post-copy live migration,
you can sidestep the kernel issue, as we will trigger
post-live-migration earlier and activate the port binding for the
destination sooner.

The 1.5 second route propagation is likely a combination of the time it
takes from when we activate the Neutron port binding to when that is
seen by Calico, plus the time for Calico to recompute the routes and
for the routes to propagate. I'm not sure how much we can do to
optimize that, as only a small subset of that time is actually in
Nova/Neutron.

The most important optimization to this workflow that we have done in
the last few years that would apply to Calico is
https://opendev.org/openstack/nova/commit/26fbc9e8e7d353e66739f910865d0b6498811bb0?style=split&whitespace=show-all&show-outdated=
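(Aside: post-copy is opt-in in nova.conf. A rough sketch of the
relevant [libvirt] options; availability and exact behaviour depend on
your release and on QEMU/libvirt post-copy support, so treat this as
illustrative rather than a recommendation:)

    [libvirt]
    # Allow the migration to be switched to post-copy; once switched, the
    # VM runs on the destination while the remaining memory pages are
    # pulled over, so post-live-migration (and destination port binding
    # activation) happens sooner.
    live_migration_permit_post_copy = True
    # Optionally force-complete (rather than abort) a migration that hits
    # the completion timeout.
    live_migration_timeout_action = force_complete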
Prior to that commit, we would not activate the destination port
binding until after we had cleaned up the Cinder block devices on the
source host. That could cause the VM to be running on the destination
for a number of seconds before Nova would activate the destination port
binding. This is especially true if there is a bug in your storage
vendor's SAN that causes that cleanup to take tens of seconds to
respond...

Regarding the 12 seconds for the VM to respond, that sounds like an
issue in the guest, not at the infra level. If you counted it from the
time the guest was unpaused, part of that could be the time it took to
activate the port binding, but if this is just from the point the route
was updated, that points to an issue in the guest.

I would recommend creating 2 VMs and having them both ping each other,
then live migrating one of them and seeing if that changes the behavior
or not. If you see a reduction in the downtime when you have pings
flowing in both directions, that generally implies there is some cached
state in the guest routing table or ARP table that is a factor, or in
your core network. The fact that you're seeing the packets reach the
tap on the destination, however, implies it's guest-side.

regards
sean
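(Aside on quantifying the gap during that two-VM test: the unresponsive
window as seen from outside the guest can be measured by timestamping
ping replies and reporting the largest gap between them. A rough
sketch, assuming a Linux ping that supports -D; the instance IP is a
placeholder:)

    #!/usr/bin/env python3
    # Rough sketch: ping the instance at 5 packets/sec across the live
    # migration and report the longest gap between successive replies,
    # i.e. the effective downtime as seen from outside the guest.
    import re
    import subprocess

    INSTANCE_IP = "10.65.0.5"   # placeholder: the migrating instance's IP
    DURATION = 120              # seconds to observe

    proc = subprocess.run(
        ["ping", "-D", "-i", "0.2", "-w", str(DURATION), INSTANCE_IP],
        capture_output=True, text=True,
    )
    # Reply lines look like: "[1732790000.123456] 64 bytes from ..."
    times = [float(m.group(1))
             for m in re.finditer(r"^\[(\d+\.\d+)\] \d+ bytes from",
                                  proc.stdout, re.MULTILINE)]
    gaps = [b - a for a, b in zip(times, times[1:])]
    if gaps:
        print(f"replies: {len(times)}, worst gap: {max(gaps):.2f}s")
    else:
        print("no replies seen")

Comparing the worst gap with and without the reverse-direction pings
running should show whether cached guest-side ARP/route state is the
dominant factor.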
On 28/11/2025 17:58, Nell Jerram wrote:
> Many thanks Sean. Right now I'm most worried about the 12s gap
> _after_ activation of port bindings on the destination. As you say,
> this may be a guest problem; I'm continuing to investigate. But it's
> useful to be aware also of possible concerns before that activation
> happens, and of work to move that earlier.
>
> We are also still planning to leverage the pre-live-migration point
> in Neutron in order to start setting up networking for the
> destination VM earlier in the process - does Nova create the TAP
> interface on the destination at that point?

We never got around to moving the handling for plug_tap to os-vif, so
it is using the legacy code path
https://github.com/openstack/nova/blob/23b462d77df1a1d09c43d0918bca853ef3af1...
and it is defined here
https://github.com/openstack/nova/blob/23b462d77df1a1d09c43d0918bca853ef3af1...
That is low hanging fruit if folks ever want to contribute to os-vif.

So in pre-live-migration we do _pre_live_migration_plug_vifs
https://github.com/openstack/nova/blob/23b462d77df1a1d09c43d0918bca853ef3af1...
which calls plug_vifs
https://github.com/openstack/nova/blob/23b462d77df1a1d09c43d0918bca853ef3af1...
which calls
https://github.com/openstack/nova/blob/23b462d77df1a1d09c43d0918bca853ef3af1...
which loops over all the network interfaces and calls plug, bringing us
back to where we started
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/vif.py#L717

So yes, for Calico, Nova creates the tap on the destination during
pre-live-migration.
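(Aside, for anyone tempted by that low-hanging fruit: a very rough
sketch of what an os-vif plugin covering the plain tap case might look
like. The class name and the vif_name attribute are illustrative, the
plug/unplug signature is assumed from os-vif's PluginBase, and a real
plugin would also need a describe() implementation plus an os_vif entry
point before it could be loaded:)

    # Illustrative only - not the real Nova/os-vif code.
    import subprocess

    from os_vif import plugin


    class TapPlugin(plugin.PluginBase):
        """Sketch of a plugin that just creates/deletes a tap device,
        which is all a backend like Calico needs at plug time."""

        def plug(self, vif, instance_info):
            dev = vif.vif_name  # assumption: the VIF carries the tap name
            # Create the tap up-front so it already exists when
            # pre-live-migration runs on the destination, before QEMU
            # attaches to it.
            subprocess.run(
                ["ip", "tuntap", "add", "mode", "tap", "name", dev],
                check=True)
            subprocess.run(["ip", "link", "set", dev, "up"], check=True)

        def unplug(self, vif, instance_info):
            dev = vif.vif_name
            subprocess.run(["ip", "link", "del", dev], check=True)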
> - but currently the gap _after_ activation is a bigger concern for us.