[Nova] Live migration failed / VM has latency with patch to add managed='no' flag to libvirt XML
Hi,

We are upgrading from Jammy to Noble, and we need to include the patch https://review.opendev.org/c/openstack/nova/+/967570?tab=comments so that VMs can be created on Noble nodes. However, none of our existing VMs that were created without the patch can be live migrated to Noble, and they show latency when live migrating to Jammy WITH the patch.

To be clear: old VMs were created on Jammy -> the patch is applied on both Jammy and Noble -> the VMs can now be live migrated to Jammy, but with latency (the VMs only become pingable almost 10 seconds after the live migration completes); the VMs cannot be live migrated directly to Noble nodes. After a live migration to Jammy with that latency, we can live migrate the VMs between Jammy nodes, and one-way to Noble nodes, without any latency.

We would like to understand this better: is there any way to get rid of the latency issue? I tried patching nova/virt/libvirt/migration.py to add the managed='no' flag on the fly during migration, but that only allows me to live migrate the old VMs directly to Noble nodes; the latency still exists.

Thanks,
Chang
Hi,
We are upgrading from Jammy to Noble, and we need to include the patch https://review.opendev.org/c/openstack/nova/+/967570?tab=comments so that VMs can be created on Noble nodes.
However, all of our existing VMs that were created without the patch cannot be live migrated to Noble, and they show latency when live migrating to Jammy WITH the patch.

Right, so what you are encountering is the fact that the libvirt behaviour changed between Jammy and Noble. In the Jammy version, the default behaviour for the ethernet interface type was to not recreate the tap device if it already exists; on Noble, the default behaviour is to recreate it. Nova did not specify the managed flag in older releases, so VM migration was broken for Calico on upgrade due to that change in libvirt behaviour.
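For illustration only, here is a minimal sketch of what adding the managed='no' attribute to the ethernet interfaces in a domain XML could look like, along the lines of what Chang describes trying in nova/virt/libvirt/migration.py. The libvirt element and attribute names (`<interface type='ethernet'>`, `<target managed='no'/>`) follow the libvirt domain XML format; the helper function itself is a made-up assumption, not the actual patch under review:

```python
# Hypothetical illustration: add managed='no' to every ethernet
# interface <target> element in a libvirt domain XML string. This
# mirrors the idea discussed in the thread, not the real Nova code.
import xml.etree.ElementTree as ET


def add_managed_no(domain_xml: str) -> str:
    root = ET.fromstring(domain_xml)
    for target in root.findall("./devices/interface[@type='ethernet']/target"):
        # managed='no' tells libvirt not to create/delete or reconfigure
        # the tap device itself; Nova/os-vif owns the device instead.
        target.set("managed", "no")
    return ET.tostring(root, encoding="unicode")


example = """
<domain type='kvm'>
  <devices>
    <interface type='ethernet'>
      <target dev='tap123abc'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>
"""

print(add_managed_no(example))
# The <target> element now carries managed="no".
```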
To be clear: old VMs were created on Jammy -> the patch is applied on both Jammy and Noble -> the VMs can now be live migrated to Jammy, but with latency (the VMs only become pingable almost 10 seconds after the live migration completes); the VMs cannot be live migrated directly to Noble nodes. After a live migration to Jammy with that latency, we can live migrate the VMs between Jammy nodes, and one-way to Noble nodes, without any latency.

Nova/libvirt/QEMU does not support moving a VM from a newer qemu/libvirt to an older one.
That is why you can only move one way, from Jammy to Noble, in Nova. Even though the live migration can sometimes succeed at the QEMU level, they do not officially support going from a newer QEMU to an older one, so we block that by default in Nova.
We would like to understand this better: is there any way to get rid of the latency issue? I tried patching nova/virt/libvirt/migration.py to add the managed='no' flag on the fly during migration, but that only allows me to live migrate the old VMs directly to Noble nodes; the latency still exists.
The latency is a result of Calico. Calico does not support wiring up the destination for networking until after the port binding is activated, which always happens after the VM is running. When it is activated, there is a period of downtime while the BGP route updates propagate; I believe this is currently being worked on in the Calico plugin.

The API contract that is expected of a Neutron backend is that, when we do the port plug in pre-live-migration, the network backend should only send the network-vif-plugged event when the destination port is ready to send and receive packets. ml2/ovs, and ml2/ovn to a lesser degree, implement this effectively. For ml2/ovn there is special handling to intercept the RARP packet that QEMU sends before the guest is unpaused on the destination and to activate the OpenFlow rules early, before the port binding is activated on the destination. Note that it was meant to have installed those flows before the migration started, so this is late, but still earlier than Calico. Both Calico and OVN currently send their network-vif-plugged event before the network is actually plugged, due to limitations in how the SDN controllers work today.

For Calico, I don't know if it is possible in BGP to advertise a secondary route (or similar) for the IP on the destination host and make it the primary route faster, or whether BGP propagation delay will always be a bottleneck, but Calico could implement the same handling of the RARP packet to trigger the switch-over that OVN does. This is more of a Neutron/Calico question than a Nova one.
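To make the ordering in that contract concrete, here is a toy sketch with entirely hypothetical function names (this is not the Nova or Neutron API): the destination dataplane is wired up first, and only then is network-vif-plugged emitted so Nova knows the guest can be unpaused there.

```python
# Toy illustration of the expected plug/notify ordering for a Neutron
# backend during pre-live-migration. All names are hypothetical.

def wire_up_destination_dataplane(port_id: str) -> None:
    """Hypothetical: create the tap/flows/routes on the destination host
    so the port can actually send and receive packets."""
    print(f"dataplane ready for port {port_id}")


def send_network_vif_plugged(port_id: str) -> None:
    """Hypothetical: notify Nova (via Neutron) that the port is usable."""
    print(f"network-vif-plugged emitted for port {port_id}")


def pre_live_migration_plug(port_id: str) -> None:
    # Expected contract: only emit network-vif-plugged once the
    # destination port is genuinely ready. Backends that emit the event
    # before the dataplane is wired up are where the post-migration
    # downtime comes from.
    wire_up_destination_dataplane(port_id)
    send_network_vif_plugged(port_id)


pre_live_migration_plug("11111111-2222-3333-4444-555555555555")
```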
Thanks, Chang
Hi Chang, Calico engineer here. Thanks Sean for your analysis so far, which looks right to me.
However, all of our existing VMs that were created without the patch cannot be live migrated to Noble, and they show latency when live migrating to Jammy WITH the patch.
Can I just check that you are keeping the version of OpenStack the same, apart from that small patch?

For live migration from Jammy to Noble, Sean has covered that that isn't officially supported at the libvirt level.

For live migration from {Jammy without patch} to {Jammy with patch}, does the added latency correlate reliably with the patch being present on the destination node?

As Sean said, we are currently doing a lot of work to improve the performance of the Calico plugin for live migration, and in particular to minimize latency - but I don't think any of that relates to the Nova patch for libvirt. Is it possible that you randomly saw some outlier cases with high latency, in which the patch was present, but it might not have been the patch causing the problem?

Best wishes - Nell
Hi Nell & Sean, This is Zhan, and I'm taking a look at the issue with Chang. I’m providing an update with our findings here, but first:
For live migration from Jammy to Noble, Sean has covered that that isn't officially supported at the libvirt level.
I am a bit late to the party, so correct me if I'm wrong: I think Sean mentioned that only Jammy to Noble is supported, not the other way around, since otherwise upgrading would be impossible?
For live migration from {Jammy without patch} to {Jammy with patch}, does the added latency correlate reliably with the patch being present on the destination node?
Let's take a step back and not worry about the patches for a moment. The additional latency will always exist when the domain's XML is updated during migration from without `managed=no` to with `managed=no`. If we don't update the XML during migration (i.e., the XML before & after migration both lack `managed=no`, or the XML before & after migration both have `managed=no`), there is no additional latency.
The latency is a result of Calico. Calico does not support wiring up the destination for networking until after the port binding is activated, which always happens after the VM is running.
I want to clarify here what we mean by the increase in latency: the normal latency is something like ~4s, and the increased latency is something like ~30s. We are fully aware that Calico needs improvements here, hence I submitted the spec for refining the live migration network update process in 2026.2, and as Nell mentioned, they are working on improving this too. IMO we are approaching this problem from two angles :D. We do believe that the increase in latency (from ~4s to ~30s) is related to the `managed=no` patch. Please find our findings below.
but I don't think any of that relates to the Nova patch for libvirt.
After our investigation, this is actually related. Please take a look at libvirt's code when `managed=yes` [0] (which is the default). Without the patch introduced in libvirt v9.5.0 [1], `virNetDevTapCreate` will not error out when the tap interface already exists and `managed=yes`, and the code after it will still be executed (i.e., setting the mac address for the interfaces on the hypervisor & VM sides, and bringing the device online). By specifying `managed=no`, libvirt no longer does this.

On the Nova side, Nova actually does what libvirt does when the tap interface is created [2] (i.e., setting the mac address + bringing the interface up). However, the mac address that the function gets comes from the vif's port binding's "mac_address" field [3], which is hard-coded when using networking-calico (the Neutron Calico driver), and this is wrong. What happened before (with `managed=yes`) is that even though the mac address was wrongly updated, libvirt would rewrite it later when Nova called the migration API, so we never hit this issue.

What happens now, when we migrate the VM with the updated `managed=no` XML, is:

1. When the VM is first created without `managed=no`, libvirt sets the mac address of the tap interface to `fe:xx:xx:xx:xx:xx` on the hypervisor side and `fa:xx:xx:xx:xx:xx` on the VM side. The VM learns this, and it is pingable.
2. When the VM is migrated with the updated XML (including `managed=no`), libvirt will NOT overwrite the mac address of the tap interface. Thus, when Nova creates the tap interface and sets its hypervisor-side mac address to the hard-coded mac address (i.e., `00:61:fe:ed:ca:fe`), that change persists after the live migration.
3. When the VM is resumed, it doesn't know that the hypervisor-side interface mac address has changed, and it keeps sending packets to the old mac address (i.e., `fe:xx:xx:xx:xx:xx`). The hypervisor sees that there is no matching mac address and drops the packets. Running `tcpdump`, we were able to see that the VM is answering ping packets, but the replies never leave the hypervisor.
4. At some point later, the VM sends an ARP request to learn the new mac address, and it becomes pingable once it gets the answer.

So the key point here is to make sure that the mac addresses on both the VM and the hypervisor side stay the same before & after live migration. This is also why we don't see a latency increase when migrating with the `managed=no` flag present in both the before & after XML - the mac addresses are the same.

I came up with a small patch in Nova to "fix" this by reading the port's mac address and doing the reverse of what libvirt does (i.e., since I know the port's mac address is `fa:xx:xx:xx:xx:xx`, I set the tap interface's mac address to `fe:xx:xx:xx:xx:xx` when creating the tap interface [3]); see the sketch after my questions below. But before I file a bug report, Sean, I would like to understand:

1. Given that we now assume `managed=no`, should Nova take the responsibility of setting the mac address correctly?
2. If so, we will need to re-evaluate everything that used to be done by libvirt with `managed=yes` and make sure Nova does it, since we changed the default to `managed=no` now.
3. Should Nova read the mac address from `vif['address']`, instead of `vif['details'].get(network_model.VIF_DETAILS_TAP_MAC_ADDRESS)`?

And for Nell:

1. Should networking-calico be modified so that the `vif['details'].get(network_model.VIF_DETAILS_TAP_MAC_ADDRESS)` field reflects the actual mac address of the port?
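For what it's worth, here is a minimal sketch of the mac-address derivation described above: deriving the `fe:`-prefixed hypervisor-side address from the port's `fa:`-prefixed address and applying it when the tap device is created. The helper names, the device name, and the use of `ip link` are illustrative assumptions, not the actual Nova patch:

```python
# Illustrative sketch only; not the actual Nova patch. It derives the
# hypervisor-side tap mac address from the port's mac address by
# replacing the first octet with "fe" (mirroring what libvirt does when
# it manages the tap device), then applies it with `ip link`.
import subprocess


def hypervisor_side_mac(port_mac: str) -> str:
    """E.g. 'fa:16:3e:aa:bb:cc' -> 'fe:16:3e:aa:bb:cc'."""
    octets = port_mac.split(":")
    octets[0] = "fe"
    return ":".join(octets)


def set_tap_mac(dev: str, port_mac: str) -> None:
    # In Nova this would go through the privsep helpers referenced in
    # [2]/[3]; plain `ip link` is used here only to keep the sketch
    # self-contained.
    subprocess.run(
        ["ip", "link", "set", "dev", dev, "address", hypervisor_side_mac(port_mac)],
        check=True,
    )


# Example usage (device name and mac address are made up):
# set_tap_mac("tap123abc", "fa:16:3e:aa:bb:cc")
```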
Thank you for helping out!

Best regards,
Zhan Zhang

[0]: https://gitlab.com/libvirt/libvirt/-/blob/v10.0.0/src/qemu/qemu_interface.c#...
[1]: https://github.com/libvirt/libvirt/commit/a2ae3d299cf
[2]: https://opendev.org/openstack/nova/src/branch/stable/2025.2/nova/privsep/lin...
[3]: https://opendev.org/openstack/nova/src/branch/stable/2025.2/nova/virt/libvir...
[4]: https://github.com/projectcalico/calico/blob/v3.31.3/networking-calico/netwo...
participants (4)
- Chang Xue
- Nell Jerram
- Sean Mooney
- Zhan Zhang