Hi Nell & Sean,

This is Zhan; I'm looking into this issue together with Chang. I'm providing an update with our findings here, but first:
For live migration from Jammy to Noble, Sean has covered that this isn't officially supported at the libvirt level.
I am a bit late to the party, so correct me if I'm wrong: I think Sean mentioned that only Jammy to Noble is supported, not the other way around, as otherwise it would be impossible to upgrade?
For live migration from {Jammy without patch} to {Jammy with patch}, does the added latency correlate reliably with the patch being present on the destination node?
Let's take a step back and set the patches aside for a moment. The additional latency will always exist when the domain's XML is updated during migration from an XML without `managed=no` to one with `managed=no`. If we don't update the XML during migration (i.e., the XML both before and after migration lacks `managed=no`, or both have it), there is no additional latency.
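For concreteness, a minimal sketch of the XML change we are talking about (the tap device name is made up for illustration); with `managed='no'` on the `<target>` element, libvirt treats the tap device as pre-created and leaves it alone:

```xml
<!-- Before: libvirt manages the tap device itself -->
<interface type='ethernet'>
  <target dev='tap1234'/>
</interface>

<!-- After: managed='no' tells libvirt the tap device is
     pre-created by someone else (here, Nova) and libvirt
     should not create or configure it -->
<interface type='ethernet'>
  <target dev='tap1234' managed='no'/>
</interface>
```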
The latency is a result of Calico. Calico does not support wiring up the destination for networking until after the port binding is activated, which always happens after the VM is running.
I want to clarify what we mean by the increase in latency: the normal latency is something like ~4s, and the increased latency is something like ~30s. We are fully aware that Calico needs improvements here, hence I submitted the spec for refining the live-migration network update process in 2026.2, and as Nell mentioned, they are working on improving this too. IMO, we are approaching this problem from two angles :D. We do believe that the increase in latency (from ~4s to ~30s) is related to the `managed=no` patch. Please find our findings below.
but I don't think any of that relates to the Nova patch for libvirt.
After our investigation, this is actually related. Please take a look at libvirt's code when `managed=yes` [0] (the default). Without the patch introduced in libvirt v9.5.0 [1], `virNetDevTapCreate` does not error out when the tap interface already exists and `managed=yes`, and the rest of the function still runs (i.e., setting the MAC address for the interfaces on the hypervisor and VM sides, and bringing the device online). With `managed=no`, libvirt no longer does this.

On the Nova side, Nova actually does what libvirt does when the tap interface is created [2] (i.e., setting the MAC address and bringing the interface up). However, the MAC address that the function gets comes from the vif's port binding's "mac_address" field [3], which is hard-coded when using networking-calico, the Neutron Calico driver [4], and this is wrong. What happened before (with `managed=yes`) is that even though the MAC address was wrongly updated, libvirt would rewrite it later when Nova called the migration API, so we never hit this issue. What happens now when we migrate the VM with the updated `managed=no` XML is:

1. When the VM is first created without `managed=no`, libvirt sets the MAC address of the tap interface to `fe:xx:xx:xx:xx:xx` on the hypervisor side and `fa:xx:xx:xx:xx:xx` on the VM side. The VM learns this, and it's pingable.

2. When the VM is migrated with the updated XML (including `managed=no`), libvirt does NOT overwrite the tap interface's MAC address. So when Nova creates the tap interface and sets its hypervisor-side MAC address to the hard-coded value (i.e., `00:61:fe:ed:ca:fe`), this change persists after the live migration.

3. When the VM is resumed, it doesn't know that the hypervisor-side interface MAC address has changed, and it keeps sending packets to the old MAC address (i.e., `fe:xx:xx:xx:xx:xx`). The hypervisor sees no matching MAC address and drops the packets.
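To make the `fe:`/`fa:` relationship concrete, here is a tiny sketch (my own illustration, not Nova or libvirt code; the helper name is made up) of how libvirt derives the hypervisor-side tap MAC from the guest-side MAC when `managed=yes`: it reuses the guest MAC but forces the first octet to `0xfe`, so the two sides stay in a fixed correspondence.

```python
def host_tap_mac(guest_mac: str) -> str:
    """Derive the hypervisor-side tap MAC from the guest-side MAC by
    forcing the first octet to 0xfe, mirroring what libvirt does with
    managed=yes (illustrative helper, not an actual Nova function)."""
    octets = guest_mac.lower().split(":")
    octets[0] = "fe"
    return ":".join(octets)

print(host_tap_mac("fa:16:3e:12:34:56"))  # fe:16:3e:12:34:56
```

Keeping this mapping intact on both sides across the migration is what avoids the MAC mismatch described in steps 1-3 above.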
Running `tcpdump`, we were able to see that the VM is answering ping packets, but the replies never leave the hypervisor.

4. At some point later, the VM sends an ARP request for the new MAC address, and it becomes pingable again once it gets the answer.

So the key point is to make sure that the MAC addresses on both the VM side and the hypervisor side are the same before and after live migration. This is also why we don't see a latency increase when migrating with `managed=no` present in both the before and after XML: the MAC addresses are the same.

I came up with a small patch in Nova to "fix" this by reading the port's MAC address and doing the reverse of what libvirt does (i.e., knowing the port's MAC address is `fa:xx:xx:xx:xx:xx`, I set the tap interface's MAC address to `fe:xx:xx:xx:xx:xx` when creating the tap interface [3]).

But before I file a bug report, Sean, I would like to understand:

1. Given that we now assume `managed=no`, should Nova take responsibility for setting the MAC address correctly?

2. If so, we will need to re-evaluate everything that libvirt used to do with `managed=yes` and make sure Nova now does it, since we changed the default to `managed=no`.

3. Should Nova read the MAC address from `vif['address']` instead of `vif['details'].get(network_model.VIF_DETAILS_TAP_MAC_ADDRESS)`?

And for Nell:

1. Should networking-calico be modified so that the `vif['details'].get(network_model.VIF_DETAILS_TAP_MAC_ADDRESS)` field reflects the actual MAC address of the port?

Thank you for helping out!

Best regards,
Zhan Zhang

[0]: https://gitlab.com/libvirt/libvirt/-/blob/v10.0.0/src/qemu/qemu_interface.c#...
[1]: https://github.com/libvirt/libvirt/commit/a2ae3d299cf
[2]: https://opendev.org/openstack/nova/src/branch/stable/2025.2/nova/privsep/lin...
[3]: https://opendev.org/openstack/nova/src/branch/stable/2025.2/nova/virt/libvir...
[4]: https://github.com/projectcalico/calico/blob/v3.31.3/networking-calico/netwo...