[Nova] Live migration failed / VM has latency with patch to add managed='no' flag to libvirt XML
Hi,

We are upgrading from Jammy to Noble, and we need to include the patch https://review.opendev.org/c/openstack/nova/+/967570?tab=comments so that VMs can be created on Noble nodes. However, none of our existing VMs that were created without the patch can be live migrated to Noble, and they show latency when live migrating to Jammy WITH the patch.

To be clear: old VMs were created on Jammy -> the patch is applied on both Jammy and Noble -> the VMs can now be live migrated to Jammy, but with latency (the VMs only become pingable almost 10 seconds after the live migration completes); the VMs cannot be live migrated directly to Noble nodes. After a live migration to Jammy with that latency, we can live migrate the VMs between Jammy nodes, and one-way to Noble nodes, without any latency.

We would like to understand this better: is there any way to get rid of the latency issue? I tried patching nova/virt/libvirt/migration.py to add the managed='no' flag on the fly during migration, but that only allows me to live migrate the old VMs directly to Noble nodes; the latency still exists.

Thanks,
Chang
Hi,
We are upgrading from Jammy to Noble, and we need to include the patch https://review.opendev.org/c/openstack/nova/+/967570?tab=comments so that VMs can be created on Noble nodes.
However, all of our existing VMs that were created without the patch cannot be live migrated to Noble, and they show latency when live migrating to Jammy WITH the patch.

Right, so what you are encountering is the fact that the libvirt behaviour changed between Jammy and Noble. In the Jammy version, the default behaviour for the ethernet interface type was to not recreate the tap device if it already exists; on Noble, the default behaviour is to recreate it. Nova did not specify the managed flag in older releases, so VM migration was broken for Calico on upgrade due to that change in libvirt behaviour.
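For illustration only, here is a minimal sketch of what adding the managed='no' attribute to the ethernet interfaces in a domain XML could look like, along the lines of what Chang describes trying in nova/virt/libvirt/migration.py. The libvirt element and attribute names (`<interface type='ethernet'>`, `<target managed='no'/>`) follow the libvirt domain XML format; the helper function itself is a made-up assumption, not the actual patch under review:

```python
# Hypothetical illustration: add managed='no' to every ethernet
# interface <target> element in a libvirt domain XML string. This
# mirrors the idea discussed in the thread, not the real Nova code.
import xml.etree.ElementTree as ET


def add_managed_no(domain_xml: str) -> str:
    root = ET.fromstring(domain_xml)
    for target in root.findall("./devices/interface[@type='ethernet']/target"):
        # managed='no' tells libvirt not to create/delete or reconfigure
        # the tap device itself; Nova/os-vif owns the device instead.
        target.set("managed", "no")
    return ET.tostring(root, encoding="unicode")


example = """
<domain type='kvm'>
  <devices>
    <interface type='ethernet'>
      <target dev='tap123abc'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>
"""

print(add_managed_no(example))
# The <target> element now carries managed="no".
```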
To be clear: old VMs were created on Jammy -> the patch is applied on both Jammy and Noble -> the VMs can now be live migrated to Jammy, but with latency (the VMs only become pingable almost 10 seconds after the live migration completes); the VMs cannot be live migrated directly to Noble nodes. After a live migration to Jammy with that latency, we can live migrate the VMs between Jammy nodes, and one-way to Noble nodes, without any latency.

Nova/libvirt/QEMU does not support moving a VM from a newer qemu/libvirt to an older one.
That is why you can only move one way, from Jammy to Noble, in Nova. Even though the live migration can sometimes succeed at the QEMU level, they do not officially support going from a newer QEMU to an older one, so we block that by default in Nova.
We would like to understand this better: is there any way to get rid of the latency issue? I tried patching nova/virt/libvirt/migration.py to add the managed='no' flag on the fly during migration, but that only allows me to live migrate the old VMs directly to Noble nodes; the latency still exists.
The latency is a result of Calico. Calico does not support wiring up the destination for networking until after the port binding is activated, which always happens after the VM is running. When it is activated, there is a period of downtime while the BGP route updates propagate; I believe this is currently being worked on in the Calico plugin.

The API contract that is expected of a Neutron backend is that, when we do the port plug in pre-live-migration, the network backend should only send the network-vif-plugged event when the destination port is ready to send and receive packets. ml2/ovs, and ml2/ovn to a lesser degree, implement this effectively. For ml2/ovn there is special handling to intercept the RARP packet that QEMU sends before the guest is unpaused on the destination and to activate the OpenFlow rules early, before the port binding is activated on the destination. Note that it was meant to have installed those flows before the migration started, so this is late, but still earlier than Calico. Both Calico and OVN currently send their network-vif-plugged event before the network is actually plugged, due to limitations in how the SDN controllers work today.

For Calico, I don't know if it is possible in BGP to advertise a secondary route (or similar) for the IP on the destination host and make it the primary route faster, or whether BGP propagation delay will always be a bottleneck, but Calico could implement the same handling of the RARP packet to trigger the switch-over that OVN does. This is more of a Neutron/Calico question than a Nova one.
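To make the ordering in that contract concrete, here is a toy sketch with entirely hypothetical function names (this is not the Nova or Neutron API): the destination dataplane is wired up first, and only then is network-vif-plugged emitted so Nova knows the guest can be unpaused there.

```python
# Toy illustration of the expected plug/notify ordering for a Neutron
# backend during pre-live-migration. All names are hypothetical.

def wire_up_destination_dataplane(port_id: str) -> None:
    """Hypothetical: create the tap/flows/routes on the destination host
    so the port can actually send and receive packets."""
    print(f"dataplane ready for port {port_id}")


def send_network_vif_plugged(port_id: str) -> None:
    """Hypothetical: notify Nova (via Neutron) that the port is usable."""
    print(f"network-vif-plugged emitted for port {port_id}")


def pre_live_migration_plug(port_id: str) -> None:
    # Expected contract: only emit network-vif-plugged once the
    # destination port is genuinely ready. Backends that emit the event
    # before the dataplane is wired up are where the post-migration
    # downtime comes from.
    wire_up_destination_dataplane(port_id)
    send_network_vif_plugged(port_id)


pre_live_migration_plug("11111111-2222-3333-4444-555555555555")
```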
Thanks, Chang
Hi Chang, Calico engineer here. Thanks Sean for your analysis so far, which looks right to me.
However, all of our existing VMs that were created without the patch cannot be live migrated to Noble, and they show latency when live migrating to Jammy WITH the patch.
Can I just check that you are keeping the version of OpenStack the same, apart from that small patch?

For live migration from Jammy to Noble, Sean has covered that that isn't officially supported at the libvirt level.

For live migration from {Jammy without patch} to {Jammy with patch}, does the added latency correlate reliably with the patch being present on the destination node?

As Sean said, we are currently doing a lot of work to improve the performance of the Calico plugin for live migration, and in particular to minimize latency - but I don't think any of that relates to the Nova patch for libvirt. Is it possible that you randomly saw some outlier cases with high latency, in which the patch was present, but it might not have been the patch causing the problem?

Best wishes - Nell
Hi Nell & Sean, This is Zhan, and I'm taking a look at the issue with Chang. I’m providing an update with our findings here, but first:
For live migration from Jammy to Noble, Sean has covered that that isn't officially supported at the libvirt level.
I am a bit late to the party, so correct me if I'm wrong: I think Sean mentioned that only Jammy to Noble is supported, not the other way around, since otherwise upgrading would be impossible?
For live migration from {Jammy without patch} to {Jammy with patch}, does the added latency correlate reliably with the patch being present on the destination node?
Let's take a step back and not worry about the patches for a moment. The additional latency will always exist when the domain's XML is updated during migration from without `managed=no` to with `managed=no`. If we don't update the XML during migration (i.e., the XML before & after migration both lack `managed=no`, or the XML before & after migration both have `managed=no`), there is no additional latency.
The latency is a result of Calico. Calico does not support wiring up the destination for networking until after the port binding is activated, which always happens after the VM is running.
I want to clarify here what we mean by the increase in latency: the normal latency is something like ~4s, and the increased latency is something like ~30s. We are fully aware that Calico needs improvements here, hence I submitted the spec for refining the live migration network update process in 2026.2, and as Nell mentioned, they are working on improving this too. IMO we are approaching this problem from two angles :D. We do believe that the increase in latency (from ~4s to ~30s) is related to the `managed=no` patch. Please find our findings below.
but I don't think any of that relates to the Nova patch for libvirt.
After our investigation, this is actually related. Please take a look at libvirt's code when `managed=yes` [0] (which is the default). Without the patch introduced in libvirt v9.5.0 [1], `virNetDevTapCreate` will not error out when the tap interface already exists and `managed=yes`, and the code after it will still be executed (i.e., setting the mac address for the interfaces on the hypervisor & VM sides, and bringing the device online). By specifying `managed=no`, libvirt no longer does this.

On the Nova side, Nova actually does what libvirt does when the tap interface is created [2] (i.e., setting the mac address + bringing the interface up). However, the mac address that the function gets comes from the vif's port binding's "mac_address" field [3], which is hard-coded when using networking-calico (the Neutron Calico driver), and this is wrong. What happened before (with `managed=yes`) is that even though the mac address was wrongly updated, libvirt would rewrite it later when Nova called the migration API, so we never hit this issue.

What happens now, when we migrate the VM with the updated `managed=no` XML, is:

1. When the VM is first created without `managed=no`, libvirt sets the mac address of the tap interface to `fe:xx:xx:xx:xx:xx` on the hypervisor side and `fa:xx:xx:xx:xx:xx` on the VM side. The VM learns this, and it is pingable.
2. When the VM is migrated with the updated XML (including `managed=no`), libvirt will NOT overwrite the mac address of the tap interface. Thus, when Nova creates the tap interface and sets its hypervisor-side mac address to the hard-coded mac address (i.e., `00:61:fe:ed:ca:fe`), that change persists after the live migration.
3. When the VM is resumed, it doesn't know that the hypervisor-side interface mac address has changed, and it keeps sending packets to the old mac address (i.e., `fe:xx:xx:xx:xx:xx`). The hypervisor sees that there is no matching mac address and drops the packets. Running `tcpdump`, we were able to see that the VM is answering ping packets, but the replies never leave the hypervisor.
4. At some point later, the VM sends an ARP request to learn the new mac address, and it becomes pingable once it gets the answer.

So the key point here is to make sure that the mac addresses on both the VM and the hypervisor side stay the same before & after live migration. This is also why we don't see a latency increase when migrating with the `managed=no` flag present in both the before & after XML - the mac addresses are the same.

I came up with a small patch in Nova to "fix" this by reading the port's mac address and doing the reverse of what libvirt does (i.e., since I know the port's mac address is `fa:xx:xx:xx:xx:xx`, I set the tap interface's mac address to `fe:xx:xx:xx:xx:xx` when creating the tap interface [3]); see the sketch after my questions below. But before I file a bug report, Sean, I would like to understand:

1. Given that we now assume `managed=no`, should Nova take the responsibility of setting the mac address correctly?
2. If so, we will need to re-evaluate everything that used to be done by libvirt with `managed=yes` and make sure Nova does it, since we changed the default to `managed=no` now.
3. Should Nova read the mac address from `vif['address']`, instead of `vif['details'].get(network_model.VIF_DETAILS_TAP_MAC_ADDRESS)`?

And for Nell:

1. Should networking-calico be modified so that the `vif['details'].get(network_model.VIF_DETAILS_TAP_MAC_ADDRESS)` field reflects the actual mac address of the port?
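For what it's worth, here is a minimal sketch of the mac-address derivation described above: deriving the `fe:`-prefixed hypervisor-side address from the port's `fa:`-prefixed address and applying it when the tap device is created. The helper names, the device name, and the use of `ip link` are illustrative assumptions, not the actual Nova patch:

```python
# Illustrative sketch only; not the actual Nova patch. It derives the
# hypervisor-side tap mac address from the port's mac address by
# replacing the first octet with "fe" (mirroring what libvirt does when
# it manages the tap device), then applies it with `ip link`.
import subprocess


def hypervisor_side_mac(port_mac: str) -> str:
    """E.g. 'fa:16:3e:aa:bb:cc' -> 'fe:16:3e:aa:bb:cc'."""
    octets = port_mac.split(":")
    octets[0] = "fe"
    return ":".join(octets)


def set_tap_mac(dev: str, port_mac: str) -> None:
    # In Nova this would go through the privsep helpers referenced in
    # [2]/[3]; plain `ip link` is used here only to keep the sketch
    # self-contained.
    subprocess.run(
        ["ip", "link", "set", "dev", dev, "address", hypervisor_side_mac(port_mac)],
        check=True,
    )


# Example usage (device name and mac address are made up):
# set_tap_mac("tap123abc", "fa:16:3e:aa:bb:cc")
```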
Thank you for helping out!

Best regards,
Zhan Zhang

[0]: https://gitlab.com/libvirt/libvirt/-/blob/v10.0.0/src/qemu/qemu_interface.c#...
[1]: https://github.com/libvirt/libvirt/commit/a2ae3d299cf
[2]: https://opendev.org/openstack/nova/src/branch/stable/2025.2/nova/privsep/lin...
[3]: https://opendev.org/openstack/nova/src/branch/stable/2025.2/nova/virt/libvir...
[4]: https://github.com/projectcalico/calico/blob/v3.31.3/networking-calico/netwo...
participants (4)
- Chang Xue
- Nell Jerram
- Sean Mooney
- Zhan Zhang