[openstack-dev] [Nova][Neutron] [Live Migration] Prevent invalid live migration instead of failing and setting instance to error state after portbinding failed

Andreas Scheuring scheuran at linux.vnet.ibm.com
Tue Apr 12 10:05:16 UTC 2016

Hi all,
I would like to start a discussion about a live migration problem that currently exists in the Nova-Neutron communication.

Basics: Live Migration and Nova-Neutron communication
At a high level, Nova live migration happens in 3 stages (--> marks what happens from a network perspective):
#1 pre_live_migration
   --> libvirtdriver: nova plugs the interface (for ovs hybrid sets up the linuxbridge + veth and connects it to br-int)
#2 live_migration_operation
   --> instance is being migrated (using libvirt with the domain.xml that is currently active on the migration source)
#3 post_live_migration
   --> binding:host_id is being updated for the port
   --> libvirtdriver: domain.xml is being regenerated  
More details can be found here [1]

The problem - portbinding fails
With this flow, ML2 portbinding is triggered in post_live_migration. At this point, the instance has already been migrated and is active on the migration destination.
Part of the portbinding happens in the mechanism drivers, where the vif information for the port (vif-type, vif-details, ...) is updated.
If this portbinding fails, the port gets the binding:vif_type "binding_failed".
After that, the nova libvirt driver regenerates the domain xml in order to persist it. Part of this step is generating the interface definition.
This fails because the vif_type is "binding_failed", and Nova sets the instance to error state. --> There is no rollback, as it's already too late!

Just a remark: There is no explicit check for the vif_type binding_failed. I have the feeling that it (luckily) fails by accident when generating the xml.
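
To illustrate what an explicit check could look like, here is a minimal sketch. It is not Nova's actual code: the helper name, the exception class and the plain-dict port are assumptions made for illustration; only the attribute names binding:vif_type / binding:host_id and the value "binding_failed" come from the flow described above.

```python
# Hypothetical sketch, not Nova's real code path.
VIF_TYPE_BINDING_FAILED = "binding_failed"


class PortBindingFailed(Exception):
    """Raised when a port carries a failed binding (illustrative class)."""


def ensure_port_bound(port):
    """Fail explicitly if the port's binding failed.

    `port` is assumed to be a plain dict shaped like a Neutron port.
    """
    if port.get("binding:vif_type") == VIF_TYPE_BINDING_FAILED:
        raise PortBindingFailed(
            "binding failed for port %s on host %s"
            % (port.get("id"), port.get("binding:host_id")))
    return port
```

With a check like this, the failure would be reported for what it is instead of surfacing indirectly during xml generation.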

--> Ideally we would trigger the portbinding before the migration starts, in pre_live_migration. Then, if binding fails, we could abort the migration before it even starts. The instance would still be
active and fully functional on the source host. I have a WIP patchset out proposing this change [2].
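
The idea can be sketched roughly as follows. This is only an illustration under assumptions, not the code of patchset [2]: bind_port and migrate_guest are hypothetical caller-supplied callables standing in for the Neutron port update and the libvirt migration, and rebinding the port to the source on failure is my assumption about what an abort would need to do.

```python
# Hypothetical sketch of "bind first, migrate afterwards".
class MigrationAborted(Exception):
    """Raised when the migration is refused before it starts."""


def live_migrate(port, dest_host, bind_port, migrate_guest):
    """Bind the port to the destination before moving the guest.

    bind_port(port, host) is assumed to return the updated port dict;
    migrate_guest(host) is assumed to perform the actual migration.
    """
    source_host = port["binding:host_id"]

    # pre_live_migration: trigger the portbinding for the destination.
    bound = bind_port(port, dest_host)
    if bound["binding:vif_type"] == "binding_failed":
        # Assumption: restore the binding to the source host on abort.
        bind_port(port, source_host)
        raise MigrationAborted(
            "port %s cannot be bound on %s" % (port["id"], dest_host))

    # live_migration_operation: only reached with a valid binding.
    migrate_guest(dest_host)
    return bound
```

If the binding fails, migrate_guest is never called, so the guest keeps running on the source.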

The impact
Patchset [2] proposes updating the host_id already in pre_live_migration.
During migration, the port would already be owned by the migration target (although the guest is still active on the source).
Technically this works fine for all the reference implementations, but it could be a problem for some third-party mech drivers if they shut down the port on the source and activate it on the target while the instance is still on the source.

Any thoughts on this?

Additional use cases that would be enabled with this change
When updating the host_id in pre_live_migration, we could modify the domain.xml with the new vif information before live migration (see patch [2] and nova spec [4]).
This enables the following use cases

#1 Live Migration between nodes that run different l2 agents
   E.g. you could migrate an instance from an ovs node to an lb node and vice versa. This could be used as an l2 agent transition strategy!
#2 Live Migration with macvtap agent
   It would enable the macvtap agent to live migrate instances between hosts that use a different physical_interface_mapping. See bug [3]

--> #2 is the use case that made me think about this whole topic....
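
For use case #2, the domain.xml change could look roughly like the sketch below. It assumes the guest interface is a libvirt <interface type='direct'> element with <mac address=...> and <source dev=...> children; the function name is hypothetical, and the new source device would in practice come from the vif information of the port bound to the target host.

```python
import xml.etree.ElementTree as ET


def retarget_macvtap_interface(domain_xml, mac, new_source_dev):
    """Return domain_xml with the macvtap source device replaced.

    Only the direct-type interface whose MAC address matches `mac`
    is touched; everything else is left as it was.
    """
    root = ET.fromstring(domain_xml)
    for iface in root.findall("./devices/interface[@type='direct']"):
        mac_el = iface.find("mac")
        source_el = iface.find("source")
        if mac_el is None or source_el is None:
            continue
        if mac_el.get("address") == mac:
            source_el.set("dev", new_source_dev)
    return ET.tostring(root, encoding="unicode")
```

The rewritten xml would then be the one handed to libvirt for the migration instead of the xml that is active on the source.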

Potential other solutions
#1 Have something like simultaneous portbinding: during migration, a port is bound to 2 hosts (like a DVR port can be today).
This would require some database refactoring (work has already been started in the DVR context [7]).
The REST API would also need to change so that it returns a list of bindings instead of a single binding. Create and update would of course have to handle that list as well.

#2 Execute portbinding without saving it to the db
We could also introduce a new API (like update port with a live migration flag) that would run through the portbinding code and return the port
information for the target node, but would not persist this information. So on port-show you would still get the old information. The update would only be persisted if the migration flag is not present (in post_live_migration, like today).
Alternatively, the generated portbinding could be stored in the port context and be used on the final port_update instead of running through all the code paths again.
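
A rough sketch of how option #2 could behave, under assumptions: the in-memory dict stands in for the Neutron db, bind stands in for the mechanism driver binding code, and the migration_dry_run flag is a hypothetical name for the live migration flag. None of this is an existing Neutron API.

```python
import copy


def update_port_binding(ports, port_id, host, bind, migration_dry_run=False):
    """Compute the binding for `host`; persist it only on a real update.

    ports: dict of port_id -> port dict (stand-in for the db).
    bind:  callable returning the binding attributes for a port dict.
    """
    # Work on a copy so the stored port is untouched during a dry run.
    candidate = copy.deepcopy(ports[port_id])
    candidate["binding:host_id"] = host
    candidate.update(bind(candidate))

    if not migration_dry_run:
        ports[port_id] = candidate
    return candidate
```

Nova would get the target vif information back from the dry run, while a port-show would keep returning the source binding until the final update in post_live_migration.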

Other efforts in the area of nova-neutron live migration
Just for reference, those are the other activities around nova-neutron live migration I'm aware of. But none of them is related to this IMO.

#1 ovs-hybrid plug: wait for vif-plug event before doing live migration
see patches [5]
--> on nova plug, nova creates the linuxbridge and the veth pair and plugs it into br-int. This plug is detected by the ovs agent, which then reports the device as up,
which in turn triggers the vif-plug event. This does not solve the problem, as portbinding is not involved at all. This patch also cannot be used for lb, ovs normal and macvtap,
as for those vif-types libvirt sets up the device that the agent is looking for, and that only happens during the live migration operation.

#2 Implement setup_networks_on_host for Neutron networks
Notification that Neutron sets up a DVR router attachment on the target node
see patch [6] + related patches

#3 I also know that MidoNet faces some challenges during nova plug
but this is also a separate topic

Any discussion / input would be helpful, thanks a lot!

[1] https://review.openstack.org/#/c/274097/6/doc/source/devref/live_migration.rst
[2] https://review.openstack.org/297100
[3] https://bugs.launchpad.net/neutron/+bug/1550400
[4] https://review.openstack.org/301090
[5] https://review.openstack.org/246898 & https://review.openstack.org/246910
[6] https://review.openstack.org/275073
[7] https://bugs.launchpad.net/neutron/+bug/1367391

Andreas (IRC: scheuran)
