[openstack-dev] [Nova][Neutron] [Live Migration] Prevent invalid live migration instead of failing and setting instance to error state after portbinding failed
rsblendido at suse.com
Tue Apr 12 10:33:14 UTC 2016
On 04/12/2016 12:05 PM, Andreas Scheuring wrote:
> Hi together,
> I wanted to start a discussion about a Live Migration problem that currently exists in the Nova-Neutron communication.
> Basics Live Migration and Nova - Neutron communication
> On a high level, Nova Live Migration happens in 3 stages. (--> is what's happening from network perspective)
> #1 pre_live_migration
> --> libvirtdriver: nova plugs the interface (for ovs hybrid sets up the linuxbridge + veth and connects it to br-int)
> #2 live_migration_operation
> --> instance is being migrated (using libvirt with the domain.xml that is currently active on the migration source)
> #3 post_live_migration
> --> binding:host_id is being updated for the port
> --> libvirtdriver: domain.xml is being regenerated
> More details can be found here 
> The problem - portbinding fails
> With this flow, ML2 portbinding is triggered in post_live_migration. At this point, the instance has already been migrated and is active on the migration destination.
> Part of the port-binding is happening in the mechanism drivers, where the vif information for the port (vif-type, vif-details,..) is being updated.
> If this portbinding fails, port will get the binding:vif_type "binding_failed".
> After that the nova libvirt driver starts generating the domain xml again to persist it. Part of this generation is also generating the interface definition.
> This fails as the vif_type is "binding_failed". Nova will set the instance to error state. --> There is no rollback, as it's already too late!
> Just a remark: There is no explicit check for the vif_type binding_failed. I have the feeling that it (luckily) fails by accident when generating the xml.
> --> Ideally we would trigger the portbinding before the migration started - in pre_live_migration. Then, if binding fails, we could abort migration before it even started. The instance would still be
> active and fully functional on the source host. I have a WIP patchset out proposing this change 
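To make the idea of an explicit check concrete, here is a minimal sketch in Python. It assumes ports are plain dicts carrying the binding:vif_type attribute discussed above; PortBindingFailed and check_port_bindings are hypothetical names for illustration, not existing Nova or Neutron code.

```python
VIF_TYPE_BINDING_FAILED = "binding_failed"


class PortBindingFailed(Exception):
    """Hypothetical error raised when a port could not be bound."""


def check_port_bindings(ports):
    """Raise if any port reports a failed binding; otherwise return None.

    `ports` is a list of dicts shaped like Neutron port resources; only
    the 'id' and 'binding:vif_type' keys are used here.
    """
    failed = [p["id"] for p in ports
              if p.get("binding:vif_type") == VIF_TYPE_BINDING_FAILED]
    if failed:
        raise PortBindingFailed(
            "binding failed for ports: %s" % ", ".join(failed))
```

Called in pre_live_migration, right after the binding has been triggered for the target host, a check like this would let the migration be aborted while the instance is still running on the source.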
> The impact
> Patchset  proposes updating the host_id already in pre_live_migration.
> During migration, the port would already be owned by the migration target (although the guest is still active on the source)
> Technically this works fine for all the reference implementations, but it could be a problem for some third-party mech drivers if they shut down the port on the source and activate it on the target, although the instance is still on the source.
> Any thoughts on this?
+1 on this. Anyway, let's hear back from the third-party driver maintainers.
> Additional use cases that would be enabled with this change
> When updating the host_id in pre_live_migration, we could modify the domain.xml with the new vif information before live migration (see patch  and nova spec ).
> This enables the following use cases
> #1 Live Migration between nodes that run different l2 agents
> E.g. you could migrate an instance from an ovs node to an lb node and vice versa. This could be used as an l2 agent transition strategy!
> #2 Live Migration with macvtap agent
> It would enable the macvtap agent to live migrate instances between hosts that use a different physical_interface_mapping. See bug 
> --> #2 is the use case that made me think about this whole topic....
> Potential other solutions
> #1 Have something like simultaneous portbinding - On migration, a port is bound to 2 hosts (like a dvr port can today).
> This would require some database refactoring (work has already been started in the DVR context )
> And the REST API would need to be changed so that it returns a list of bindings instead of a single binding. Create & update would of course also have to handle that list.
I don't like this one. This would require lots of code changes and I am
not sure it would solve the problem completely. The model of having a
port bound to two hosts just because it's migrating is confusing.
> #2 execute portbinding without saving it to db
> We could also introduce a new API (like update port, with a live migration flag) that would run through the portbinding code and return the port
> information for the target node, but would not persist this information. So on port-show you would still get the old information. The update would only happen if the migration flag is not present (in post_live_migration, like today).
> Alternatively, the generated portbinding could be stored in the port context and be used on the final port_update instead of running through all the code paths again.
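A rough sketch of what "execute portbinding without saving it" could look like, in Python. Everything here is an assumption for illustration: the dict `db` stands in for the Neutron database, `mech_driver` is any callable returning binding attributes, and `bind_port` with its `persist` flag is a hypothetical helper, not an existing Neutron API.

```python
import copy


def bind_port(db, port_id, host, mech_driver, persist=True):
    """Run a (hypothetical) binding for `host` and return the resulting port.

    `db` is a dict of port_id -> port dict standing in for the Neutron DB.
    `mech_driver` is any callable (port, host) -> dict of binding attributes.
    With persist=False the stored port is left untouched.
    """
    # Work on a copy so a non-persisted run cannot leak into the stored port.
    candidate = copy.deepcopy(db[port_id])
    candidate["binding:host_id"] = host
    candidate.update(mech_driver(candidate, host))
    if persist:
        db[port_id] = candidate
    return candidate
```

With persist=False the caller gets the vif information for the target node, while a port-show against `db` still returns the old binding.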
Another possible solution is to apply the same strategy we use for
instance creation. Nova should wait to get a confirmation from Neutron
before declaring the migration successful.
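Roughly, the waiting side could look like the following Python sketch. It assumes Neutron would send some kind of success or failure notification for the new binding; BindingConfirmation is a hypothetical helper built on the standard threading.Event, not existing Nova code.

```python
import threading


class BindingConfirmation:
    """Hypothetical holder for a confirmation sent by Neutron."""

    def __init__(self):
        self._event = threading.Event()
        self.success = False

    def notify(self, success):
        # Would be called when the (assumed) notification arrives.
        self.success = success
        self._event.set()

    def wait(self, timeout):
        # True only if a positive confirmation arrived within `timeout`.
        return self._event.wait(timeout) and self.success
```

If wait() returns False, either because of a timeout or a negative confirmation, the migration would not be declared successful.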
> Other efforts in the area nova neutron live migration
> Just for reference, those are the other activities around nova-neutron live migration I'm aware of. But none of them is related to this IMO.
> #1 ovs-hybrid plug: wait for vif-plug event before doing live migration
> see patches 
> --> on nova plug, nova creates the linuxbridge and the veth pair and plugs it into the br-int. This plug is being detected by the ovs agent, which then reports the device as up
> which again triggers this vif-plug event. This does not solve the problem as portbinding is not involved anyhow. This patch can also not be used for lb, ovs normal and macvtap,
> as for those vif-types libvirt sets up the device that the agent is looking for. But this happens during live migration operation.
> #2 Implement setup_networks_on_host for Neutron networks
> Notification that Neutron sets up a DVR router attachment on the target node
> see patch  + related patches
> #3 I also know that midonet faces some challenges during nova plug
> but this is also a separate topic
> Any discussion / input would be helpful, thanks a lot!
>  https://review.openstack.org/#/c/274097/6/doc/source/devref/live_migration.rst
>  https://review.openstack.org/297100
>  https://bugs.launchpad.net/neutron/+bug/1550400
>  https://review.openstack.org/301090
>  https://review.openstack.org/246898 & https://review.openstack.org/246910
>  https://review.openstack.org/275073
>  https://bugs.launchpad.net/neutron/+bug/1367391