[openstack-dev] [Nova][Neutron] [Live Migration] Prevent invalid live migration instead of failing and setting instance to error state after portbinding failed

Kevin Benton kevin at benton.pub
Tue Apr 12 11:12:49 UTC 2016


We can't change the host_id until after the migration or it will break
l2pop and other drivers that use the host as a location indicator (e.g. many
top-of-rack drivers use it to determine which switch port should be wired
up).

There is already a patch that went in to inform Neutron of the destination
host for proactive DVR wiring: https://review.openstack.org/#/c/275420/
During this port update phase, we can validate that the destination host is
'bindable' with the given port info and fail the update otherwise. This should
block Nova from continuing.
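
For illustration only, the Nova-side hint could be a plain port update that
puts the destination host into the port's binding profile, roughly in the
spirit of the patch above. This is a minimal sketch, not the actual Nova code;
the profile key name ('migrating_to') and the client wiring are assumptions of
this example:

    # Minimal sketch (not the actual Nova code) of hinting the destination
    # host to Neutron through the port's binding profile. The profile key
    # 'migrating_to' and the client setup below are assumptions.
    from neutronclient.v2_0 import client as neutron_client

    neutron = neutron_client.Client(username='nova',
                                    password='secret',
                                    tenant_name='service',
                                    auth_url='http://controller:5000/v2.0')

    def hint_migration_destination(port_id, dest_host):
        profile = neutron.show_port(port_id)['port'].get('binding:profile') or {}
        profile['migrating_to'] = dest_host
        # If Neutron decides the destination host is not bindable, it can
        # reject this update and Nova can abort before migrating anything.
        neutron.update_port(port_id, {'port': {'binding:profile': profile}})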

However, we have to figure out how ML2 will know if something is
'bindable'. The only interface we have right now is bind_port. It is
possible that we can do a faked bind_port attempt using what the port
host_id would look like after migration. It's made clear in the ML2 driver
API that bind_port results may not be committed:
https://github.com/openstack/neutron/blob/4440297a2ff5a6893b748c2400048e840283c718/neutron/plugins/ml2/driver_api.py#L869

So the workflow would be something like:
* Nova calls Neutron port update with the destination host in the binding
details
* In ML2 port update, the destination host is placed into a copy of the
port in the host_id field and bind_port is called.
* If bind_port is unsuccessful, it fails the port update for Nova to
prevent migration.
* If bind_port is successful, the results of the port update are committed
(with the original host_id and the new host_id in the destination_host
field).
* Workflow continues as normal here.
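
To make the dry-run step concrete, here is a toy sketch of the validation. It
is not working ML2 code: the try_bind callable is a hypothetical stand-in for
whatever non-committing wrapper around the mechanism manager's bind_port we
would end up with, assumed to return the resulting vif_type without persisting
anything.

    # Toy sketch of the dry-run validation; 'try_bind' is a hypothetical
    # stand-in for a non-committing wrapper around ML2's bind_port.
    import copy

    VIF_TYPE_BINDING_FAILED = 'binding_failed'


    class DestinationNotBindable(Exception):
        """Raised to fail the port update and stop Nova's live migration."""


    def validate_destination_binding(try_bind, port, dest_host):
        # Pretend the port already lives on the destination host.
        candidate = copy.deepcopy(port)
        candidate['binding:host_id'] = dest_host

        vif_type = try_bind(candidate)  # dry run, results are thrown away

        if vif_type in (None, VIF_TYPE_BINDING_FAILED):
            raise DestinationNotBindable(
                'port %s cannot be bound on host %s'
                % (candidate['id'], dest_host))
        # On success, the real port update proceeds and commits the original
        # host_id plus the destination host field, as described above.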

So this heavily exploits the fact that bind_port is supposed to be
mutation-free in the ML2 driver API. We may encounter drivers that don't
follow this now, but they are already exposed to other bugs if they mutate
state, so I think the onus would be on them to fix their driver.

Cheers,
Kevin Benton

On Tue, Apr 12, 2016 at 3:33 AM, Rossella Sblendido <rsblendido at suse.com>
wrote:

> On 04/12/2016 12:05 PM, Andreas Scheuring wrote:
> > Hi all,
> > I wanted to start a discussion about a live migration problem that currently
> > exists in the Nova-Neutron communication.
> >
> > Basics: Live Migration and Nova - Neutron communication
> > ------------------------------------------------------
> > At a high level, Nova live migration happens in 3 stages (--> marks what
> > happens from the network perspective):
> > #1 pre_live_migration
> >    --> libvirt driver: Nova plugs the interface (for ovs hybrid it sets up
> >        the linuxbridge + veth pair and connects it to br-int)
> > #2 live_migration_operation
> >    --> the instance is migrated (using libvirt with the domain.xml that is
> >        currently active on the migration source)
> > #3 post_live_migration
> >    --> binding:host_id is updated for the port
> >    --> libvirt driver: the domain.xml is regenerated
> > More details can be found here [1]
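> >
> > For illustration only, the binding:host_id update in #3 boils down to a
> > Neutron port update along these lines (a simplified sketch, not the exact
> > Nova code; 'neutron' is assumed to be a python-neutronclient Client, and
> > port_id / dest_host are placeholders):
> >
> >     # simplified sketch of the host_id update done in post_live_migration
> >     neutron.update_port(port_id,
> >                         {'port': {'binding:host_id': dest_host}})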
> >
> > The problem - portbinding fails
> > -------------------------------
> > With this flow, ML2 portbinding is triggered in post_live_migration. At
> > this point, the instance has already been migrated and is active on the
> > migration destination.
> > Part of the portbinding happens in the mechanism drivers, where the vif
> > information for the port (vif-type, vif-details, ...) is updated.
> > If this portbinding fails, the port will get the binding:vif_type
> > "binding_failed".
> > After that, the Nova libvirt driver regenerates the domain.xml in order to
> > persist it. Part of this generation is also generating the interface
> > definition.
> > This fails because the vif_type is "binding_failed". Nova will set the
> > instance to error state. --> There is no rollback, as it's already too late!
> >
> > Just a remark: there is no explicit check for the vif_type
> > "binding_failed". I have the feeling that it (luckily) fails by accident
> > when generating the xml.
> >
> > --> Ideally we would trigger the portbinding before the migration starts,
> > i.e. in pre_live_migration. Then, if binding fails, we could abort the
> > migration before it has even started. The instance would still be active
> > and fully functional on the source host. I have a WIP patchset out
> > proposing this change [2]
> >
> >
> > The impact
> > ----------
> > Patchset [2] proposes updating the host_id already in pre_live_migration.
> > During migration, the port would already be owned by the migration target
> > (although the guest is still active on the source).
> > Technically this works fine for all the reference implementations, but it
> > could be a problem for some third-party mech drivers if they shut down the
> > port on the source and activate it on the target while the instance is
> > still running on the source.
> >
> > Any thoughts on this?
>
> +1 on this; anyway, let's hear back from third-party driver maintainers.
>
> >
> >
> > Additional use cases that would be enabled with this change
> > -----------------------------------------------------------
> > When updating the host_id in pre_live_migration, we could modify the
> domain.xml with the new vif information before live migration (see patch
> [2] and nova spec [4]).
> > This enables the following use cases
> >
> > #1 Live migration between nodes that run different l2 agents
> >    E.g. you could migrate an instance from an ovs node to an lb node and
> >    vice versa. This could be used as an l2 agent transition strategy!
> > #2 Live migration with the macvtap agent
> >    It would enable the macvtap agent to live migrate instances between
> >    hosts that use a different physical_interface_mapping. See bug [3]
> >
> > --> #2 is the use case that made me think about this whole topic....
> >
> > Potential other solutions
> > -------------------------
> > #1 Have something like simultaneous portbinding: on migration, a port is
> > bound to 2 hosts (like a dvr port can be today).
> > Therefore some database refactoring would be required (work has already
> > been started in the DVR context [7]).
> > And the REST API would need to be changed so that not a single binding but
> > a list of bindings is returned, and of course that list could also be
> > created and updated.
> >
>
> I don't like this one. It would require lots of code changes and I am not
> sure it would solve the problem completely. The model of having a port
> bound to two hosts just because it's migrating is confusing.
>
>
> > #2 Execute portbinding without saving it to the db
> > We could also introduce a new API (like update port, with a live migration
> > flag) that would run through the portbinding code and return the port
> > information for the target node, but would not persist this information.
> > So on port-show you would still get the old information. The update would
> > only happen if the migration flag is not present (in post_live_migration,
> > like today).
> > Alternatively, the generated portbinding could be stored in the port
> > context and used on the final port_update instead of running through all
> > the code paths again. A hypothetical sketch of such a call is below.
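> >
> > Purely to illustrate this proposal (no such flag exists in Neutron today;
> > the flag name and semantics here are made up, and 'neutron', port_id and
> > dest_host are placeholders), the call from Nova might look something like:
> >
> >     # hypothetical: 'live_migration' asks Neutron to compute, but not
> >     # persist, the binding for the new host and return the result
> >     body = {'port': {'binding:host_id': dest_host,
> >                      'live_migration': True}}
> >     port = neutron.update_port(port_id, body)['port']
> >     new_vif_type = port['binding:vif_type']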
> >
>
> Another possible solution is to apply the same strategy we use for
> instance creation. Nova should wait to get a confirmation from Neutron
> before declaring the migration successful.
>
> cheers,
>
> Rossella
>
> >
> > Other efforts in the area nova neutron live migration
> > -----------------------------------------------------
> > Just for reference, these are the other activities around nova-neutron
> > live migration I'm aware of, but none of them is related to this, IMO.
> >
> > #1 ovs-hybrid plug: wait for the vif-plug event before doing live migration
> > see patches [5]
> > --> On Nova plug, Nova creates the linuxbridge and the veth pair and plugs
> > it into the br-int. This plug is detected by the ovs agent, which then
> > reports the device as up, which in turn triggers the vif-plug event. This
> > does not solve the problem, as portbinding is not involved at all. This
> > patch can also not be used for lb, ovs normal and macvtap, as for those
> > vif-types libvirt sets up the device that the agent is looking for, and
> > that happens during the live migration operation.
> >
> > #2 Implement setup_networks_on_host for Neutron networks
> > This notifies Neutron so that it can set up a DVR router attachment on
> > the target node.
> > see patch [6] + related patches
> >
> > #3 I also know that midonet faces some challenges during Nova plug,
> > but this is also a separate topic.
> >
> >
> >
> > Any discussion / input would be helpful, thanks a lot!
> >
> >
> > [1]
> https://review.openstack.org/#/c/274097/6/doc/source/devref/live_migration.rst
> > [2] https://review.openstack.org/297100
> > [3] https://bugs.launchpad.net/neutron/+bug/1550400
> > [4] https://review.openstack.org/301090
> > [5] https://review.openstack.org/246898 &
> https://review.openstack.org/246910
> > [6] https://review.openstack.org/275073
> > [7]  https://bugs.launchpad.net/neutron/+bug/1367391
> >
> >
>