[openstack-dev] [nova] [neutron] PCI pass-through network support
yunhong.jiang at linux.intel.com
Fri Jan 17 01:36:28 UTC 2014
On Thu, 2014-01-16 at 01:28 +0100, Ian Wells wrote:
> To clarify a couple of Robert's points, since we had a conversation about this:
> On 15 January 2014 23:47, Robert Li (baoli) <baoli at cisco.com> wrote:
> --- do we agree that BDF address (or device id, whatever
> you call it), and node id shouldn't be used as attributes in
> defining a PCI flavor?
> Note that the current spec doesn't actually exclude it as an option.
> It's just an unwise thing to do. In theory, you could elect to define
> your flavors using the BDF attribute but determining 'the card in this
> slot is equivalent to all the other cards in the same slot in other
> machines' is probably not the best idea... We could lock it out as an
> option or we could just assume that administrators wouldn't be daft
> enough to try.
> * the compute node needs to know the PCI flavor:
>   - to support live migration, we need to use it to create the network XML
> I didn't understand this at first and it took me a while to get what
> Robert meant here.
> This is based on Robert's current code for macvtap based live
> migration. The issue is that if you wish to migrate a VM and it's
> tied to a physical interface, you can't guarantee that the same
> physical interface is going to be used on the target machine, but at
> the same time you can't change the libvirt.xml as it comes over with
> the migrating machine. The answer is to define a network and refer
> out to it from libvirt.xml. In Robert's current code he's using the
> group name of the PCI devices to create a network containing the list
> of equivalent devices (those in the group) that can be macvtapped.
> Thus when the host migrates it will find another, equivalent,
> interface. This falls over in the use case under consideration where
> a device can be mapped using more than one flavor, so we have to
> discard the use case or rethink the implementation.
> There's a more complex solution - I think - where we create a
> temporary network for each macvtap interface a machine's going to use,
> with a name based on the instance UUID and port number, and containing
> the device to map. Before starting the migration we would create a
> replacement network containing only the new device on the target host;
> migration would find the network from the name in the libvirt.xml, and
> the content of that network would behave identically. We'd be
> creating libvirt networks on the fly and a lot more of them, and we'd
> need decent cleanup code too ('when freeing a PCI device, delete any
> network it's a member of'), so it all becomes a lot more hairy.
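The per-port scheme Ian describes might produce networks like the following sketch (the network name, forward mode, and device name are illustrative assumptions, not Robert's actual code):

```xml
<!-- Hypothetical per-port libvirt network, named after the instance UUID
     and port number; it contains only the device chosen on this host. -->
<network>
  <name>inst-4f3c2d7b-port0</name>
  <forward mode="bridge">
    <interface dev="eth21"/>
  </forward>
</network>
```

Before migration, an identically named network pointing at a different device would be created on the target host, so the libvirt.xml reference stays valid.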
Ian/Robert, below is my understanding of the method Robert wants to use;
am I right?
a) Define a libvirt network as described in the "Using a macvtap 'direct'
connection" section of http://libvirt.org/formatnetwork.html, for example
one named via <name>group_name1</name>.
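A minimal sketch of such a definition, following the macvtap example in formatnetwork.html (the forward mode and interface names are illustrative assumptions):

```xml
<network>
  <name>group_name1</name>
  <forward mode="bridge">
    <!-- pool of physical interfaces available for macvtap connections -->
    <interface dev="eth20"/>
    <interface dev="eth21"/>
  </forward>
</network>
```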
b) When assigning SRIOV NIC devices to an instance, use the libvirt
network definition group_name1, as described in the "Assignment from a
pool of SRIOV VFs in a libvirt <network> definition" section of
http://wiki.libvirt.org/page/Networking#PCI_Passthrough_of_host_network_devices.
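The example elided above presumably resembled the guest interface definition from that wiki section, which refers to the network by name instead of naming a concrete device (a sketch, not Robert's actual code):

```xml
<interface type='network'>
  <source network='group_name1'/>
</interface>
```

libvirt then allocates a free device from the network's pool when the guest starts.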
If my understanding is correct, then a couple of things are still unclear to me:
a) How will the libvirt network (i.e. group_name1) be created? Will it
be created when the compute node boots up, or will it be created just
before instance creation? I suppose that per Robert's design it is
created when the compute node comes up; am I right?
b) What happens if all the interfaces are used up by instances?
Considering that 4 interfaces are allocated to the group_name1 libvirt
network, what will happen if a user tries to migrate 6 instances that
use the 'group_name1' network?
And below is my comments:
a) Yes, this in fact differs from the current Nova PCI support
philosophy. Currently we assume Nova owns the devices and manages the
device assignment to each instance, whereas in this scheme the libvirt
network is in fact another layer of PCI device management (although a
very thin one).
b) This also reminds me that other VMMs, like XenAPI, may have special
requirements, and we need input/confirmation from them as well.
As for how to resolve the issue, I think there are several solutions:
a) Create one libvirt network for each SRIOV NIC assigned to each
instance, i.e. each libvirt network always contains only one interface;
it may be created statically or dynamically. This solution in fact
removes the allocation functionality of the libvirt network and leaves
only the configuration functionality.
b) Change Nova PCI to support a special type of PCI device attribute
(like the PCI group). For these attributes, the PCI device scheduler
will match a device only if the attribute is explicitly specified in the
PCI flavor.
Below is an example:
considering two PCI SRIOV device:
Dev1: BDF=00:0.1, vendor_id=1, device_id=1, group=grp1
Dev2: BDF=00:1.1, vendor_id=1, device_id=2
i.e. no group attribute is specified for Dev2.
And we mark 'group' attribute as special attributes.
Considering follow flavors:
Flavor1: name=flv1, vendor_id=1
Flavor2: name=flv2, vendor_id=1, group=grp1
Flavor3: name=flv3, group=grp1
Dev1 will never be assigned to flv1, since flv1 does not explicitly
specify the 'group' attribute; likewise, Dev2 can only satisfy flv1.
This solution tries to separate the devices managed exclusively by Nova
from the devices managed jointly by Nova and libvirt.
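Solution b) can be sketched in Python as follows; this is a hypothetical matcher, not Nova's actual scheduler code, and the attribute names and the SPECIAL_ATTRS set are assumptions for illustration:

```python
# Sketch of solution b): "special" device attributes (here just 'group')
# must be explicitly named by a flavor before the device can match it.

SPECIAL_ATTRS = {'group'}  # assumed set of special attributes


def flavor_matches(flavor, device):
    """Return True if `device` can satisfy `flavor` under the proposed rule."""
    spec = {k: v for k, v in flavor.items() if k != 'name'}
    # Ordinary matching: every attribute the flavor requests must match.
    if any(device.get(attr) != value for attr, value in spec.items()):
        return False
    # Special rule: a special attribute present on the device must be
    # explicitly requested by the flavor, otherwise the device is held
    # back for flavors that do name it.
    return all(attr in spec for attr in SPECIAL_ATTRS if attr in device)


# The two devices and three flavors from the example above:
dev1 = {'bdf': '00:0.1', 'vendor_id': '1', 'device_id': '1', 'group': 'grp1'}
dev2 = {'bdf': '00:1.1', 'vendor_id': '1', 'device_id': '2'}
flv1 = {'name': 'flv1', 'vendor_id': '1'}
flv2 = {'name': 'flv2', 'vendor_id': '1', 'group': 'grp1'}
flv3 = {'name': 'flv3', 'group': 'grp1'}
```

Under this rule Dev1 matches flv2 and flv3 but never flv1, while Dev2 matches only flv1.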