[openstack-dev] [nova] [neutron] PCI pass-through network support

Irena Berezovsky irenab at mellanox.com
Sun Jan 19 07:15:31 UTC 2014

Hi Robert, Yonhong,
Although the network XML solution (option 1) is very elegant, it has one major disadvantage. As Robert mentioned, with network XML there is no way to know which SR-IOV PCI device was actually allocated. When neutron is responsible for setting the networking configuration, managing admin status, and setting security groups, it must be able to identify the SR-IOV PCI device in order to apply that configuration. With the current libvirt network XML implementation, this does not seem possible.
Between options (2) and (3) I have no preference; the solution should be as simple as possible.
Option (3), which I raised, can be achieved by renaming the network interface of the Virtual Function via 'ip link set  name'. The logical interface name can be based on the neutron port UUID. This will allow neutron to discover devices if the backend plugin requires it. When a VM is migrated, a suitable Virtual Function should be allocated on the target node, and its corresponding network interface should then be renamed to the same logical name. This can be done without rebooting the system. We still need to check how the network interface of the Virtual Function can be returned to its original name once it is no longer used as a VM vNIC.
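
A minimal sketch of the rename step, assuming the logical name is derived from the neutron port UUID. The "vf" prefix, the helper names, and the truncation scheme are illustrative assumptions, not an agreed convention. The one hard constraint used is that Linux interface names are limited to 15 characters, so a full UUID cannot be used as-is. The commands are only built here, not executed.

```python
import uuid

IFNAMSIZ_MAX = 15  # Linux limit: IFNAMSIZ is 16 bytes including the trailing NUL


def vf_ifname(port_id, prefix="vf"):
    """Derive a short interface name from a neutron port UUID (hypothetical scheme)."""
    hex_id = uuid.UUID(port_id).hex  # validates the UUID and strips the dashes
    return (prefix + hex_id)[:IFNAMSIZ_MAX]


def rename_commands(current_name, new_name):
    """Return the iproute2 commands that would rename an interface.

    The link is taken down first because the kernel may refuse to rename
    an interface that is up.
    """
    return [
        ["ip", "link", "set", "dev", current_name, "down"],
        ["ip", "link", "set", "dev", current_name, "name", new_name],
        ["ip", "link", "set", "dev", new_name, "up"],
    ]


port_id = "3f2a1c9e-8b7d-4e6f-9a0b-1c2d3e4f5a6b"  # made-up port UUID
name = vf_ifname(port_id)
cmds = rename_commands("eth20", name)
```

Because the UUID is truncated, two ports could in principle map to the same name, and restoring the original name later would require recording it (for example "eth20" above) before the rename.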


-----Original Message-----
From: Jiang, Yunhong [mailto:yunhong.jiang at intel.com] 
Sent: Friday, January 17, 2014 9:06 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

Robert, thanks for your long reply. Personally I'd prefer option 2/3, as it keeps Nova the only entity for PCI management.

Glad you are ok with Ian's proposal and that we have a solution to resolve the libvirt network scenario in that framework.


> -----Original Message-----
> From: Robert Li (baoli) [mailto:baoli at cisco.com]
> Sent: Friday, January 17, 2014 7:08 AM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network 
> support
> Yunhong,
> Thank you for bringing up live migration support. In
> addition to the two solutions you mentioned, Irena has a different
> solution. Let me put all of them here again:
>     1. network xml/group based solution.
>        In this solution, each host that supports a provider 
> net/physical net can define a SRIOV group (it's hard to avoid the term 
> as you can see from the suggestion you made based on the PCI flavor 
> proposal). For each SRIOV group supported on a compute node, a network
> XML will be created the first time the nova compute service runs
> on that node.
>         * nova will conduct scheduling, but not PCI device allocation
>         * it's a simple and clean solution, documented in libvirt as 
> the way to support live migration with SRIOV. In addition, a network 
> xml is nicely mapped into a provider net.
>     2. network xml per PCI device based solution
>        This is the solution you brought up in this email, and Ian 
> mentioned this to me as well. In this solution, a network xml is
> created when a VM is created. The network xml needs to be removed once
> the VM is removed. This hasn't been tried out as far as I know.
>     3. interface xml/interface rename based solution
>        Irena brought this up. In this solution, the ethernet interface 
> name corresponding to the PCI device attached to the VM needs to be 
> renamed. One way to do so without requiring a system reboot is to change
> the udev rules file for interface renaming, followed by a udev
> reload.
> Now, with the first solution, Nova doesn't seem to have control over 
> or visibility of the PCI device allocated for the VM before the VM is 
> launched. This needs to be confirmed against what libvirt supports, to
> see whether such a capability can be provided. This may be a potential drawback
> if a neutron plugin requires detailed PCI device information for operation.
> Irena may provide more insight into this. Ideally, neutron shouldn't 
> need this information because the device configuration can be done by 
> libvirt invoking the PCI device driver.
> The other two solutions are similar. For example, you can view the 
> second solution as one way to rename an interface, or camouflage an 
> interface under a network name. They all require additional work
> before the VM is created and after the VM is removed.
> I also agree with you that we should take a look at XenAPI on this.
> With regard to your suggestion on how to implement the first solution
> with some predefined group attribute, I think it definitely can be
> done. As I pointed out earlier, the PCI flavor proposal is
> actually a generalized version of the PCI group. In other words, in
> the PCI group proposal, we have one predefined attribute called PCI
> group, and everything else works on top of that. In the PCI flavor
> proposal, the attributes are arbitrary. So we can certainly define a
> particular attribute for networking, which for now let's call
> sriov_group. But I can see that with this idea of predefined attributes,
> more of them will be required by different types of devices in the
> future. I'm sure it will keep us busy, although I'm not sure it's in a good way.
> I was expecting that you or someone else could provide a practical deployment
> scenario that would justify the flexibility and the complexity.
> Although I'd prefer to keep it simple and generalize it later once a
> particular requirement is clearly identified, I'm fine to go with it
> if that's what most of the folks want to do.
> --Robert
> On 1/16/14 8:36 PM, "yunhong jiang" <yunhong.jiang at linux.intel.com>
> wrote:
> >On Thu, 2014-01-16 at 01:28 +0100, Ian Wells wrote:
> >> To clarify a couple of Robert's points, since we had a conversation
> >> earlier:
> >> On 15 January 2014 23:47, Robert Li (baoli) <baoli at cisco.com> wrote:
> >>           ---  do we agree that BDF address (or device id, whatever
> >>         you call it), and node id shouldn't be used as attributes in
> >>         defining a PCI flavor?
> >>
> >>
> >> Note that the current spec doesn't actually exclude it as an option.
> >> It's just an unwise thing to do.  In theory, you could elect to 
> >> define your flavors using the BDF attribute but determining 'the 
> >> card in this slot is equivalent to all the other cards in the same 
> >> slot in other machines' is probably not the best idea...  We could 
> >> lock it out as an option or we could just assume that 
> >> administrators wouldn't be daft enough to try.
> >>
> >>
> >>                 * the compute node needs to know the PCI flavor.
> >>         [...]
> >>                           - to support live migration, we need to use
> >>         it to create network xml
> >>
> >>
> >> I didn't understand this at first and it took me a while to get 
> >> what Robert meant here.
> >>
> >> This is based on Robert's current code for macvtap based live 
> >> migration.  The issue is that if you wish to migrate a VM and it's 
> >> tied to a physical interface, you can't guarantee that the same 
> >> physical interface is going to be used on the target machine, but 
> >> at the same time you can't change the libvirt.xml as it comes over 
> >> with the migrating machine.  The answer is to define a network and 
> >> refer out to it from libvirt.xml.  In Robert's current code he's 
> >> using the group name of the PCI devices to create a network 
> >> containing the list of equivalent devices (those in the group) that can be macvtapped.
> >> Thus when the host migrates it will find another, equivalent, 
> >> interface.  This falls over in the use case under consideration 
> >> where a device can be mapped using more than one flavor, so we have 
> >> to discard the use case or rethink the implementation.
> >>
> >> There's a more complex solution - I think - where we create a
> >> temporary network for each macvtap interface a machine's going to use,
> >> with a name based on the instance UUID and port number, and containing
> >> the device to map.  Before starting the migration we would create a
> >> replacement network containing only the new device on the target host;
> >> migration would find the network from the name in the libvirt.xml, 
> >> and the content of that network would behave identically.  We'd be 
> >> creating libvirt networks on the fly and a lot more of them, and 
> >> we'd need decent cleanup code too ('when freeing a PCI device, 
> >> delete any network it's a member of'), so it all becomes a lot more hairy.
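
The per-port network idea above can be sketched as a small bookkeeping model. This is only an illustration under assumptions: the name format, the helper names, and the device names are made up, and no libvirt calls are made. It shows the two properties the paragraph relies on: the network name is the same on both hosts while the device inside it differs, and freeing a device identifies the networks to clean up.

```python
def port_network_name(instance_uuid, port_index):
    """Per-port network name; identical on source and target hosts."""
    return "%s-port%d" % (instance_uuid, port_index)


def port_networks(instance_uuid, host_devs):
    """Map each per-port network name to the single host device it contains."""
    return {
        port_network_name(instance_uuid, index): dev
        for index, dev in enumerate(host_devs)
    }


def networks_to_delete(networks, freed_dev):
    """Cleanup rule from the text: delete any network the freed device is in."""
    return sorted(name for name, dev in networks.items() if dev == freed_dev)


instance = "11111111-2222-3333-4444-555555555555"  # made-up instance UUID
source = port_networks(instance, ["eth20", "eth21"])  # devices on the source host
target = port_networks(instance, ["eth33", "eth30"])  # devices on the target host
stale = networks_to_delete(source, "eth21")
```

Since `source` and `target` share the same keys, the name stored in libvirt.xml would resolve on either host, which is the point of the scheme.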
> >> _______________________________________________
> >> OpenStack-dev mailing list
> >> OpenStack-dev at lists.openstack.org
> >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
> >Ian/Robert, below is my understanding of the method Robert wants to
> >use. Am I right?
> >
> >a) Define a libvirt network as described in the "Using a macvtap "direct" connection"
> >section at http://libvirt.org/formatnetwork.html . For example, like
> >the following one:
> ><network>
> >  <name>group_name1</name>
> >  <forward mode="bridge">
> >    <interface dev="eth20"/>
> >    <interface dev="eth21"/>
> >    <interface dev="eth22"/>
> >    <interface dev="eth23"/>
> >    <interface dev="eth24"/>
> >  </forward>
> ></network>
> >
> >
> >b) When assigning SRIOV NIC devices to an instance, as in the "Assignment
> >from a pool of SRIOV VFs in a libvirt <network> definition" section in
> >http://wiki.libvirt.org/page/Networking#PCI_Passthrough_of_host_network_devices
> >use the libvirt network definition group_name1. For example, like
> >the following one:
> >
> >  <interface type='network'>
> >    <source network='group_name1'/>
> >  </interface>
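
A hedged sketch of how the two XML fragments above could be generated and the group network registered once per compute node. The function names are hypothetical. `conn` is assumed to behave like a libvirt-python connection object; only `listNetworks`, `listDefinedNetworks`, `networkDefineXML`, `setAutostart` and `create` are used, and nothing here is taken from Robert's actual code.

```python
import xml.etree.ElementTree as ET


def group_network_xml(group_name, devs):
    """Build the <network> pool definition for one SRIOV group."""
    net = ET.Element("network")
    ET.SubElement(net, "name").text = group_name
    forward = ET.SubElement(net, "forward", mode="bridge")
    for dev in devs:
        ET.SubElement(forward, "interface", dev=dev)
    return ET.tostring(net, encoding="unicode")


def interface_xml(group_name):
    """Build the domain <interface> that refers to the pool by name only."""
    iface = ET.Element("interface", type="network")
    ET.SubElement(iface, "source", network=group_name)
    return ET.tostring(iface, encoding="unicode")


def ensure_group_network(conn, group_name, devs):
    """Define and start the network unless libvirt already knows it.

    Returns True if the network was created, False if it already existed.
    """
    known = set(conn.listNetworks()) | set(conn.listDefinedNetworks())
    if group_name in known:
        return False
    net = conn.networkDefineXML(group_network_xml(group_name, devs))
    net.setAutostart(1)  # bring the network back after a host reboot
    net.create()         # start it now
    return True
```

Calling `ensure_group_network` from compute service startup would match the "created the first time the nova compute service is running" behaviour described earlier in the thread; the domain XML then carries only the group name, not a specific device.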
> >
> >
> >If my understanding is correct, then a few things are still unclear to me:
> >a) How will the libvirt network (i.e. libvirt network group_name1) be
> >created? Will it be created when the compute node boots up, or
> >will it be created before instance creation? I suppose per Robert's
> >design, it's created when the compute node is up, am I right?
> >
> >b) What will happen if all the interfaces are used up by instances?
> >Suppose 4 interfaces are allocated to the group_name1 libvirt
> >network and a user tries to migrate 6 instances with the 'group_name1'
> >network. What will happen?
> >
> >And below are my comments:
> >
> >a) Yes, this is in fact different from the current nova PCI support
> >philosophy. Currently we assume Nova owns the devices and manages the
> >device assignment to each instance. In this situation, however, the
> >libvirt network is in fact another PCI device management layer
> >(although a very thin one)!
> >
> >b) This also reminds me that other VMMs like XenAPI may have
> >special requirements, and we need input/confirmation from them as well.
> >
> >
> >As for how to resolve the issue, I think there are several solutions:
> >
> >a) Dynamically create one libvirt network for each SRIOV NIC assigned
> >to each instance, i.e. the libvirt network always includes only one
> >interface; it may be created statically or dynamically.
> >This solution in fact removes the allocation functionality of the
> >libvirt network and leaves only the configuration functionality.
> >
> >b) Change Nova PCI to support a special type of PCI device attribute
> >(like the PCI group). For these PCI attributes, the PCI device
> >scheduler will match a PCI device only if the attribute is
> >specified explicitly in the PCI flavor.
> >  Below is an example.
> >  Consider two PCI SRIOV devices:
> >        Dev1: BDF=00:0.1, vendor_id=1, device_id=1, group=grp1
> >        Dev2: BDF=00:1.1, vendor_id=1, device_id=2
> >    i.e. Dev2 has no group attribute specified.
> >
> >   And we mark the 'group' attribute as a special attribute.
> >
> >   Consider the following flavors:
> >        Flavor1: name=flv1, vendor_id=1
> >        Flavor2: name=flv2, vendor_id=1, group=grp1
> >        Flavor3: name=flv3, group=grp1.
> >
> >   Dev1 will never be assigned to flv1.
> >   This solution tries to separate the devices managed by Nova
> >exclusively from the devices managed by Nova and libvirt together.
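
One possible reading of the special-attribute rule, written out as a small matcher over the devices and flavors from the example. This is a sketch under assumptions: the dictionary layout and the `matches` helper are hypothetical, not Nova's PCI scheduler code. Read literally, the rule hides a device that carries a special attribute from any flavor that does not name that attribute, so Dev1 is excluded from flv1, which omits `group`.

```python
SPECIAL_ATTRS = {"group"}  # attributes a flavor must name explicitly


def matches(device, flavor):
    """Return True if the device may be allocated for the flavor."""
    spec = {key: value for key, value in flavor.items() if key != "name"}
    # Every attribute the flavor names must be on the device with the same value.
    if any(device.get(key) != value for key, value in spec.items()):
        return False
    # A device carrying a special attribute is visible only to flavors naming it.
    return all(attr in spec for attr in SPECIAL_ATTRS if attr in device)


dev1 = {"BDF": "00:0.1", "vendor_id": 1, "device_id": 1, "group": "grp1"}
dev2 = {"BDF": "00:1.1", "vendor_id": 1, "device_id": 2}
flavors = [
    {"name": "flv1", "vendor_id": 1},
    {"name": "flv2", "vendor_id": 1, "group": "grp1"},
    {"name": "flv3", "group": "grp1"},
]

# For each flavor, list the BDFs of the devices it can use.
result = {
    flavor["name"]: [dev["BDF"] for dev in (dev1, dev2) if matches(dev, flavor)]
    for flavor in flavors
}
```

Under this reading, flv1 can only receive Dev2, while flv2 and flv3 can only receive Dev1, which keeps the grouped devices apart from the ones Nova manages on its own.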
> >
> >Any idea?
> >
> >Thanks
> >--jyh
> >
> >