[openstack-dev] [nova] [neutron] PCI pass-through network support

Robert Li (baoli) baoli at cisco.com
Tue Jan 21 14:45:32 UTC 2014

Just one comment:
      The devices allocated for an instance are immediately known after
the domain is created. Therefore it's possible to do a port update and
have the device configured while the instance is booting.


On 1/19/14 2:15 AM, "Irena Berezovsky" <irenab at mellanox.com> wrote:

>Hi Robert, Yunhong,
>Although the network XML solution (option 1) is very elegant, it has one
>major disadvantage. As Robert mentioned, with network XML there is no way
>to know which SR-IOV PCI device was actually allocated. Since neutron is
>responsible for setting the networking configuration, managing admin
>status, and applying security groups, it must be able to identify the
>SR-IOV PCI device in order to apply that configuration. Within the
>current libvirt network XML implementation, this does not seem possible.
>Between options (2) and (3) I have no preference; it should be as
>simple as possible.
>Option (3), which I raised, can be achieved by renaming the network
>interface of the Virtual Function via 'ip link set <dev> name <new-name>'.
>The interface's logical name can be based on the neutron port UUID, which
>will allow neutron to discover devices if the backend plugin requires it.
>When a VM migrates, a suitable Virtual Function on the target node should
>be allocated, and then its corresponding network interface should be
>renamed to the same logical name. This can be done without rebooting the
>system. It still needs to be checked how the Virtual Function's
>corresponding network interface can be returned to its original name once
>it is no longer used as a VM vNIC.
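A minimal sketch of how Irena's rename idea could be driven from Python; the name prefix and command sequence are illustrative assumptions (not her actual implementation), and the name is truncated because Linux caps interface names at 15 characters:

```python
import uuid

# Hypothetical sketch: derive a logical interface name from the neutron
# port UUID (Linux limits interface names to 15 characters), then build
# the 'ip link' commands that would rename the Virtual Function's netdev.
def logical_vf_name(port_uuid, prefix="vf"):
    return (prefix + uuid.UUID(port_uuid).hex)[:15]

def rename_commands(current_name, logical_name):
    # The link must be down while it is renamed.
    return [
        f"ip link set {current_name} down",
        f"ip link set {current_name} name {logical_name}",
        f"ip link set {logical_name} up",
    ]

name = logical_vf_name("c1f49cad-7f6f-4f1e-b69c-5b9ca1a0b2f1")
cmds = rename_commands("eth20", name)
```

Returning the VF to its original name, as Irena notes, would be the same sequence run in reverse once the vNIC is released.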
>-----Original Message-----
>From: Jiang, Yunhong [mailto:yunhong.jiang at intel.com]
>Sent: Friday, January 17, 2014 9:06 PM
>To: OpenStack Development Mailing List (not for usage questions)
>Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network
>Robert, thanks for your long reply. Personally I'd prefer option 2/3 as
>it keeps Nova the only entity for PCI management.
>Glad you are ok with Ian's proposal and that we have a solution for the
>libvirt network scenario within that framework.
>> -----Original Message-----
>> From: Robert Li (baoli) [mailto:baoli at cisco.com]
>> Sent: Friday, January 17, 2014 7:08 AM
>> To: OpenStack Development Mailing List (not for usage questions)
>> Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network
>> support
>> Yunhong,
>> Thank you for bringing that up on the live migration support. In
>> addition to the two solutions you mentioned, Irena has a different
>> solution. Let me put all of them here again:
>>     1. network xml/group based solution.
>>        In this solution, each host that supports a provider
>> net/physical net can define a SRIOV group (it's hard to avoid the term
>> as you can see from the suggestion you made based on the PCI flavor
>> proposal). For each SRIOV group supported on a compute node, a network
>> XML will be created the first time the nova compute service runs
>> on that node.
>>         * nova will conduct scheduling, but not PCI device allocation
>>         * it's a simple and clean solution, documented in libvirt as
>> the way to support live migration with SRIOV. In addition, a network
>> xml is nicely mapped into a provider net.
>>     2. network xml per PCI device based solution
>>        This is the solution you brought up in this email, and Ian
>> mentioned it to me as well. In this solution, a network xml is
>> created when a VM is created, and the network xml needs to be removed
>> once the VM is removed. This hasn't been tried out as far as I know.
>>     3. interface xml/interface rename based solution
>>        Irena brought this up. In this solution, the ethernet interface
>> name corresponding to the PCI device attached to the VM needs to be
>> renamed. One way to do so without requiring system reboot is to change
>> the udev rule's file for interface renaming, followed by a udev
>> reload.
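The udev-rule variant described above could look roughly like this; the matching key (the VF's MAC address) and the rule-file path are assumptions for illustration:

```python
# Hypothetical sketch of the udev-rule rename: emit a rule that pins the
# VF's netdev name to the logical name, matched here by MAC address.
# The rule would be appended to a file such as
# /etc/udev/rules.d/70-persistent-net.rules, followed by
# `udevadm control --reload-rules` (both names illustrative).
def udev_rename_rule(mac, logical_name):
    return (
        'SUBSYSTEM=="net", ACTION=="add", '
        f'ATTR{{address}}=="{mac}", NAME="{logical_name}"\n'
    )

rule = udev_rename_rule("fa:16:3e:aa:bb:cc", "vfc1f49cad7f6f4")
```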
>> Now, with the first solution, Nova doesn't seem to have control over
>> or visibility of the PCI device allocated for the VM before the VM is
>> launched. This needs to be confirmed with the libvirt support to see
>> if such capability can be provided. This may be a potential drawback
>> if a neutron plugin requires detailed PCI device information;
>> Irena may provide more insight into this. Ideally, neutron shouldn't
>> need this information because the device configuration can be done by
>> libvirt invoking the PCI device driver.
>> The other two solutions are similar. For example, you can view the
>> second solution as one way to rename an interface, or camouflage an
>> interface under a network name. They all require additional work
>> before the VM is created and after the VM is removed.
>> I also agree with you that we should take a look at XenAPI on this.
>> With regard to your suggestion on how to implement the first solution
>> with some predefined group attribute, I think it definitely can be
>> done. As I have pointed it out earlier, the PCI flavor proposal is
>> actually a generalized version of the PCI group. In other words, in
>> the PCI group proposal, we have one predefined attribute called PCI
>> group, and everything else works on top of that. In the PCI flavor
>> proposal, attribute is arbitrary. So certainly we can define a
>> particular attribute for networking, which let's temporarily call
>> sriov_group. But I can see with this idea of predefined attributes,
>> more of them will be required by different types of devices in the
>> future. I'm sure it will keep us busy although I'm not sure it's in a
>>good way.
>> I was expecting you or someone else can provide a practical deployment
>> scenario that would justify the flexibilities and the complexities.
>> Although I'd prefer to keep it simple and generalize it later once a
>> particular requirement is clearly identified, I'm fine to go with it
>> if that's what most of the folks want to do.
>> --Robert
>> On 1/16/14 8:36 PM, "yunhong jiang" <yunhong.jiang at linux.intel.com>
>> wrote:
>> >On Thu, 2014-01-16 at 01:28 +0100, Ian Wells wrote:
>> >> To clarify a couple of Robert's points, since we had a conversation
>> >> earlier:
>> >> On 15 January 2014 23:47, Robert Li (baoli) <baoli at cisco.com> wrote:
>> >>           ---  do we agree that BDF address (or device id, whatever
>> >>         you call it), and node id shouldn't be used as attributes in
>> >>         defining a PCI flavor?
>> >>
>> >>
>> >> Note that the current spec doesn't actually exclude it as an option.
>> >> It's just an unwise thing to do.  In theory, you could elect to
>> >> define your flavors using the BDF attribute but determining 'the
>> >> card in this slot is equivalent to all the other cards in the same
>> >> slot in other machines' is probably not the best idea...  We could
>> >> lock it out as an option or we could just assume that
>> >> administrators wouldn't be daft enough to try.
>> >>
>> >>
>> >>                 * the compute node needs to know the PCI flavor.
>> >>         [...]
>> >>                           - to support live migration, we need to
>> >>         use it to create network xml
>> >>
>> >>
>> >> I didn't understand this at first and it took me a while to get
>> >> what Robert meant here.
>> >>
>> >> This is based on Robert's current code for macvtap based live
>> >> migration.  The issue is that if you wish to migrate a VM and it's
>> >> tied to a physical interface, you can't guarantee that the same
>> >> physical interface is going to be used on the target machine, but
>> >> at the same time you can't change the libvirt.xml as it comes over
>> >> with the migrating machine.  The answer is to define a network and
>> >> refer out to it from libvirt.xml.  In Robert's current code he's
>> >> using the group name of the PCI devices to create a network
>> >> containing the list of equivalent devices (those in the group) that
>> >> can be macvtapped.
>> >> Thus when the host migrates it will find another, equivalent,
>> >> interface.  This falls over in the use case under consideration
>> >> where a device can be mapped using more than one flavor, so we have
>> >> to discard the use case or rethink the implementation.
>> >>
>> >> There's a more complex solution - I think - where we create a
>> >> temporary network for each macvtap interface a machine's going to
>> >> use, with a name based on the instance UUID and port number, and
>> >> containing the device to map.  Before starting the migration we
>> >> would create a replacement network containing only the new device
>> >> on the target host; migration would find the network from the name
>> >> in the libvirt.xml, and the content of that network would behave
>> >> identically.  We'd be creating libvirt networks on the fly and a
>> >> lot more of them, and we'd need decent cleanup code too ('when
>> >> freeing a PCI device, delete any network it's a member of'), so it
>> >> all becomes a lot more complex.
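The per-port temporary network scheme described above could be sketched as follows; the naming convention and XML layout are illustrative assumptions, not Robert's actual code:

```python
# Illustrative sketch: one libvirt network per macvtap interface, named
# from the instance UUID and port number, containing exactly the one
# device to map. On the target host the same name would be redefined
# with the newly allocated device before migration starts.
def port_network_xml(instance_uuid, port_number, ifname):
    name = f"{instance_uuid}-{port_number}"
    body = (
        "<network>\n"
        f"  <name>{name}</name>\n"
        "  <forward mode='bridge'>\n"
        f"    <interface dev='{ifname}'/>\n"
        "  </forward>\n"
        "</network>\n"
    )
    return name, body

net_name, net_xml = port_network_xml(
    "8d5a0356-6f0e-4f33-a3f9-3a9e2c9f0b11", 0, "eth21")
```

The cleanup rule quoted above ('when freeing a PCI device, delete any network it's a member of') would then amount to undefining the network whose name matches the instance UUID and port number.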
>> >> _______________________________________________
>> >> OpenStack-dev mailing list
>> >> OpenStack-dev at lists.openstack.org
>> >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>> >
>> >Ian/Robert, below is my understanding of the method Robert wants to
>> >use, am I right?
>> >
>> >a) Define a libvirt network as in the "Using a macvtap 'direct'
>> >connection" section at http://libvirt.org/formatnetwork.html. For
>> >example, like the following one:
>> ><network>
>> >  <name>group_name1</name>
>> >  <forward mode="bridge">
>> >    <interface dev="eth20"/>
>> >    <interface dev="eth21"/>
>> >    <interface dev="eth22"/>
>> >    <interface dev="eth23"/>
>> >    <interface dev="eth24"/>
>> >  </forward>
>> ></network>
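For illustration, a network definition like the one above could be built programmatically; with a live connection the result would be passed to the libvirt Python binding's conn.networkDefineXML(), which is omitted here so the sketch stays self-contained:

```python
import xml.etree.ElementTree as ET

# Illustrative sketch: build the pool-style network XML shown above.
# Device names and the group name mirror the example in this mail.
def group_network_xml(group_name, devs):
    net = ET.Element("network")
    ET.SubElement(net, "name").text = group_name
    fwd = ET.SubElement(net, "forward", mode="bridge")
    for dev in devs:
        ET.SubElement(fwd, "interface", dev=dev)
    return ET.tostring(net, encoding="unicode")

net_xml = group_network_xml("group_name1",
                            [f"eth{i}" for i in range(20, 25)])
```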
>> >
>> >
>> >b) When assigning SRIOV NIC devices to an instance, as in the
>> >"Assignment from a pool of SRIOV VFs in a libvirt <network>
>> >definition" section in
>> >http://wiki.libvirt.org/page/Networking#PCI_Passthrough_of_host_network_devices
>> >, use the libvirt network definition group_name1. For example, like
>> >the following one:
>> >
>> >  <interface type='network'>
>> >    <source network='group_name1'/>
>> >  </interface>
>> >
>> >
>> >If my understanding is correct, then I have something unclear yet:
>> >a) How will libvirt create the libvirt network (i.e. the libvirt
>> >network group_name1)? Does it have to be created when the compute
>> >node boots up, or will it be created before instance creation? I
>> >suppose per Robert's design it's created when the compute node is up,
>> >am I right?
>> >
>> >b) If all the interfaces are used up by instances, what will
>> >happen? Considering that 4 interfaces are allocated to the
>> >group_name1 libvirt network, what will happen if a user tries to
>> >migrate 6 instances with the 'group_name1' network?
>> >
>> >And below is my comments:
>> >
>> >a) Yes, this is in fact different from the current nova PCI support
>> >philosophy. Currently we assume Nova owns the devices and manages the
>> >device assignment to each instance, while in this situation the
>> >libvirt network is in fact another (although very thin) layer of PCI
>> >device management!
>> >
>> >b) This also reminds me that other VMMs like XenAPI possibly have
>> >special requirements, and we need input/confirmation from them as
>> >well.
>> >
>> >
>> >As to how to resolve the issue, I think there are several solutions:
>> >
>> >a) Create one libvirt network for each SRIOV NIC assigned to each
>> >instance dynamically, i.e. the libvirt network always has only one
>> >interface included; it may be statically or dynamically created.
>> >This solution in fact removes the allocation functionality of the
>> >libvirt network and leaves only the configuration functionality.
>> >
>> >b) Change Nova PCI to support a special type of PCI device
>> >attribute (like the PCI group). For these PCI attributes, the PCI
>> >device scheduler will match a PCI device only if the attribute is
>> >specified explicitly in the PCI flavor.
>> >  Below is an example:
>> >  considering two PCI SRIOV device:
>> >	Dev1: BDF=00:0.1, vendor_id=1, device_id=1, group=grp1
>> >	Dev2: BDF=00:1.1, vendor_id=1, device_id=2
>> >    i.e. Dev2 has no group attributes are specified.
>> >
>> >   And we mark 'group' attribute as special attributes.
>> >
>> >   Considering follow flavors:
>> >        Flavor1: name=flv1, vendor_id=1
>> >	Flavor2: name=flv2, vendor_id=1, group=grp1
>> >	Flavor3: name=flv3, group=grp1.
>> >
>> >   Dev1 will never be assigned to flv1.
>> >   This solution tries to separate the devices managed by Nova
>> >exclusively from the devices managed by Nova/libvirt together.
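The 'special attribute' rule in solution (b) can be sketched as follows, as I read it; the data layout and helper name are assumptions for illustration, not Nova code:

```python
# Illustrative sketch: a device carrying a "special" attribute (here,
# 'group') only matches flavors that name that attribute explicitly,
# while ordinary attributes match as usual.
SPECIAL_ATTRS = {"group"}

def flavor_matches(device, flavor_spec):
    # Every attribute the flavor asks for must match the device ...
    for key, value in flavor_spec.items():
        if key == "name":
            continue
        if device.get(key) != value:
            return False
    # ... and every special attribute the device carries must be
    # named explicitly by the flavor.
    for key in SPECIAL_ATTRS & device.keys():
        if key not in flavor_spec:
            return False
    return True

# The two devices and three flavors from the example above.
dev1 = {"bdf": "00:0.1", "vendor_id": "1", "device_id": "1",
        "group": "grp1"}
dev2 = {"bdf": "00:1.1", "vendor_id": "1", "device_id": "2"}
flv1 = {"name": "flv1", "vendor_id": "1"}
flv2 = {"name": "flv2", "vendor_id": "1", "group": "grp1"}
flv3 = {"name": "flv3", "group": "grp1"}
```

Under this rule Dev1 matches flv2 and flv3 but never flv1, since flv1 does not name the special 'group' attribute that Dev1 carries.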
>> >
>> >Any idea?
>> >
>> >Thanks
>> >--jyh
>> >
>> >

More information about the OpenStack-dev mailing list