[openstack-dev] [nova] [neutron] PCI pass-through network support

Jiang, Yunhong yunhong.jiang at intel.com
Fri Jan 17 19:02:00 UTC 2014


Robert, thanks for your long reply. Personally I'd prefer option 2/3, as they keep Nova the only entity for PCI management.

Glad you are OK with Ian's proposal and that we have a solution to resolve the libvirt network scenario within that framework.

Thanks
--jyh

> -----Original Message-----
> From: Robert Li (baoli) [mailto:baoli at cisco.com]
> Sent: Friday, January 17, 2014 7:08 AM
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network
> support
> 
> Yunhong,
> 
> Thank you for bringing up the live migration support. In addition
> to the two solutions you mentioned, Irena has a different solution. Let me
> put all of them here again:
>     1. network xml/group based solution.
>        In this solution, each host that supports a provider net/physical
> net can define a SRIOV group (it's hard to avoid the term as you can see
> from the suggestion you made based on the PCI flavor proposal). For each
> SRIOV group supported on a compute node, a network XML will be created
> the first time the nova compute service runs on that node.
>         * nova will conduct scheduling, but not PCI device allocation
>         * it's a simple and clean solution, documented in libvirt as the
> way to support live migration with SRIOV. In addition, a network xml is
> nicely mapped into a provider net.
>     2. network xml per PCI device based solution
>        This is the solution you brought up in this email, and Ian
> mentioned it to me as well. In this solution, a network xml is created
> when a VM is created, and it needs to be removed once the VM is
> removed. This hasn't been tried out as far as I know. (See the sketch
> after this list.)
>     3. interface xml/interface rename based solution
>        Irena brought this up. In this solution, the ethernet interface
> name corresponding to the PCI device attached to the VM needs to be
> renamed. One way to do so without requiring a system reboot is to change
> the udev rules file for interface renaming, followed by a udev reload.
> (See the sketch after this list.)
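> 
>        To make 2 and 3 a bit more concrete, here is a rough sketch
> (interface names, the MAC address and the file path are placeholders,
> not part of any agreed design). Solution 2 would create a per-VM network
> holding a single macvtap-capable interface, e.g.:
> 
>          <network>
>            <name>tmp-net-single-vf</name>
>            <forward mode="bridge">
>              <interface dev="eth21"/>
>            </forward>
>          </network>
> 
>        while solution 3 would drop a rename rule into something like
> /etc/udev/rules.d/70-persistent-net.rules and then reload the rules
> (e.g. with udevadm control --reload-rules):
> 
>          SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="fa:16:3e:11:22:33", NAME="eth20"
> 
>        Solution 1's network definition is essentially the group_name1
> <network> example in your mail quoted below.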
> 
> Now, with the first solution, Nova doesn't seem to have control over or
> visibility of the PCI device allocated for the VM before the VM is
> launched. This needs to be confirmed with the libvirt folks to see if
> such a capability can be provided. This may be a potential drawback if a
> neutron plugin requires detailed PCI device information for operation.
> Irena may provide more insight into this. Ideally, neutron shouldn't need
> this information because the device configuration can be done by libvirt
> invoking the PCI device driver.
> 
> The other two solutions are similar. For example, you can view the second
> solution as one way to rename an interface, or camouflage an interface
> under a network name. They both require additional work before the VM is
> created and after the VM is removed.
> 
> I also agree with you that we should take a look at XenAPI on this.
> 
> 
> With regard to your suggestion on how to implement the first solution with
> some predefined group attribute, I think it definitely can be done. As I
> pointed out earlier, the PCI flavor proposal is actually a
> generalized version of the PCI group. In other words, in the PCI group
> proposal, we have one predefined attribute called PCI group, and
> everything else works on top of that. In the PCI flavor proposal,
> attributes are arbitrary. So certainly we can define a particular attribute
> for networking, which let's temporarily call sriov_group. But I can see
> that with this idea of predefined attributes, more of them will be required
> by different types of devices in the future. I'm sure it will keep us busy,
> although I'm not sure in a good way.
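> 
> For illustration only (names and values hypothetical), the two proposals
> would describe the same set of VFs roughly as:
>     PCI group proposal:  group=phynet1
>     PCI flavor proposal: name=phynet1-vf, vendor_id=8086, sriov_group=phynet1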
> 
> I was expecting that you or someone else could provide a practical
> deployment scenario that would justify the flexibility and the complexity.
> Although I'd prefer to keep it simple and generalize it later once a
> particular requirement is clearly identified, I'm fine with going with it
> if that's what most of the folks want to do.
> 
> --Robert
> 
> 
> 
> On 1/16/14 8:36 PM, "yunhong jiang" <yunhong.jiang at linux.intel.com>
> wrote:
> 
> >On Thu, 2014-01-16 at 01:28 +0100, Ian Wells wrote:
> >> To clarify a couple of Robert's points, since we had a conversation
> >> earlier:
> >> On 15 January 2014 23:47, Robert Li (baoli) <baoli at cisco.com> wrote:
> >>           ---  do we agree that BDF address (or device id, whatever
> >>         you call it), and node id shouldn't be used as attributes in
> >>         defining a PCI flavor?
> >>
> >>
> >> Note that the current spec doesn't actually exclude it as an option.
> >> It's just an unwise thing to do.  In theory, you could elect to define
> >> your flavors using the BDF attribute but determining 'the card in this
> >> slot is equivalent to all the other cards in the same slot in other
> >> machines' is probably not the best idea...  We could lock it out as an
> >> option or we could just assume that administrators wouldn't be daft
> >> enough to try.
> >>
> >>
> >>                 * the compute node needs to know the PCI flavor.
> >>         [...]
> >>                           - to support live migration, we need to
> >>         use it to create network xml
> >>
> >>
> >> I didn't understand this at first and it took me a while to get what
> >> Robert meant here.
> >>
> >> This is based on Robert's current code for macvtap based live
> >> migration.  The issue is that if you wish to migrate a VM and it's
> >> tied to a physical interface, you can't guarantee that the same
> >> physical interface is going to be used on the target machine, but at
> >> the same time you can't change the libvirt.xml as it comes over with
> >> the migrating machine.  The answer is to define a network and refer
> >> out to it from libvirt.xml.  In Robert's current code he's using the
> >> group name of the PCI devices to create a network containing the list
> >> of equivalent devices (those in the group) that can be macvtapped.
> >> Thus when the host migrates it will find another, equivalent,
> >> interface.  This falls over in the use case under consideration where
> >> a device can be mapped using more than one flavor, so we have to
> >> discard the use case or rethink the implementation.
> >>
> >> There's a more complex solution - I think - where we create a
> >> temporary network for each macvtap interface a machine's going to
> >> use, with a name based on the instance UUID and port number, and
> >> containing the device to map.  Before starting the migration we would
> >> create a replacement network containing only the new device on the
> >> target host;
> >> migration would find the network from the name in the libvirt.xml, and
> >> the content of that network would behave identically.  We'd be
> >> creating libvirt networks on the fly and a lot more of them, and we'd
> >> need decent cleanup code too ('when freeing a PCI device, delete any
> >> network it's a member of'), so it all becomes a lot more hairy.
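> >>
> >> A minimal sketch of such a per-instance network, purely illustrative
> >> (UUID stands for the instance UUID; the device name is an assumption,
> >> nothing here is settled design):
> >>
> >>   <network>
> >>     <name>inst-UUID-port0</name>
> >>     <forward mode="bridge">
> >>       <interface dev="eth22"/>
> >>     </forward>
> >>   </network>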
> >
> >Ian/Robert, below is my understanding of the method Robert wants to use;
> >am I right?
> >
> >a) Define a libvirt network as described in the "Using a macvtap
> >'direct' connection" section of http://libvirt.org/formatnetwork.html .
> >For example, like the following one:
> ><network>
> >        <name> group_name1 </name>
> >        <forward mode="bridge">
> >          <interface dev="eth20"/>
> >          <interface dev="eth21"/>
> >          <interface dev="eth22"/>
> >          <interface dev="eth23"/>
> >          <interface dev="eth24"/>
> >        </forward>
> >      </network>
> >
> >
> >b) When assigning SRIOV NIC devices to an instance, as in the
> >"Assignment from a pool of SRIOV VFs in a libvirt <network> definition"
> >section of
> >http://wiki.libvirt.org/page/Networking#PCI_Passthrough_of_host_network_devices ,
> >use the libvirt network definition group_name1. For example, like the
> >following one:
> >
> >  <interface type='network'>
> >    <source network='group_name1'/>
> >  </interface>
> >
> >
> >If my understanding is correct, then a couple of things are still
> >unclear to me:
> >a) How will the libvirt network (i.e. group_name1) be created? Will it
> >be created when the compute node boots up, or before instance creation?
> >I suppose, per Robert's design, it's created when the compute node is
> >up; am I right?
> >
> >b) If all the interfaces are used up by instances, what will happen?
> >Considering that 4 interfaces are allocated to the group_name1 libvirt
> >network, what will happen if a user tries to migrate 6 instances with
> >the 'group_name1' network?
> >
> >And below are my comments:
> >
> >a) Yes, this is in fact different from the current nova PCI support
> >philosophy. Currently we assume Nova owns the devices and manages the
> >device assignment to each instance, while in this situation the libvirt
> >network is in fact another layer of PCI device management (although a
> >very thin one)!
> >
> >b) This also reminds me that other VMMs like XenAPI possibly have
> >special requirements, and we need input/confirmation from them as well.
> >
> >
> >As for how to resolve the issue, I think there are several solutions:
> >
> >a) Create one libvirt network for each SRIOV NIC assigned to each
> >instance, i.e. the libvirt network always includes only one interface;
> >it may be created statically or dynamically. This solution in fact
> >removes the allocation functionality of the libvirt network and leaves
> >only its configuration functionality.
> >
> >b) Change Nova PCI to support a special type of PCI device attribute
> >(like the PCI group). For such attributes, the PCI device scheduler
> >will match a PCI device only if the attribute is specified explicitly
> >in the PCI flavor.
> >  Below is an example, considering two PCI SRIOV devices:
> >    Dev1: BDF=00:0.1, vendor_id=1, device_id=1, group=grp1
> >    Dev2: BDF=00:1.1, vendor_id=1, device_id=2
> >  i.e. Dev2 has no group attribute specified.
> >
> >  And we mark the 'group' attribute as a special attribute.
> >
> >  Considering the following flavors:
> >    Flavor1: name=flv1, vendor_id=1
> >    Flavor2: name=flv2, vendor_id=1, group=grp1
> >    Flavor3: name=flv3, group=grp1
> >
> >  Dev1 will never be assigned to flv1 (only to flv2 or flv3), because
> >flv1 does not specify the group attribute.
> >  This solution tries to separate the devices managed by Nova
> >exclusively from the devices managed by Nova and libvirt together.
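> >
> >  A minimal sketch of that matching rule, purely for illustration (plain
> >Python with dicts standing in for real Nova objects; none of this is
> >existing Nova code):
> >
> >    SPECIAL_ATTRS = {'group'}
> >
> >    def flavor_matches(flavor, device):
> >        # every attribute the flavor specifies must match the device
> >        for key, value in flavor.items():
> >            if key != 'name' and device.get(key) != value:
> >                return False
> >        # a device carrying a special attribute is only eligible if the
> >        # flavor names that attribute explicitly
> >        return all(attr in flavor
> >                   for attr in SPECIAL_ATTRS if attr in device)
> >
> >    dev1 = {'vendor_id': '1', 'device_id': '1', 'group': 'grp1'}
> >    dev2 = {'vendor_id': '1', 'device_id': '2'}
> >    flv1 = {'name': 'flv1', 'vendor_id': '1'}
> >    flv2 = {'name': 'flv2', 'vendor_id': '1', 'group': 'grp1'}
> >    flv3 = {'name': 'flv3', 'group': 'grp1'}
> >
> >    assert not flavor_matches(flv1, dev1)  # dev1 reserved for group-aware flavors
> >    assert flavor_matches(flv2, dev1) and flavor_matches(flv3, dev1)
> >    assert flavor_matches(flv1, dev2) and not flavor_matches(flv2, dev2)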
> >
> >Any idea?
> >
> >Thanks
> >--jyh
> >
> >
> 
> 


