[openstack-dev] [nova] [neutron] PCI pass-through network support

Jiang, Yunhong yunhong.jiang at intel.com
Fri Jan 10 18:30:20 UTC 2014

Robert, thanks for your reply. Please see my replies prefixed with 'yjiang5'.


From: Robert Li (baoli) [mailto:baoli at cisco.com]
Sent: Friday, January 10, 2014 9:16 AM
To: Jiang, Yunhong; OpenStack Development Mailing List (not for usage questions); Irena Berezovsky; Sandhya Dasu (sadasu); Itzik Brown; john at johngarbutt.com; He, Yongli
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

Hi Yunhong,

I appreciate your comments. Please see inline...


On 1/10/14 1:40 AM, "Jiang, Yunhong" <yunhong.jiang at intel.com> wrote:

Robert, sorry, but I'm not a fan of the *your group* term. To me, *your group* mixes two things. It's an extra property provided by configuration, and it's also a very inflexible mechanism for selecting devices (you can only select devices based on the 'group name' property).

1)      A dynamic group is much better. For example, a user may want to select a GPU device based on vendor_id, or on vendor_id+device_id. In other words, the user wants to create groups based on vendor_id, or vendor_id+device_id, and select devices from these groups.  John's proposal to provide an API to create the PCI flavor (or alias) is very good. I prefer flavor because it's more OpenStack style.

I'm not sure what you mean by a dynamic group. But a PCI group can be dynamically created on the controller. The whitelist definition allows the grouping based on vendor_id or vendor_id + product_id, etc. The name of PCI group makes more sense in terms of SRIOV, but the name of PCI flavor may make more sense for GPU because a user may want something from a specific vendor as you have indicated.

So far, our discussion has been largely based on the infrastructure that currently exists in nova, or largely confined within the existing PCI passthrough implementation. If my understanding is correct, then devices belonging to different aliases shouldn't overlap. Otherwise, the stats accounting would become useless. So the question is whether we allow devices to be classified into different aliases at the same time. If the answer is yes, then some fundamental change would be required.

[yjiang5] No, the devices belonging to different aliases can overlap. The alias (or flavor) is purely a definition of the PCI property requirements; that's the reason I think pci_flavor is a much better name. Why do you think stats accounting is useless if devices overlap?

Talking about the flexibility you mentioned earlier, let me try to describe this if I understand you correctly:
         -- whitelist defines devices available in a compute node. The collection of them determines all the devices available in a cloud.
         -- At any time, PCI groups (or PCI flavors) can be defined on the controller; each defines criteria (in terms of vendor_id, product_id, bdf, etc) to locate a particular device.

I don't think it's a bad idea. But would it require the controller to manage all the PCI devices available in the cloud? And/or how would stats be managed per PCI flavor? Can we clearly define how to enable this maximum flexibility? It's certainly not there today.

[yjiang5] The stats have nothing to do with the PCI flavor. The stats give the status of the devices in the cloud, and the PCI flavor is *just* for user requirements, like the instance flavor.
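
To illustrate the point that flavors are only match criteria while stats describe devices, here is a minimal sketch. It is not nova code: the device records, the flavor names, the ID values, and the helpers `build_stats` and `available` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical device records as a compute node might report them.
# The vendor/product IDs are placeholder values.
devices = [
    {"vendor_id": "8086", "product_id": "10ed", "address": "0000:06:10.0"},
    {"vendor_id": "8086", "product_id": "10ed", "address": "0000:06:10.1"},
    {"vendor_id": "8086", "product_id": "1520", "address": "0000:07:10.0"},
    {"vendor_id": "15b3", "product_id": "1004", "address": "0000:08:00.1"},
]


def build_stats(devs):
    """Count devices per (vendor_id, product_id); no flavor is involved."""
    return Counter((d["vendor_id"], d["product_id"]) for d in devs)


# Flavors are only match criteria, so two flavors may cover the same devices.
flavors = {
    "any_vendor_8086": {"vendor_id": "8086"},
    "vendor_8086_10ed": {"vendor_id": "8086", "product_id": "10ed"},
}


def available(flavor_spec, stats):
    """Sum the pools whose properties satisfy every key in the flavor spec."""
    total = 0
    for (vendor_id, product_id), count in stats.items():
        pool = {"vendor_id": vendor_id, "product_id": product_id}
        if all(pool.get(key) == value for key, value in flavor_spec.items()):
            total += count
    return total


stats = build_stats(devices)
print(available(flavors["any_vendor_8086"], stats))   # 3
print(available(flavors["vendor_8086_10ed"], stats))  # 2
```

In this sketch the two flavors overlap on the same two devices, yet the stats stay a plain per-device-type count; a flavor is evaluated against the stats only when a request arrives.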

2)      As for the second aspect of your 'group', I understand it as an extra property provided by configuration.  I don't think we should put it into the whitelist, whose purpose is to configure which devices are assignable.  I'd add another configuration option to provide extra attributes to devices. When nova compute is up, it will parse this configuration and add the attributes to the corresponding PCI devices. I don't think adding another configuration option will cause too much trouble for deployment. OpenStack already has a lot of configuration items :)
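
One way to picture such an extra-attribute option is the sketch below. The option layout (a list of match/attrs pairs), the attribute name `physical_network`, and the helper `apply_extra_info` are assumptions for illustration only; no such format has been agreed in this thread.

```python
# Hypothetical extra-attribute configuration: each entry pairs match criteria
# with attributes to attach. The layout is an assumption, not an agreed format.
pci_extra_info = [
    {
        "match": {"vendor_id": "8086", "product_id": "10ed"},
        "attrs": {"physical_network": "physnet1"},
    },
]


def apply_extra_info(devices, extra_info):
    """Return copies of the devices with matching extra attributes merged in."""
    tagged = []
    for dev in devices:
        merged = dict(dev)
        for entry in extra_info:
            if all(dev.get(k) == v for k, v in entry["match"].items()):
                merged.update(entry["attrs"])
        tagged.append(merged)
    return tagged


# Placeholder devices, as they might look after whitelist filtering.
devices = [
    {"vendor_id": "8086", "product_id": "10ed", "address": "0000:06:10.0"},
    {"vendor_id": "15b3", "product_id": "1004", "address": "0000:08:00.1"},
]

tagged = apply_extra_info(devices, pci_extra_info)
print(tagged[0].get("physical_network"))  # physnet1
print(tagged[1].get("physical_network"))  # None
```

The whitelist stays responsible only for which devices are assignable; the extra attributes are layered on afterwards and can then be used as selection criteria like any other property.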

Not sure how exactly it's going to be done. But the patches that Yongli has posted seem to be adding the pci-flavor into the whitelist. We are just trying to see the pci-flavor from a different angle (as posted in this thread), and that would make things a lot different.

[yongli] I don't like yongli's patch either :) I think his implementation is different from John's pci-flavor design.

3)      I think we are currently mixing the neutron and nova designs. To me, Neutron SRIOV support is a user of nova PCI support. Thus we should first analyze the requirements that neutron PCI support places on nova PCI support in a more generic way, and then we can discuss how to enhance the nova PCI support or, if you want, redesign it. IMHO, if we don't consider networking, the current implementation should be OK.

I don't see that we are trying to mix the design. But I agree that we should provide SRIOV requirements, which we have already discussed in our previous threads. Let me try it here, and folks, please add yours if I'm missing anything:
           1. An SRIOV device can be used as a NIC to be attached to a VM (or domain). This implies that a PCI passthrough device is recognized as an SRIOV device and corresponding networking handling as required by the domain is performed to attach it to the VM as a NIC.
[yjiang5] "Is recognized as an SRIOV device" can be achieved by checking the "PF" property of the PCI device, or simply by checking the vendor_id/device_id.  What's the exact meaning of "corresponding networking handling"? Is it about the nova PCI device design? I'd think of it as Neutron-specific functionality. Otherwise, you need to translate it into a PCI device requirement.
           2. An SRIOV device should be selected to be attached to a VM based on the VM's network connectivity.
[yjiang5] This requirement is not clear. So the key is "based on the VM's network connectivity", right? If yes, can I translate the requirement to "need to specify the network connectivity property for a PCI device and track that information"? If yes, it needs more information as a requirement: how is the VM's network connectivity information provided, statically at installation time or dynamically decided on the neutron side? Will it change whenever the device is attached to or detached from an instance, or will it be stable? After answering these questions, we can check whether the user-defined property (the proposed pci_extra_info configuration) meets this requirement.
           3. If a VM has multiple SRIOV NICs, it should be possible to locate the SRIOV device assigned to the corresponding NIC.
[yjiang5] Do you mean locating the SRIOV device for a given NIC, or locating the NIC for a given SRIOV device?
If the former, it's not about PCI support, but a neutron/network issue of how to track the PCI information in the NIC definition.
If the latter, then it means we need to extend the PCI device object to track not only the allocation status (free/assigned), but also the allocation information (NIC name).
           4. An SRIOV-capable compute node may not be used as a host for VMs that don't require SRIOV capability.
[yjiang5] IMHO, a) This is just scheduler policy and should be achieved through a specific scheduler filter.
b) From a design point of view, nova PCI support need only provide a method to get PCI information on a compute node and a method to get PCI request information at instance creation. How to implement the filter is an implementation detail of that specific filter.
           5. Specifically as required by 2 & 3, pci-flavor (or pci-alias, or pci-group, whatever it's called) should be allowed in -nic and neutron commands.
[yjiang5] I like this type of requirement, clear and specific :) This is about how to fetch the PCI request information. Currently we only get it through the instance flavor extra specs. With the neutron support, we need to fetch such information from the NIC side. There may be more sources of requests; for example, an image property may require specific devices such as encryption devices.

When exploring the existing nova PCI passthrough, we figured out how to meet those requirements, and as a result we started the conversation. SRIOV requirements would certainly influence the overall PCI passthrough design, I presume. The bottom line is that we want those requirements to be met.

4)      IMHO, the core of nova PCI support is the *PCI property*. The properties include not only generic PCI device properties like vendor id, device id, and device type, and compute-specific properties like the BDF address and the adjacent switch IP address, but also user-defined properties like neutron's physical net name. Then it's about how to get these properties, how to select/group devices based on them, and how to store/fetch them.

I agree. But that's exactly what we are trying to accomplish.


From: Robert Li (baoli) [mailto:baoli at cisco.com]
Sent: Thursday, January 09, 2014 8:49 AM
To: OpenStack Development Mailing List (not for usage questions); Irena Berezovsky; Sandhya Dasu (sadasu); Jiang, Yunhong; Itzik Brown; john at johngarbutt.com; He, Yongli
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

Hi Folks,

With John joining the IRC, so far, we had a couple of productive meetings in an effort to come to consensus and move forward. Thanks John for doing that, and I appreciate everyone's effort to make it to the daily meeting. Let's reconvene on Monday.

But before that, and based on today's conversation on IRC, I'd like to say a few things. I think that first of all, we need to get agreement on the terminology that we have been using so far. With the current nova PCI passthrough:

        PCI whitelist: defines all the available PCI passthrough devices on a compute node. pci_passthrough_whitelist=[{ "vendor_id":"xxxx","product_id":"xxxx"}]
        PCI Alias: criteria defined on the controller node with which requested PCI passthrough devices can be selected from all the PCI passthrough devices available in a cloud.
                Currently it has the following format: pci_alias={"vendor_id":"xxxx", "product_id":"xxxx", "name":"str"}

        nova flavor extra_specs: requests for PCI passthrough devices can be specified with extra_specs, for example in the format: "pci_passthrough:alias"="name:count"

As you can see, currently a PCI alias has a name and is defined on the controller. The implication is that when matching it against the PCI devices, it has to match the vendor_id and product_id against all the available PCI devices until one is found. The name is only used for reference in the extra_specs. On the other hand, the whitelist is basically the same as the alias without a name.
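
The alias lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the nova implementation: the helpers `parse_alias_request` and `pick_devices`, the device list, and the ID values are all hypothetical; only the `pci_alias` and `"name:count"` formats come from the description above.

```python
import json

# Values in the formats quoted above; IDs and the alias name are placeholders.
pci_alias = json.loads(
    '{"vendor_id": "8086", "product_id": "10ed", "name": "a1"}'
)
extra_specs = {"pci_passthrough:alias": "a1:2"}

# Hypothetical view of all passthrough devices available in the cloud.
devices = [
    {"vendor_id": "8086", "product_id": "10ed", "address": "0000:06:10.0"},
    {"vendor_id": "15b3", "product_id": "1004", "address": "0000:08:00.1"},
    {"vendor_id": "8086", "product_id": "10ed", "address": "0000:06:10.1"},
]


def parse_alias_request(spec):
    """Split a 'name:count' string into (name, count)."""
    name, count = spec.rsplit(":", 1)
    return name, int(count)


def pick_devices(alias, devs, count):
    """Scan all devices and return the first `count` that match the alias."""
    matches = [
        d for d in devs
        if d["vendor_id"] == alias["vendor_id"]
        and d["product_id"] == alias["product_id"]
    ]
    if len(matches) < count:
        raise LookupError("not enough devices for alias %s" % alias["name"])
    return matches[:count]


name, count = parse_alias_request(extra_specs["pci_passthrough:alias"])
if name == pci_alias["name"]:
    chosen = pick_devices(pci_alias, devices, count)
    print([d["address"] for d in chosen])
```

Note that the alias name plays no part in the matching itself; it only links the extra_specs request to the vendor_id/product_id criteria.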

What we have discussed so far is based on something called PCI groups (or PCI flavors as Yongli puts it). Without introducing other complexities, and with a little change of the above representation, we will have something like:

pci_passthrough_whitelist=[{ "vendor_id":"xxxx","product_id":"xxxx", "name":"str"}]

By doing so, we eliminate the PCI alias. We call the "name" above a PCI group name. You can think of it as combining the definitions of the existing whitelist and PCI alias. And believe it or not, a PCI group is actually a PCI alias. However, with that change of thinking, a lot of benefits can be harvested:

         * the implementation is significantly simplified
         * provisioning is simplified by eliminating the PCI alias
         * a compute node only needs to report stats with something like: PCI group name:count. A compute node processes all the PCI passthrough devices against the whitelist, and assigns a PCI group based on the whitelist definition.
         * on the controller, we may only need to define the PCI group names. If we use a nova API to define PCI groups (which could be private or public, for example), one potential benefit, among other things (validation, etc), is that they can be owned by the tenant that creates them. And thus a wholesale of PCI passthrough devices is also possible.
         * scheduler only works with PCI group names.
         * request for PCI passthrough device is based on PCI-group
         * deployers can provision the cloud based on the PCI groups
         * Particularly for SRIOV, deployers can design SRIOV PCI groups based on network connectivity.
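
As a rough sketch of the compute-node side of this proposal: each device is matched against the named whitelist entries and only "PCI group name:count" is reported. The helpers `group_of` and `group_stats`, the group names, and the ID values are assumptions for illustration, and first-match-wins is an assumed rule that has not been discussed here.

```python
from collections import Counter

# Whitelist entries in the proposed form: match criteria plus a group name.
# Group names and IDs are placeholders.
pci_passthrough_whitelist = [
    {"vendor_id": "8086", "product_id": "10ed", "name": "sriov_physnet1"},
    {"vendor_id": "10de", "product_id": "0ff2", "name": "gpu"},
]

# Hypothetical devices discovered on one compute node.
devices = [
    {"vendor_id": "8086", "product_id": "10ed", "address": "0000:06:10.0"},
    {"vendor_id": "8086", "product_id": "10ed", "address": "0000:06:10.1"},
    {"vendor_id": "10de", "product_id": "0ff2", "address": "0000:09:00.0"},
    {"vendor_id": "15b3", "product_id": "1004", "address": "0000:08:00.1"},
]


def group_of(device, whitelist):
    """Return the group name of the first matching entry, or None."""
    for entry in whitelist:
        criteria = {k: v for k, v in entry.items() if k != "name"}
        if all(device.get(k) == v for k, v in criteria.items()):
            return entry["name"]
    return None


def group_stats(devs, whitelist):
    """Report 'PCI group name: count' for the whitelisted devices only."""
    names = (group_of(d, whitelist) for d in devs)
    return Counter(n for n in names if n is not None)


print(dict(group_stats(devices, pci_passthrough_whitelist)))
```

With this shape the scheduler would only ever compare group names and counts; a device that matches no whitelist entry is simply not reported.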

Further, to support SRIOV, we are saying that PCI group names can be used not only in the extra specs, but also in the -nic option and the neutron commands. This allows the most flexibility and functionality afforded by SRIOV.

Further, we are saying that we can define default PCI groups based on the PCI device's class.

For vnic-type (or nic-type), we are saying that it defines the link characteristics of the nic that is attached to a VM: a nic that's connected to a virtual switch, a nic that is connected to a physical switch, or a nic that is connected to a physical switch, but has a host macvtap device in between. The actual names of the choices are not important here, and can be debated.

I'm hoping that we can go over the above on Monday. But any comments are welcome by email.

