[openstack-dev] [nova] [neutron] PCI pass-through network support

Irena Berezovsky irenab at mellanox.com
Mon Jan 13 12:58:23 UTC 2014

It's great news.
Thank you for bringing Bob's attention to this effort. I'll look for Bob on IRC to get the details.
And of course, core support improves our chances of getting PCI pass-through networking into Icehouse.


From: Ian Wells [mailto:ijw.ubuntu at cack.org.uk]
Sent: Monday, January 13, 2014 2:02 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

Irena, have a word with Bob (rkukura on IRC, East coast), he was talking about what would be needed already and should be able to help you.  Conveniently he's also core. ;)

On 12 January 2014 22:12, Irena Berezovsky <irenab at mellanox.com<mailto:irenab at mellanox.com>> wrote:
Hi John,
Thank you for taking the initiative and summing up the work that needs to be done to provide PCI pass-through network support.
The only item I think is missing is the neutron support for PCI pass-through. Currently we have the Mellanox plugin, which supports PCI pass-through assuming the embedded switch technology of Mellanox adapter cards. But in order to have fully integrated PCI pass-through networking support for the use cases Robert listed in his previous mail, generic neutron PCI pass-through support is required. This can be enhanced with vendor-specific tasks that may differ (Mellanox embedded switch vs. Cisco 802.1BR), but there is still the common part of being a PCI-aware mechanism driver.
I have already started on the definition for this part:
I also plan to start coding soon.

Depending on how it goes, I can also take the nova parts that integrate with neutron APIs from item 3.


-----Original Message-----
From: John Garbutt [mailto:john at johngarbutt.com<mailto:john at johngarbutt.com>]
Sent: Friday, January 10, 2014 4:34 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support
Apologies for this top post; I just want to move this discussion towards action.

I am traveling next week so it is unlikely that I can make the meetings. Sorry.

Can we please agree on some concrete actions, and who will do the coding?
This also means raising new blueprints for each item of work.
I am happy to review and eventually approve those blueprints, if you email me directly.

Ideas are taken from what we started to agree on, mostly written up here:

What doesn't need doing...

We have the PCI whitelist and PCI alias at the moment; let's keep those names the same for now.
I personally prefer PCI-flavor rather than PCI-alias, but let's discuss any rename separately.

We seemed happy with the current system (roughly) around GPU passthrough:
nova flavor-key <three_GPU_attached_30GB> set "pci_passthrough:alias"="large_GPU:1,small_GPU:2"
nova boot --image some_image --flavor <three_GPU_attached_30GB> <some_name>
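As a rough illustration of the existing request format (this is an illustrative sketch, not nova's actual parsing code), the extra_specs string above breaks down into (alias name, count) pairs:

```python
def parse_alias_spec(spec):
    """Parse a "pci_passthrough:alias" value like "large_GPU:1,small_GPU:2"
    into a list of (alias_name, count) tuples."""
    requests = []
    for entry in spec.split(","):
        name, _, count = entry.strip().partition(":")
        if not name or not count.isdigit():
            raise ValueError("malformed alias request: %r" % entry)
        requests.append((name, int(count)))
    return requests

print(parse_alias_spec("large_GPU:1,small_GPU:2"))
# [('large_GPU', 1), ('small_GPU', 2)]
```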

Again, we seemed happy with the current PCI whitelist.

Sure, we could optimise the scheduling, but again, please keep that a separate discussion.
Something in the scheduler needs to know how many of each PCI alias are available on each host.
How that information gets there can be changed at a later date.
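To make the scheduling requirement concrete, here is a minimal sketch (names are placeholders, not nova's scheduler classes) of the check the scheduler needs: does a host have enough free devices of each requested PCI alias?

```python
def host_satisfies(host_pci_stats, requests):
    """host_pci_stats: per-host counts, e.g. {"large_GPU": 3, "small_GPU": 1}
    requests: list of (alias_name, count) from the flavor extra_specs."""
    return all(host_pci_stats.get(name, 0) >= count
               for name, count in requests)

hosts = {
    "node1": {"large_GPU": 1, "small_GPU": 2},
    "node2": {"large_GPU": 0, "small_GPU": 4},
}
wanted = [("large_GPU", 1), ("small_GPU", 2)]
# Only hosts with enough free devices of every requested alias pass the filter.
print([h for h, stats in sorted(hosts.items()) if host_satisfies(stats, wanted)])
# ['node1']
```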

The PCI alias is in config, but it's probably better defined using host aggregates, or some custom API.
But let's leave that for now, and discuss it separately.
If the need arises, we can migrate away from the config.

What does need doing...

1) API & CLI changes for "nic-type", and associated tempest tests

* Add a user visible "nic-type" so users can express one of several network types.
* We need a default nic-type, for when the user doesn't specify one (might default to SRIOV in some cases)
* We can easily test the case where the default is virtual and the user expresses a preference for virtual
* Above is much better than not testing it at all.

nova boot --flavor m1.large --image <image_id>
  --nic net-id=<net-id-1>
  --nic net-id=<net-id-2>,nic-type=fast
  --nic net-id=<net-id-3>,nic-type=fast <vm-name>


neutron port-create
  --fixed-ip subnet_id=<subnet-id>,ip_address=
  --nic-type=<slow | fast | foobar>
nova boot --flavor m1.large --image <image_id> --nic port-id=<port-id>

Where nic-type is just an extra bit of metadata, a string that is passed to nova and the VIF driver.
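A hedged sketch of how such a --nic argument string could be parsed (the "virtual" default and the key names are assumptions for illustration, not settled API):

```python
def parse_nic_option(opt, default_type="virtual"):
    """Parse e.g. "net-id=net-id-2,nic-type=fast" into a dict,
    filling in a default nic-type when the user doesn't specify one."""
    fields = dict(kv.split("=", 1) for kv in opt.split(","))
    fields.setdefault("nic-type", default_type)
    return fields

print(parse_nic_option("net-id=net-id-2,nic-type=fast"))
# {'net-id': 'net-id-2', 'nic-type': 'fast'}
print(parse_nic_option("net-id=net-id-1"))
# {'net-id': 'net-id-1', 'nic-type': 'virtual'}
```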

2) Expand PCI alias information

We need extensions to PCI alias so we can group SRIOV devices better.

I still think we are yet to agree on a format, but I would suggest this as a starting point:

 { "devices": [
  {"vendor_id":"1137", "product_id":"0071", "address":"*", "attach-type":"direct"},
  {"vendor_id":"1137", "product_id":"0072", "address":"*", "attach-type":"direct"} ],
  "sriov_info": {} }

 { "devices": [
  {"vendor_id":"1137", "product_id":"0071", "address":"0:[1-50]:2:*", "attach-type":"macvtap"},
  {"vendor_id":"1234", "product_id":"0081", "address":"*", "attach-type":"direct"} ],
  "sriov_info": {
   "network_ids": ["net-id-1", "net-id-2"] } }

 { "devices": [
  {"vendor_id":"1137", "product_id":"0071", "address":"*", "attach-type":"direct"},
  {"vendor_id":"1234", "product_id":"0081", "address":"*", "attach-type":"direct"} ],
  "sriov_info": {
   "network_ids": ["*"]  # this means could attach to any network
  } }

The idea is that the VIF driver gets passed this info when network_info includes a NIC that matches.
Any other details, like VLAN id, would come from neutron, and passed to the VIF driver as normal.
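To illustrate the matching step, here is a minimal sketch of picking the alias entry for a concrete device. The address grammar ("*" wildcard, "[lo-hi]" range per colon-separated segment, decimal values) is an assumption based on the "0:[1-50]:2:*" example above, not an agreed format:

```python
import re

def segment_matches(pattern, value):
    """Match one address segment: "*", a "[lo-hi]" range, or a literal."""
    if pattern == "*":
        return True
    m = re.fullmatch(r"\[(\d+)-(\d+)\]", pattern)
    if m:
        return int(m.group(1)) <= int(value) <= int(m.group(2))
    return pattern == value

def address_matches(pattern, address):
    pparts, aparts = pattern.split(":"), address.split(":")
    return len(pparts) == len(aparts) and all(
        segment_matches(p, a) for p, a in zip(pparts, aparts))

def find_entry(entries, device):
    """Return the first alias entry matching the device, or None."""
    for e in entries:
        if (e["vendor_id"] == device["vendor_id"]
                and e["product_id"] == device["product_id"]
                and address_matches(e["address"], device["address"])):
            return e
    return None

entries = [{"vendor_id": "1137", "product_id": "0071",
            "address": "0:[1-50]:2:*", "attach-type": "macvtap"}]
dev = {"vendor_id": "1137", "product_id": "0071", "address": "0:7:2:1"}
print(find_entry(entries, dev)["attach-type"])
# macvtap
```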

3) Reading "nic_type" and doing the PCI passthrough of NIC user requests

Not sure we are agreed on this, but basically:
* network_info contains "nic-type" from neutron
* need to select the correct VIF driver
* need to pass matching PCI alias information to VIF driver
* neutron passes other details (like VLAN id) as before
* nova gives VIF driver an API that allows it to attach PCI devices that are in the whitelist to the VM being configured
* with all this, the VIF driver can do what it needs to do
* let's keep it simple, and expand it as the need arises
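The "select the correct VIF driver" step above could be a simple dispatch on the nic-type that neutron puts into network_info. A minimal sketch; the driver names are placeholders, not real nova classes:

```python
# Hypothetical registry mapping nic-type strings to VIF driver names.
VIF_DRIVERS = {
    "virtual": "LibvirtGenericVIFDriver",
    "fast": "SriovDirectVIFDriver",
}

def select_vif_driver(vif):
    """Pick a VIF driver based on the nic-type from network_info,
    falling back to the virtual (software switch) driver."""
    nic_type = vif.get("nic-type", "virtual")
    try:
        return VIF_DRIVERS[nic_type]
    except KeyError:
        raise ValueError("no VIF driver registered for nic-type %r" % nic_type)

print(select_vif_driver({"nic-type": "fast"}))
# SriovDirectVIFDriver
```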

4) Make changes to VIF drivers, so the above is implemented

Depends on (3)

These seem like some good steps to get the basics in place for PCI passthrough networking.
Once it's working, we can review it and see if there are things that need to evolve further.

Does that seem like a workable approach?
Who is willing to implement any of (1), (2) and (3)?


On 9 January 2014 17:47, Ian Wells <ijw.ubuntu at cack.org.uk<mailto:ijw.ubuntu at cack.org.uk>> wrote:
> I think I'm in agreement with all of this.  Nice summary, Robert.
> It may not be where the work ends, but if we could get this done the
> rest is just refinement.
> On 9 January 2014 17:49, Robert Li (baoli) <baoli at cisco.com<mailto:baoli at cisco.com>> wrote:
>> Hi Folks,
>> With John joining the IRC, so far, we had a couple of productive
>> meetings in an effort to come to consensus and move forward. Thanks
>> John for doing that, and I appreciate everyone's effort to make it to the daily meeting.
>> Let's reconvene on Monday.
>> But before that, and based on our today's conversation on IRC, I'd
>> like to say a few things. I think that first of all, we need to get
>> agreement on the terminologies that we are using so far. With the
>> current nova PCI passthrough
>>         PCI whitelist: defines all the available PCI passthrough
>> devices on a compute node. pci_passthrough_whitelist=[{
>> "vendor_id":"xxxx","product_id":"xxxx"}]
>>         PCI Alias: criteria defined on the controller node with which
>> requested PCI passthrough devices can be selected from all the PCI
>> passthrough devices available in a cloud.
>>                 Currently it has the following format:
>> pci_alias={"vendor_id":"xxxx", "product_id":"xxxx", "name":"str"}
>>         nova flavor extra_specs: request for PCI passthrough devices
>> can be specified with extra_specs in the format for
>> example:"pci_passthrough:alias"="name:count"
>> As you can see, currently a PCI alias has a name and is defined on
>> the controller. The implication is that when matching it against the
>> PCI devices, it has to match the vendor_id and product_id against
>> all the available PCI devices until one is found. The name is only
>> used for reference in the extra_specs. On the other hand, the
>> whitelist is basically the same as the alias without a name.
>> What we have discussed so far is based on something called PCI groups
>> (or PCI flavors as Yongli puts it). Without introducing other
>> complexities, and with a little change of the above representation,
>> we will have something
>> like:
>> pci_passthrough_whitelist=[{ "vendor_id":"xxxx","product_id":"xxxx",
>> "name":"str"}]
>> By doing so, we eliminated the PCI alias. And we call the "name" in
>> above as a PCI group name. You can think of it as combining the
>> definitions of the existing whitelist and PCI alias. And believe it
>> or not, a PCI group is actually a PCI alias. However, with that
>> change of thinking, a lot of benefits can be harvested:
>>          * the implementation is significantly simplified
>>          * provisioning is simplified by eliminating the PCI alias
>>          * a compute node only needs to report stats with something like:
>> PCI group name:count. A compute node processes all the PCI
>> passthrough devices against the whitelist, and assigns a PCI group
>> based on the whitelist definition.
>>          * on the controller, we may only need to define the PCI
>> group names. If we use a nova api to define PCI groups (could be
>> private or public, for example), one potential benefit, among other
>> things (validation, etc.), is that they can be owned by the tenant
>> that creates them. This also makes wholesaling of PCI passthrough
>> devices possible.
>>          * scheduler only works with PCI group names.
>>          * request for PCI passthrough device is based on PCI-group
>>          * deployers can provision the cloud based on the PCI groups
>>          * Particularly for SRIOV, deployers can design SRIOV PCI
>> groups based on network connectivity.
>> Further, to support SRIOV, we are saying that PCI group names can be
>> used not only in the extra specs, but also in the --nic option and
>> the neutron commands. This allows the most flexibility and
>> functionality afforded by SRIOV.
>> Further, we are saying that we can define default PCI groups based on
>> the PCI device's class.
>> For vnic-type (or nic-type), we are saying that it defines the link
>> characteristics of the nic that is attached to a VM: a nic that's
>> connected to a virtual switch, a nic that is connected to a physical
>> switch, or a nic that is connected to a physical switch, but has a
>> host macvtap device in between. The actual names of the choices are
>> not important here, and can be debated.
>> I'm hoping that we can go over the above on Monday. But any comments
>> are welcome by email.
>> Thanks,
>> Robert
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org<mailto:OpenStack-dev at lists.openstack.org>
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
