[openstack-dev] [nova] [neutron] PCI pass-through network support

Jiang, Yunhong yunhong.jiang at intel.com
Fri Jan 10 19:13:22 UTC 2014

Ian, thanks for your reply. Please see my responses, prefixed with '[yjiang5]'.


From: Ian Wells [mailto:ijw.ubuntu at cack.org.uk]
Sent: Friday, January 10, 2014 4:08 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

On 10 January 2014 07:40, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:
Robert, sorry that I'm not a fan of the *your group* term. To me, *your group* mixes two things: it is an extra property provided by configuration, and it is also a very inflexible mechanism for selecting devices (you can only select devices based on the 'group name' property).

It is exactly that.  It's 0 new config items, 0 new APIs, just an extra tag on the whitelists that are already there (although the proposal suggests changing the name of them to be more descriptive of what they now do).  And you talk about flexibility as if this changes frequently, but in fact the grouping / aliasing of devices almost never changes after installation, which is, not coincidentally, when the config on the compute nodes gets set up.

1)       A dynamic group is much better. For example, a user may want to select a GPU device based on vendor_id, or on vendor_id+device_id. In other words, the user wants to create groups based on vendor_id, or vendor_id+device_id, and select devices from these groups.  John's proposal to provide an API to create the PCI flavor (or alias) is very good. I prefer 'flavor' because it's more OpenStack style.
I disagree with this.  I agree that what you're saying offers more flexibility after initial installation, but I have various issues with it.
[yjiang5] I think you are mostly talking about the whitelist rather than the PCI flavor. The PCI flavor is about the PCI request, e.g. I want a device with "vendor_id = cisco, device_id = 15454E", or 'vendor_id=intel device_class=nic' (because the image has the driver for all Intel NIC cards :) ). The whitelist, by contrast, decides which devices are assignable on a host.
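
The distinction between the two roles can be sketched in a few lines of Python. This is only an illustration: the device records, the property names (taken from the examples in this mail rather than from nova's schema), and the helpers matches, assignable and satisfy are all hypothetical.

```python
# Hypothetical device records; property names follow the examples in this
# thread and are NOT nova's actual schema.
devices = [
    {"address": "0000:01:00.0", "vendor_id": "intel", "device_class": "nic"},
    {"address": "0000:02:00.0", "vendor_id": "cisco", "device_id": "15454E"},
    {"address": "0000:03:00.0", "vendor_id": "acme", "device_class": "gpu"},
]


def matches(device, spec):
    """True if the device carries every property/value pair in spec."""
    return all(device.get(key) == value for key, value in spec.items())


def assignable(devices, whitelist):
    """Whitelist role: host-side decision about which devices may be assigned."""
    return [d for d in devices if any(matches(d, spec) for spec in whitelist)]


def satisfy(pool, flavor):
    """Flavor role: request-side selection among already-assignable devices."""
    return [d for d in pool if matches(d, flavor)]


# Static, per-host configuration (hypothetical values).
whitelist = [{"vendor_id": "intel"}, {"vendor_id": "cisco"}]
pool = assignable(devices, whitelist)

# A user request expressed purely in terms of device properties.
intel_nics = satisfy(pool, {"vendor_id": "intel", "device_class": "nic"})
```

In this sketch the whitelist is evaluated once per host, while any number of flavors can be evaluated later against the resulting pool without touching the host configuration.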

This is directly related to the hardware configuration on each compute node.  For (some) other things of this nature, like provider networks, the compute node is the only thing that knows what it has attached to it, and it is the store (in configuration) of that information.  If I add a new compute node then it's my responsibility to configure it correctly on attachment, but when I add a compute node (when I'm setting the cluster up, or sometime later on) then it's at that precise point that I know how I've attached it and what hardware it's got on it.  Also, it's at this point in time that I write out the configuration file (not by hand, note; there's almost certainly automation when configuring hundreds of nodes so arguments that 'if I'm writing hundreds of config files one will be wrong' are moot).

I'm also not sure there's much reason to change the available devices dynamically after that, since that's normally an activity that results from changing the physical setup of the machine which implies that actually you're going to have access to and be able to change the config as you do it.  John did come up with one case where you might be trying to remove old GPUs from circulation, but it's a very uncommon case that doesn't seem worth coding for, and it's still achievable by changing the config and restarting the compute processes.
[yjiang5] I totally agree with you that the whitelist is statically defined at provisioning time. I just want to separate the 'provider network' information into another piece of configuration (as extra information). The whitelist is just a whitelist that decides which devices are assignable. The provider network is information about the device; it's not in the scope of the whitelist.
This also reduces the autonomy of the compute node in favour of centralised tracking, which goes against the 'distributed where possible' philosophy of OpenStack.
Finally, you're not actually removing configuration from the compute node.  You still have to configure a whitelist there; in the grouping design you also have to configure grouping (flavouring) on the control node as well.  The groups proposal adds one extra piece of information to the whitelists that are already there to mark groups, not a whole new set of config lines.
[yjiang5] Still, the whitelist is there to decide which devices are assignable, not to provide device information. We should not mix the two functions in one piece of configuration. If it's OK, I simply want to discard the 'group' term :) The nova PCI flow is simple: the compute node provides PCI devices (based on the whitelist), the scheduler tracks the PCI device information (abstracted as pci_stats for performance reasons), and the API provides a way for the user to specify the devices they want (the PCI flavor). The current implementation needs enhancement at each step of the flow, but I really see no reason to have the "group". Yes, the 'PCI flavor' does in fact create a group based on PCI properties, but it's better expressed as a flavor.

To compare scheduling behaviour:

If I need 4G of RAM, each compute node has reported its summary of free RAM to the scheduler.  I look for a compute node with 4G free, and filter the list of compute nodes down.  This is a query on n records, n being the number of compute nodes.  I schedule to the compute node, which then confirms it does still have 4G free and runs the VM or rejects the request.
If I need 3 PCI devices and use the current system, each machine has reported its device allocations to the scheduler.  With SRIOV multiplying up the number of available devices, it's reporting back hundreds of records per compute node to the schedulers, and the filtering activity is 3 queries on n * number of PCI devices in cloud records, which could easily end up in the tens or even hundreds of thousands of records for a moderately sized cloud.  The compute node also has a record of its device allocations which is also checked and updated before the final request is run.
If I need 3 PCI devices and use the groups system, each machine has reported its device *summary* to the scheduler.  With SRIOV multiplying up the number of available devices, it's still reporting one or a small number of categories, i.e. { net: 100}.  The difficulty of scheduling is a query on num groups * n records - fewer, in fact, if some machines have no passthrough devices.
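
The difference in record counts can be made concrete with a little arithmetic. The sketch below is illustrative only: the node count is an assumed figure, the 100 devices per node mirrors the { net: 100 } example above, and the function names are hypothetical.

```python
def per_device_records(nodes, devices_per_node):
    """Records the scheduler holds when every device is reported individually."""
    return nodes * devices_per_node


def summary_records(nodes, groups_per_node):
    """Records the scheduler holds when each node reports one count per group."""
    return nodes * groups_per_node


def hosts_with_capacity(summaries, group, count):
    """Filter hosts using only the per-group summary, e.g. {"net": 100}."""
    return [host for host, summary in summaries.items()
            if summary.get(group, 0) >= count]


nodes = 500  # assumed size of a moderately sized cloud
detailed = per_device_records(nodes, devices_per_node=100)  # 50000
summarised = summary_records(nodes, groups_per_node=1)      # 500

# Hypothetical summaries as three compute nodes might report them.
summaries = {"node1": {"net": 100}, "node2": {"net": 2}, "node3": {}}
candidates = hosts_with_capacity(summaries, "net", 3)
```

With these assumed numbers the per-device approach leaves the scheduler with 50000 records to search, against 500 for the summary approach; nodes with no passthrough devices, like node3 here, drop out of the filter immediately.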

[yjiang5] That's the reason we have pci_stats. pci_stats is a *summary* of PCI device information based on *selected* PCI properties, like vendor_id and device_id. If we assume all the VFs have the same vendor_id/device_id, they in fact become only one entry in pci_stats! However, we still keep detailed information like vendor_id/device_id in the scheduler for decision making, instead of the opaque 'group name'.  And with a configuration option to select which properties are used for pci_stats, like 'vendor_id' only, or 'vendor_id/device_id', it's much more flexible.  And if we extend Nova PCI to support user-defined properties, you can simply add a property like 'net' to all your assignable devices, and then configure 'net' as the only property used to build pci_stats; that's exactly the implementation of your idea!
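
As a rough sketch of this idea (not nova's actual pci_stats code), devices can be pooled by whichever properties are configured for the summary. The function below, the device records, and the user-defined 'net' property are all assumptions made for illustration.

```python
from collections import Counter


def pci_stats(devices, keys):
    """Count devices per distinct combination of the selected properties."""
    return Counter(tuple(dev.get(key) for key in keys) for dev in devices)


# Eight hypothetical SR-IOV VFs that share vendor_id/device_id and carry a
# user-defined 'net' property; all values are illustrative.
vfs = [
    {
        "address": "0000:03:10.%d" % i,
        "vendor_id": "8086",
        "device_id": "10ed",
        "net": "physnet1",
    }
    for i in range(8)
]

# Summarise on vendor_id/device_id: all eight VFs collapse into one entry.
by_ids = pci_stats(vfs, ["vendor_id", "device_id"])

# Summarise on the user-defined property only, which behaves like a group name.
by_net = pci_stats(vfs, ["net"])
```

Choosing ["net"] as the only key reproduces the group-style summary, while choosing ["vendor_id", "device_id"] keeps the hardware identity visible to the scheduler; in both cases the eight VFs occupy a single summary entry.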

You can see that there's quite a cost to be paid for having those flexible alias APIs.
4)       IMHO, the core of nova PCI support is the *PCI property*. Properties include not only generic PCI device properties like vendor id, device id and device type, and compute-specific properties like the BDF address or the adjacent switch IP address, but also user-defined properties like neutron's physical net name, etc. After that, it's about how to get these properties, how to select/group devices based on them, and how to store/fetch them.

The thing about this is that you don't always, or even often, want to select by property.  Some of these properties are just things that you need to tell Neutron, they're not usually keys for scheduling.
[yjiang5] Yes, that's the reason for pci_stats, which uses only selected properties for scheduling. But I don't want to fix the selected property to be only 'groupname'!
