[openstack-dev] [nova] [neutron] PCI pass-through network support

Ian Wells ijw.ubuntu at cack.org.uk
Fri Jan 10 12:08:05 UTC 2014

On 10 January 2014 07:40, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:

>  Robert, sorry that I’m not a fan of the *your group* term. To me, *your
> group* mixes two things. It’s an extra property provided by configuration,
> and also it’s a very-not-flexible mechanism to select devices (you can only
> select devices based on the ‘group name’ property).

It is exactly that.  It's 0 new config items, 0 new APIs, just an extra tag
on the whitelists that are already there (although the proposal suggests
changing the name of them to be more descriptive of what they now do).  And
you talk about flexibility as if this changes frequently, but in fact the
grouping / aliasing of devices almost never changes after installation,
which is, not coincidentally, when the config on the compute nodes gets set.

>  1)       A dynamic group is much better. For example, user may want to
> select GPU device based on vendor id, or based on vendor_id+device_id. In
> other words, the user wants to create groups based on vendor_id, or
> vendor_id+device_id, and select devices from these groups.  John’s proposal
> is very good, to provide an API to create the PCI flavor(or alias). I
> prefer flavor because it’s more openstack style.

I disagree with this.  I agree that what you're saying offers more
flexibility after initial installation but I have various issues with it.

This is directly related to the hardware configuration on each compute
node.  For (some) other things of this nature, like provider networks, the
compute node is the only thing that knows what it has attached to it, and
it is the store (in configuration) of that information.  If I add a new
compute node then it's my responsibility to configure it correctly on
attachment, but when I add a compute node (when I'm setting the cluster up,
or sometime later on) then it's at that precise point that I know how I've
attached it and what hardware it's got on it.  Also, it's at this point
in time that I write out the configuration file (not by hand, note;
there's almost certainly automation when configuring hundreds of nodes so
arguments that 'if I'm writing hundreds of config files one will be wrong'
are moot).

I'm also not sure there's much reason to change the available devices
dynamically after that, since that's normally an activity that results from
changing the physical setup of the machine which implies that actually
you're going to have access to and be able to change the config as you do
it.  John did come up with one case where you might be trying to remove old
GPUs from circulation, but it's a very uncommon case that doesn't seem
worth coding for, and it's still achievable by changing the config and
restarting the compute processes.

This also reduces the autonomy of the compute node in favour of centralised
tracking, which goes against the 'distributed where possible' philosophy of
OpenStack.

Finally, you're not actually removing configuration from the compute node.
You still have to configure a whitelist there; in the grouping design you
also have to configure grouping (flavouring) on the control node.
The groups proposal adds one extra piece of information to the whitelists
that are already there to mark groups, not a whole new set of config lines.
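
As a rough sketch of what that one extra tag could look like (this is not
the actual nova code): each whitelist entry carries a "group" key, and the
compute node counts its matching devices per group.  The "group" key, the
function names and the device IDs and addresses below are all assumptions
made up for illustration.

```python
import json
from collections import Counter

# Hypothetical whitelist as it might sit in a compute node's config.
# The "group" key is the one extra tag the groups proposal adds; it is
# an assumption here, not an existing nova option.
WHITELIST_JSON = """
[
  {"vendor_id": "8086", "product_id": "10ed", "group": "net"},
  {"vendor_id": "10de", "product_id": "11b4", "group": "gpu"}
]
"""


def group_for(device, whitelist):
    """Return the group of the first whitelist entry matching the device, or None."""
    for entry in whitelist:
        # Every key except "group" is treated as a match criterion.
        if all(device.get(k) == v for k, v in entry.items() if k != "group"):
            return entry["group"]
    return None


def summarise(devices, whitelist):
    """Count whitelisted devices per group: the summary a node would report."""
    counts = Counter()
    for dev in devices:
        group = group_for(dev, whitelist)
        if group is not None:
            counts[group] += 1
    return dict(counts)


whitelist = json.loads(WHITELIST_JSON)

# Illustrative devices found on one compute node (addresses are invented).
devices = [
    {"address": "0000:05:10.0", "vendor_id": "8086", "product_id": "10ed"},
    {"address": "0000:05:10.2", "vendor_id": "8086", "product_id": "10ed"},
    {"address": "0000:81:00.0", "vendor_id": "10de", "product_id": "11b4"},
    {"address": "0000:00:1f.2", "vendor_id": "8086", "product_id": "1d02"},
]

print(summarise(devices, whitelist))  # {'net': 2, 'gpu': 1}
```

The last device matches no whitelist entry, so it is never offered for
passthrough and does not appear in the summary.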

To compare scheduling behaviour:

If I need 4G of RAM, each compute node has reported its summary of free
RAM to the scheduler.  I look for a compute node with 4G free, and filter
the list of compute nodes down.  This is a query on n records, n being the
number of compute nodes.  I schedule to the compute node, which then
confirms it does still have 4G free and runs the VM or rejects the request.

If I need 3 PCI devices and use the current system, each machine has
reported its device allocations to the scheduler.  With SRIOV multiplying
up the number of available devices, it's reporting back hundreds of records
per compute node to the schedulers, and the filtering activity is 3
queries on (n * PCI devices per node) records, which could easily
end up in the tens or even hundreds of thousands of records for a
moderately sized cloud.  The compute node also has a record of its device
allocations, which is also checked and updated before the final request is
run.

If I need 3 PCI devices and use the groups system, each machine has
reported its device *summary* to the scheduler.  With SRIOV multiplying up
the number of available devices, it's still reporting one or a small number
of categories, e.g. {net: 100}.  The difficulty of scheduling is a query
on num groups * n records - fewer, in fact, if some machines have no
passthrough devices.

You can see that there's quite a cost to be paid for having those flexible
alias APIs.
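
To put rough numbers on the comparison above, here is a toy model (an
illustration only, not how the nova scheduler is implemented).  The host
count, the device count per host, the record layouts and the function names
are all assumptions, and the per-device case is simplified to a single pass
over the records rather than 3 queries.

```python
# Toy model of the two PCI scheduling schemes.  All names and numbers
# are illustrative assumptions.

def build_cloud(n_hosts, devices_per_host):
    """Return (per-device records, per-host summaries) for a uniform cloud."""
    # Current system: one record per device, each (host, group, is_free).
    per_device = [
        (host, "net", True)
        for host in range(n_hosts)
        for _ in range(devices_per_host)
    ]
    # Groups system: one small summary per host, e.g. {"net": 100}.
    summaries = [(host, {"net": devices_per_host}) for host in range(n_hosts)]
    return per_device, summaries


def hosts_by_device_records(records, group, wanted):
    """Filter hosts by scanning every individual device record."""
    free = {}
    for host, dev_group, is_free in records:
        if dev_group == group and is_free:
            free[host] = free.get(host, 0) + 1
    return [host for host, count in free.items() if count >= wanted]


def hosts_by_summary(summaries, group, wanted):
    """Filter hosts by looking only at each host's per-group count."""
    return [host for host, counts in summaries if counts.get(group, 0) >= wanted]


per_device, summaries = build_cloud(n_hosts=200, devices_per_host=100)

# Both approaches pick the same hosts; only the amount of data differs.
assert hosts_by_device_records(per_device, "net", 3) == hosts_by_summary(summaries, "net", 3)

print(len(per_device), "per-device records vs", len(summaries), "summary records")
```

With these made-up numbers the scheduler looks at 20000 records in the
per-device case and 200 in the summary case, for the same answer.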

> 4)       IMHO, the core for nova PCI support is **PCI property**. The
> property means not only generic PCI devices like vendor id, device id,
> device type, compute specific property like BDF address, the adjacent
> switch IP address,  but also user defined property like neutron’s physical
> net name etc. And then, it’s about how to get these property, how to
> select/group devices based on the property, how to store/fetch these
> properties.

The thing about this is that you don't always, or even often, want to
select by property.  Some of these properties are just things that you need
to tell Neutron, they're not usually keys for scheduling.