[openstack-dev] [nova] [neutron] PCI pass-through network support
Ian Wells
ijw.ubuntu at cack.org.uk
Fri Jan 10 12:19:56 UTC 2014
In any case, we don't have to decide this now. If we simply allowed the
whitelist to add extra arbitrary properties to the PCI record (like a group
name) and return it to the central server, we could, for the minute, use
that in scheduling as a group name, we wouldn't implement the APIs for
flavors yet, and we could get a working system that would be minimally
changed from what we already have. We could worry about the scheduling in
the scheduling group, and we could leave the APIs (which, as I say, are a
minimally useful feature) until later. Then we'd have something useful in
short order.
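
To make that concrete, here's a rough sketch of the sort of thing I mean -
the 'group' key, the match-field list and the helper names are all
illustrative, nothing here is a settled format:

    import json

    # Fields the whitelist already matches devices on.
    MATCH_FIELDS = ('vendor_id', 'product_id', 'address')

    # A whitelist entry with one extra, arbitrary property tacked on.
    entry = json.loads(
        '{"vendor_id": "8086", "product_id": "10ca", "group": "fast-net"}')

    def matches(device, entry):
        """True if the device matches all of the entry's match fields."""
        return all(device.get(k) == v
                   for k, v in entry.items() if k in MATCH_FIELDS)

    def annotate(device, entry):
        """Copy any non-match properties onto the PCI record we report."""
        extras = dict((k, v) for k, v in entry.items()
                      if k not in MATCH_FIELDS)
        return dict(device, **extras)

    device = {'vendor_id': '8086', 'product_id': '10ca',
              'address': '0000:06:10.0'}
    if matches(device, entry):
        print(annotate(device, entry))
        # -> the device record plus 'group': 'fast-net', reported back to
        #    the scheduler unchanged.

The compute node doesn't need to understand the tag at all; it just passes
it up with the record, and for now the scheduler can treat it as an opaque
group name.
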
--
Ian.
On 10 January 2014 13:08, Ian Wells <ijw.ubuntu at cack.org.uk> wrote:
> On 10 January 2014 07:40, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:
>
>> Robert, sorry that I'm not a fan of *your group* term. To me, *your
>> group* mixes two things: it's an extra property provided by configuration,
>> and it's also a very-not-flexible mechanism to select devices (you can only
>> select devices based on the 'group name' property).
>>
>>
> It is exactly that. It's 0 new config items, 0 new APIs, just an extra
> tag on the whitelists that are already there (although the proposal
> suggests changing the name of them to be more descriptive of what they now
> do). And you talk about flexibility as if this changes frequently, but in
> fact the grouping / aliasing of devices almost never changes after
> installation, which is, not coincidentally, when the config on the compute
> nodes gets set up.
>
>> 1) A dynamic group is much better. For example, a user may want to
>> select GPU devices based on vendor_id, or based on vendor_id+device_id. In
>> other words, users want to create groups based on vendor_id or
>> vendor_id+device_id and select devices from those groups. John's proposal
>> is very good: provide an API to create the PCI flavor (or alias). I
>> prefer flavor because it's more OpenStack style.
>>
> I disagree with this. I agree that what you're saying offers more
> flexibility after initial installation, but I have various issues with
> it.
>
> This is directly related to the hardware configuration on each compute
> node. For (some) other things of this nature, like provider networks, the
> compute node is the only thing that knows what it has attached to it, and
> it is the store (in configuration) of that information. If I add a new
> compute node then it's my responsibility to configure it correctly on
> attachment, but when I add a compute node (when I'm setting the cluster up,
> or sometime later on) then it's at that precise point that I know how I've
> attached it and what hardware it's got on it. Also, it's at that point in
> time that I write out the configuration file (not by hand, note; there's
> almost certainly automation when configuring hundreds of nodes, so
> arguments that 'if I'm writing hundreds of config files one will be wrong'
> are moot).
>
> I'm also not sure there's much reason to change the available devices
> dynamically after that, since that's normally an activity that results from
> changing the physical setup of the machine, which implies that you're
> actually going to have access to, and be able to change, the config as you
> do it. John did come up with one case where you might be trying to remove
> old GPUs from circulation, but it's a very uncommon case that doesn't seem
> worth coding for, and it's still achievable by changing the config and
> restarting the compute processes.
>
> This also reduces the autonomy of the compute node in favour of
> centralised tracking, which goes against the 'distributed where possible'
> philosophy of OpenStack.
>
> Finally, you're not actually removing configuration from the compute
> node. You still have to configure a whitelist there; with the flavour
> design you also have to configure the grouping (flavouring) on the control
> node as well. The groups proposal adds one extra piece of information to
> the whitelists that are already there to mark groups, not a whole new set
> of config lines.
>
>
> To compare scheduling behaviour:
>
> If I need 4G of RAM, each compute node has reported its summary of free
> RAM to the scheduler. I look for a compute node with 4G free, and filter
> the list of compute nodes down. This is a query on n records, n being the
> number of compute nodes. I schedule to the compute node, which then
> confirms it does still have 4G free and runs the VM or rejects the request.
>
> If I need 3 PCI devices and use the current system, each machine has
> reported its device allocations to the scheduler. With SRIOV multiplying
> up the number of available devices, it's reporting back hundreds of records
> per compute node to the schedulers, and the filtering activity is 3
> queries over n * (PCI devices per node) records, which could easily
> end up in the tens or even hundreds of thousands of records for a
> moderately sized cloud. The compute node also has a record of its device
> allocations, which is also checked and updated before the final request is
> run.
>
> If I need 3 PCI devices and use the groups system, each machine has
> reported its device *summary* to the scheduler. With SRIOV multiplying up
> the number of available devices, it's still reporting one or a small number
> of categories, e.g. { net: 100 }. The scheduling work is a query over
> (num groups * n) records - fewer, in fact, if some machines have no
> passthrough devices.
>
> You can see that there's quite a cost to be paid for having those flexible
> alias APIs.
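>
> To put a very rough sketch on that (the hosts, counts and field names
> below are all invented for illustration; this isn't the real scheduler
> data model):
>
>     # Current system: one row per device; the scheduler filters them all.
>     per_device = [
>         {'host': 'node1', 'product_id': '10ca', 'free': True},
>         {'host': 'node1', 'product_id': '10ca', 'free': True},
>         {'host': 'node1', 'product_id': '10ca', 'free': True},
>         # ...hundreds more rows per SRIOV-capable host in reality...
>     ]
>
>     def hosts_with_n_free_devices(records, n):
>         counts = {}
>         for rec in records:
>             if rec['free']:
>                 counts[rec['host']] = counts.get(rec['host'], 0) + 1
>         return [h for h, c in counts.items() if c >= n]
>
>     # Groups proposal: one summary per host, one entry per group.
>     per_group = {'node1': {'net': 100}, 'node2': {'net': 98, 'gpu': 2}}
>
>     def hosts_with_n_in_group(summaries, group, n):
>         return [h for h, s in summaries.items() if s.get(group, 0) >= n]
>
>     hosts_with_n_free_devices(per_device, 3)    # walks every device record
>     hosts_with_n_in_group(per_group, 'net', 3)  # walks one row per host
>
> The second form only ever touches one summary row per host per group; the
> first grows with the total number of devices in the cloud.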
>
>> 4) IMHO, the core of nova PCI support is the **PCI property**. By property
>> I mean not only generic PCI properties like vendor id, device id and device
>> type, and compute-specific properties like the BDF address or the adjacent
>> switch's IP address, but also user-defined properties like neutron's
>> physical net name, etc. And then it's about how to get these properties,
>> how to select/group devices based on them, and how to store/fetch them.
>>
>
> The thing about this is that you don't always, or even often, want to
> select by property. Some of these properties are just things that you need
> to tell Neutron; they're not usually keys for scheduling.
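>
> For example (the names here are invented, just to illustrate the
> distinction):
>
>     # A reported PCI record might carry both kinds of property:
>     record = {
>         'address': '0000:06:10.0',
>         'group': 'fast-net',             # scheduling key
>         'physical_network': 'physnet1',  # handed to Neutron at port
>     }                                    # binding, never used to pick a host
>
> We'd schedule on the group; the physical network name just rides along so
> that Neutron can be told about it when the port is bound.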
>