<div dir="ltr">On 10 January 2014 07:40, Jiang, Yunhong <span dir="ltr"><<a href="mailto:yunhong.jiang@intel.com" target="_blank">yunhong.jiang@intel.com</a>></span> wrote:<div class="gmail_extra"><div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div link="blue" vlink="purple" lang="ZH-CN">
<div>
<p class="MsoNormal"><span style="font-size:10.5pt;font-family:"Calibri","sans-serif";color:#1f497d" lang="EN-US">Robert, sorry that I’m not fan of * your group * term. To me, *your group” mixed two thing. It’s an extra property provided
by configuration, and also it’s a very-not-flexible mechanism to select devices (you can only select devices based on the ‘group name’ property).<u></u><u></u></span></p>
<p class="MsoNormal"><span style="font-size:10.5pt;font-family:"Calibri","sans-serif";color:#1f497d" lang="EN-US"><u></u></span></p></div></div></blockquote><div><br></div><div>It is exactly that. It's 0 new config items, 0 new APIs, just an extra tag on the whitelists that are already there (although the proposal suggests changing the name of them to be more descriptive of what they now do). And you talk about flexibility as if this changes frequently, but in fact the grouping / aliasing of devices almost never changes after installation, which is, not coincidentally, when the config on the compute nodes gets set up.<br>
</div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div link="blue" vlink="purple" lang="ZH-CN"><div>
<p style="margin-left:18.0pt">
<u></u><span style="font-size:10.5pt;font-family:"Calibri","sans-serif";color:#1f497d" lang="EN-US"><span>1)<span style="font:7.0pt "Times New Roman"">
</span></span></span><u></u><span style="font-size:10.5pt;font-family:"Calibri","sans-serif";color:#1f497d" lang="EN-US">A dynamic group is much better. For example, user may want to select GPU device based on vendor id, or based
on vendor_id+device_id. In another word, user want to create group based on vendor_id, or vendor_id+device_id and select devices from these group. John’s proposal is very good, to provide an API to create the PCI flavor(or alias). I prefer flavor because
it’s more openstack style.</span></p></div></div></blockquote><div>I disagree with this. I agree that what you're saying offers a more flexibilibility after initial installation but I have various issues with it.</div>

This is directly related to the hardware configuration on each compute
node. For (some) other things of this nature, like provider networks,
the compute node is the only thing that knows what it has attached to
it, and it is the store (in configuration) of that information. If I
add a new compute node then it's my responsibility to configure it
correctly on attachment; and when I add a compute node (when I'm
setting the cluster up, or some time later on), it's at that precise
point that I know how I've attached it and what hardware it's got on
it. It's also at that point in time that I write out the configuration
file (not by hand, note; there's almost certainly automation when
configuring hundreds of nodes, so arguments that 'if I'm writing
hundreds of config files, one will be wrong' are moot).

I'm also not sure there's much reason to change the available devices
dynamically after that, since that's normally an activity that results
from changing the physical setup of the machine, which implies that
you're actually going to have access to, and be able to change, the
config as you do it. John did come up with one case where you might be
trying to remove old GPUs from circulation, but it's a very uncommon
case that doesn't seem worth coding for, and it's still achievable by
changing the config and restarting the compute processes.

This also reduces the autonomy of the compute node in favour of
centralised tracking, which goes against the 'distributed where
possible' philosophy of OpenStack.

Finally, you're not actually removing configuration from the compute
node. You still have to configure a whitelist there; in the flavor
design you additionally have to configure the grouping (flavouring) on
the control node. The groups proposal adds one extra piece of
information to the whitelists that are already there to mark groups,
not a whole new set of config lines.
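
For contrast, the flavor proposal keeps that compute-node whitelist and
adds a centrally managed flavor on top, conceptually something like the
following (entirely hypothetical; the flavor API was only a proposal at
this point, so neither the endpoint nor the payload shape existed):

    # Hypothetical, for illustration only -- no such API existed:
    #   POST /v2/os-pci-flavors
    #   {"name": "gpu", "match": {"vendor_id": "10de"}}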

To compare scheduling behaviour:

If I need 4G of RAM, each compute node has reported its summary of free
RAM to the scheduler. I look for a compute node with 4G free and filter
the list of compute nodes down. This is a query on n records, n being
the number of compute nodes. I schedule to the compute node, which then
confirms it does still have 4G free and runs the VM or rejects the
request.
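
As a minimal sketch of that filter step (illustrative names only, not
nova's actual scheduler code):

    # Illustrative Python, not nova's real scheduler classes or fields.
    def filter_hosts_by_ram(hosts, ram_needed_mb):
        # One pass over n host summaries: one record per compute node.
        return [h for h in hosts if h["free_ram_mb"] >= ram_needed_mb]

    hosts = [{"name": "node1", "free_ram_mb": 2048},
             {"name": "node2", "free_ram_mb": 8192}]
    print(filter_hosts_by_ram(hosts, 4096))  # -> node2 only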

If I need 3 PCI devices and use the current system, each machine has
reported its per-device allocations to the scheduler. With SR-IOV
multiplying up the number of available devices, that's hundreds of
records per compute node, and the filtering activity is three queries
over n * (PCI devices per node) records, i.e. every PCI device in the
cloud, which could easily end up in the tens or even hundreds of
thousands of records for a moderately sized cloud. The compute node
also has a record of its device allocations, which is checked and
updated again before the final request is run.

If I need 3 PCI devices and use the groups system, each machine has
reported its device *summary* to the scheduler. Even with SR-IOV
multiplying up the number of available devices, it's still reporting
one or a small number of categories, e.g. {net: 100}. The scheduling
work is a query on (num groups * n) records (fewer, in fact, if some
machines have no passthrough devices).
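
To put rough illustrative numbers on that (my assumptions: 1,000
compute nodes, 100 SR-IOV virtual functions per node, 2 groups per
node):

    # Back-of-envelope arithmetic for the two reporting schemes.
    nodes, vfs_per_node, groups_per_node = 1000, 100, 2

    per_device_records = nodes * vfs_per_node    # 100,000 rows to filter
    per_group_records = nodes * groups_per_node  # 2,000 rows to filter

    print(per_device_records, per_group_records)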

You can see that there's quite a cost to be paid for having those
flexible alias APIs.

<div><p style="margin-left:18.0pt"><span style="font-size:10.5pt;font-family:"Calibri","sans-serif";color:#1f497d" lang="EN-US"><u></u><u></u></span></p><span style="font-size:10.5pt;font-family:"Calibri","sans-serif";color:#1f497d" lang="EN-US"><span>4)<span style="font:7.0pt "Times New Roman"">
</span></span></span><span style="font-size:10.5pt;font-family:"Calibri","sans-serif";color:#1f497d" lang="EN-US">IMHO, the core for nova PCI support is *<b>PCI property</b>*. The property means not only generic PCI devices like vendor id, device
id, device type, compute specific property like BDF address, the adjacent switch IP address, but also user defined property like nuertron’s physical net name etc. And then, it’s about how to get these property, how to select/group devices based on the property,
how to store/fetch these properties.</span></div></div></blockquote><div><br>The thing about this is that you don't always, or even often, want to select by property. Some of these properties are just things that you need to tell Neutron, they're not usually keys for scheduling.<br>
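
For instance, a whitelist entry might carry a physical-network tag that
exists purely to be handed to Neutron at attach time, never to be
scheduled against. A sketch, with illustrative key names (the exact
syntax was not settled in this thread):

    # Sketch: "physical_network" here is informational -- passed through
    # to Neutron on attach, not a scheduling key.
    pci_passthrough_whitelist = {"vendor_id": "8086",
                                 "product_id": "10ed",
                                 "physical_network": "physnet1"}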