<div dir="ltr">Agree with that; whether we tweak inventory or traits, neither approach works.<div><br></div><div>Same as vGPU, we can support a pre-programmed mode for multi-function regions, where each region can support only one type of function.</div><div><br></div><div><div class="gmail_extra" style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial">There are two reasons why Cyborg has a filter:</div><div class="gmail_extra" style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial">* it records the usage of functions in a region</div><div class="gmail_extra" style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial">* it records which function is programmed.</div></div><div><br></div><div>For #1, each region provides multiple functions. Each function can be</div><div>assigned to a VM. So we should create a ResourceProvider for the region, and</div><div>the resource class is the function. That is similar to an SR-IOV device: 
the region (the PF)</div><div>provides functions (VFs).<br><div class="gmail_extra"><br></div><div class="gmail_extra">For #2, we should use a trait to distinguish the function type.</div><div class="gmail_extra"><br></div><div class="gmail_extra">Then we no longer keep any inventory info in Cyborg, we don't need any filter in Cyborg either,</div><div class="gmail_extra">and there is no race condition anymore.</div><div class="gmail_extra"><br></div><div class="gmail_extra"><div class="gmail_quote">2018-03-29 2:48 GMT+08:00 Eric Fried <span dir="ltr"><<a href="mailto:openstack@fried.cc" target="_blank">openstack@fried.cc</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Sundar-<br>

<br>

        We're running across this issue in several places right now.   One<br>

thing that's definitely not going to get traction is<br>

automatically/implicitly tweaking inventory in one resource class when<br>

an allocation is made on a different resource class (whether in the same<br>

or different RPs).<br>

<br>

        Slightly less of a nonstarter, but still likely to get significant<br>

push-back, is the idea of tweaking traits on the fly.  For example, your<br>

vGPU case might be modeled as:<br>

<br>

PGPU_RP: {<br>

  inventory: {<br>

      CUSTOM_VGPU_TYPE_A: 2,<br>

      CUSTOM_VGPU_TYPE_B: 4,<br>

  },<br>

  traits: [<br>

      CUSTOM_VGPU_TYPE_A_CAPABLE,<br>

      CUSTOM_VGPU_TYPE_B_CAPABLE,<br>

  ]<br>

}<br>

<br>

        The request would come in for<br>

resources=CUSTOM_VGPU_TYPE_A:<wbr>1&required=CUSTOM_VGPU_TYPE_A_<wbr>CAPABLE, resulting<br>

in an allocation of CUSTOM_VGPU_TYPE_A:1.  Now while you're processing<br>

that, you would *remove* CUSTOM_VGPU_TYPE_B_CAPABLE from the PGPU_RP.<br>

So it doesn't matter that there's still inventory of<br>

CUSTOM_VGPU_TYPE_B:4, because a request including<br>

required=CUSTOM_VGPU_TYPE_B_<wbr>CAPABLE won't be satisfied by this RP.<br>

There's of course a window between when the initial allocation is made<br>

and when you tweak the trait list.  In that case you'll just have to<br>

fail the loser.  This would be like any other failure in e.g. the spawn<br>

process; it would bubble up, the allocation would be removed; retries<br>

might happen or whatever.<br>

<br>

        Like I said, you're likely to get a lot of resistance to this idea as<br>

well.  (Though TBH, I'm not sure how we can stop you beyond -1'ing your<br>

patches; there's nothing about placement that disallows it.)<br>

<br>

        The simple-but-inefficient solution is simply that we'd still be able<br>

to make allocations for vGPU type B, but you would have to fail right<br>

away when it came down to cyborg to attach the resource.  Which is code<br>

you pretty much have to write anyway.  It's an improvement if cyborg<br>

gets to be involved in the post-get-allocation-candidates<br>

weighing/filtering step, because you can do that check at that point to<br>

help filter out the candidates that would fail.  Of course there's still<br>

a race condition there, but it's no different than for any other resource.<br>

<br>

efried<br>

<span class=""><br>

On 03/28/2018 12:27 PM, Nadathur, Sundar wrote:<br>

> Hi Eric and all,<br>

>     I should have clarified that this race condition happens only for<br>

> the case of devices with multiple functions. There is a prior thread<br>

</span>> <<a href="http://lists.openstack.org/pipermail/openstack-dev/2018-March/127882.html" rel="noreferrer" target="_blank">http://lists.openstack.org/<wbr>pipermail/openstack-dev/2018-<wbr>March/127882.html</a>><br>

<span class="">> about it. I was trying to get a solution within Cyborg, but that faces<br>

> this race condition as well.<br>

><br>

> IIUC, this situation is somewhat similar to the issue with vGPU types<br>

</span>> <<a href="http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-03-27.log.html#t2018-03-27T13:41:00" rel="noreferrer" target="_blank">http://eavesdrop.openstack.<wbr>org/irclogs/%23openstack-nova/<wbr>%23openstack-nova.2018-03-27.<wbr>log.html#t2018-03-27T13:41:00</a>><br>

<div class="HOEnZb"><div class="h5">> (thanks to Alex Xu for pointing this out). In the latter case, we could<br>

> start with an inventory of (vgpu-type-a: 2; vgpu-type-b: 4).  But, after<br>

> consuming a unit of  vGPU-type-a, ideally the inventory should change<br>

> to: (vgpu-type-a: 1; vgpu-type-b: 0). With multi-function accelerators,<br>

> we start with an RP inventory of (region-type-A: 1, function-X: 4). But,<br>

> after consuming a unit of that function, ideally the inventory should<br>

> change to: (region-type-A: 0, function-X: 3).<br>

><br>

> I understand that this approach is controversial :) Also, one difference<br>

> from the vGPU case is that the number and count of vGPU types is static,<br>

> whereas with FPGAs, one could reprogram it to result in more or fewer<br>

> functions. That said, we could hopefully keep this analogy in mind for<br>

> future discussions.<br>

><br>

> We probably will not support multi-function accelerators in Rocky. This<br>

> discussion is for the longer term.<br>

><br>

> Regards,<br>

> Sundar<br>

><br>

> On 3/23/2018 12:44 PM, Eric Fried wrote:<br>

>> Sundar-<br>

>><br>

>>      First thought is to simplify by NOT keeping inventory information in<br>

>> the cyborg db at all.  The provider record in the placement service<br>

>> already knows the device (the provider ID, which you can look up in the<br>

>> cyborg db) the host (the root_provider_uuid of the provider representing<br>

>> the device) and the inventory, and (I hope) you'll be augmenting it with<br>

>> traits indicating what functions it's capable of.  That way, you'll<br>

>> always get allocation candidates with devices that *can* load the<br>

>> desired function; now you just have to engage your weigher to prioritize<br>

>> the ones that already have it loaded so you can prefer those.<br>

>><br>

>>      Am I missing something?<br>

>><br>

>>              efried<br>

>><br>

>> On 03/22/2018 11:27 PM, Nadathur, Sundar wrote:<br>

>>> Hi all,<br>

>>>     There seems to be a possibility of a race condition in the<br>

>>> Cyborg/Nova flow. Apologies for missing this earlier. (You can refer to<br>

>>> the proposed Cyborg/Nova spec<br>

>>> <<a href="https://review.openstack.org/#/c/554717/1/doc/specs/rocky/cyborg-nova-sched.rst" rel="noreferrer" target="_blank">https://review.openstack.org/<wbr>#/c/554717/1/doc/specs/rocky/<wbr>cyborg-nova-sched.rst</a>><br>

>>> for details.)<br>

>>><br>

>>> Consider the scenario where the flavor specifies a resource class for a<br>

>>> device type, and also specifies a function (e.g. encrypt) in the extra<br>

>>> specs. The Nova scheduler would only track the device type as a<br>

>>> resource, and Cyborg needs to track the availability of functions.<br>

>>> Further, to keep it simple, say all the functions exist all the time (no<br>

>>> reprogramming involved).<br>

>>><br>

>>> To recap, here is the scheduler flow for this case:<br>

>>><br>

>>>   * A request spec with a flavor comes to Nova conductor/scheduler. The<br>

>>>     flavor has a device type as a resource class, and a function in the<br>

>>>     extra specs.<br>

>>>   * Placement API returns the list of RPs (compute nodes) which contain<br>

>>>     the requested device types (but not necessarily the function).<br>

>>>   * Cyborg will provide a custom filter which queries Cyborg DB. This<br>

>>>     needs to check which hosts contain the needed function, and filter<br>

>>>     out the rest.<br>

>>>   * The scheduler selects one node from the filtered list, and the<br>

>>>     request goes to the compute node.<br>

>>><br>

>>> For the filter to work, the Cyborg DB needs to maintain a table with<br>

>>> triples of (host, function type, #free units). The filter checks if a<br>

>>> given host has one or more free units of the requested function type.<br>

>>> But, to keep the # free units up to date, Cyborg on the selected compute<br>

>>> node needs to notify the Cyborg API to decrement the #free units when an<br>

>>> instance is spawned, and to increment them when resources are released.<br>

>>><br>

>>> Therein lies the catch: this loop from the compute node to controller is<br>

>>> susceptible to race conditions. For example, if two simultaneous<br>

>>> requests each ask for function A, and there is only one unit of that<br>

>>> available, the Cyborg filter will approve both, both may land on the<br>

>>> same host, and one will fail. This is because Cyborg on the controller<br>

>>> does not decrement resource usage due to one request before processing<br>

>>> the next request.<br>

>>><br>

>>> This is similar to this previous Nova scheduling issue<br>

>>> <<a href="https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/placement-claims.html" rel="noreferrer" target="_blank">https://specs.openstack.org/<wbr>openstack/nova-specs/specs/<wbr>pike/implemented/placement-<wbr>claims.html</a>>.<br>

>>> That was solved by having the scheduler claim a resource in Placement<br>

>>> for the selected node. I don't see an analog for Cyborg, since it would<br>

>>> not know which node is selected.<br>

>>><br>

>>> Thanks in advance for suggestions and solutions.<br>

>>><br>

>>> Regards,<br>

>>> Sundar<br>

>>><br>


>>> ______________________________<wbr>______________________________<wbr>______________<br>

>>> OpenStack Development Mailing List (not for usage questions)<br>

>>> Unsubscribe: <a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" rel="noreferrer" target="_blank">OpenStack-dev-request@lists.<wbr>openstack.org?subject:<wbr>unsubscribe</a><br>

>>> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" rel="noreferrer" target="_blank">http://lists.openstack.org/<wbr>cgi-bin/mailman/listinfo/<wbr>openstack-dev</a><br>

>>><br>


><br>

><br>

><br>


><br>

<br>


</div></div></blockquote></div><br></div></div></div>
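<div><br></div><div>P.S. To make the modeling at the top of this reply a bit more concrete (region as a resource provider, function as the resource class, a trait for the programmed function type), here is a rough sketch in Python. It is not Cyborg or Placement code: RegionProvider, claim, CUSTOM_FPGA_FUNCTION and CUSTOM_FUNCTION_ENCRYPT are names made up for illustration, and the lock only stands in for whatever makes the claim atomic on the Placement side.</div>
<pre>
```python
import threading
from dataclasses import dataclass, field


@dataclass
class RegionProvider:
    """Hypothetical stand-in for a resource provider that models one region."""
    name: str
    traits: set        # programmed function type, e.g. {"CUSTOM_FUNCTION_ENCRYPT"}
    inventory: dict    # resource class -> total units, e.g. {"CUSTOM_FPGA_FUNCTION": 4}
    used: dict = field(default_factory=dict)
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def claim(self, resource_class, amount, required_trait):
        """Check the trait and the free units, then record usage, in one step."""
        with self._lock:
            if required_trait not in self.traits:
                return False
            in_use = self.used.get(resource_class, 0)
            free = self.inventory.get(resource_class, 0) - in_use
            if amount > free:
                return False
            self.used[resource_class] = in_use + amount
            return True


if __name__ == "__main__":
    # One free unit, two simultaneous requests: exactly one claim succeeds.
    region = RegionProvider(
        "region0", {"CUSTOM_FUNCTION_ENCRYPT"}, {"CUSTOM_FPGA_FUNCTION": 1}
    )
    results = []

    def attempt():
        ok = region.claim("CUSTOM_FPGA_FUNCTION", 1, "CUSTOM_FUNCTION_ENCRYPT")
        results.append(ok)

    workers = [threading.Thread(target=attempt) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(sorted(results))  # [False, True]
```
</pre>
<div>The point is only that the trait check and the unit accounting live in one place, so there is no separate Cyborg table of free units to keep in sync.</div>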