[openstack-dev] [cyborg] [nova] Cyborg quotas

Nadathur, Sundar sundar.nadathur at intel.com
Fri May 18 11:58:17 UTC 2018


Hi Matt,

On 5/17/2018 3:18 PM, Matt Riedemann wrote:
> On 5/17/2018 3:36 PM, Nadathur, Sundar wrote:
>> This applies only to the resources that Nova handles, IIUC, which 
>> does not handle accelerators. The generic method that Alex talks 
>> about is obviously preferable but, if that is not available in Rocky, 
>> is the filter an option?
>
> If nova isn't creating accelerator resources managed by cyborg, I have 
> no idea why nova would be doing quota checks on those types of 
> resources. And no, I don't think adding a scheduler filter to nova for 
> checking accelerator quota is something we'd add either. I'm not sure 
> that would even make sense - the quota for the resource is per tenant, 
> not per host is it? The scheduler filters work on a per-host basis.
Can we not override BaseFilter.filter_all(), which sees all the hosts at once, to do this check in a filter?
https://github.com/openstack/nova/blob/master/nova/filters.py#L36

I should have made it clearer that this putative filter would be 
out-of-tree, and needed only until a better solution becomes available.
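
For concreteness, here is a rough sketch of the kind of filter I have in 
mind. Only BaseFilter.filter_all() comes from Nova (a real scheduler filter 
would derive from BaseHostFilter); the accelerator-counting and Cyborg 
quota helpers below are imaginary:

    # Out-of-tree sketch only; _requested_accelerators() and _accel_quota()
    # are imaginary helpers, not existing Nova or Cyborg APIs.
    from nova import filters

    class AcceleratorQuotaFilter(filters.BaseFilter):
        """Drop every host in one shot if the tenant is over its
        accelerator quota, instead of deciding host by host."""

        def filter_all(self, filter_obj_list, spec_obj):
            hosts = list(filter_obj_list)
            requested = self._requested_accelerators(spec_obj)
            if not requested:
                return hosts
            # Assumed call into Cyborg for the tenant's usage and limit.
            in_use, limit = self._accel_quota(spec_obj.project_id)
            if in_use + requested > limit:
                return []    # over quota: no host passes
            return hosts
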
>
> Like any other resource in openstack, the project that manages that 
> resource should be in charge of enforcing quota limits for it.
Agreed. I am not sure how other projects handle it, but here is the 
situation for Cyborg. A request may get scheduled on a compute node with 
no intervention by Cyborg, so the earliest check that can be made today is 
on the selected compute node. A naive per-node check can result in quota 
violations, as in this example.

    Say there are 5 devices in a cluster. A tenant has a quota of 4 and
    is currently using 3. That leaves 2 unused devices, of which the
    tenant is permitted to use only one. But the tenant may submit two
    concurrent requests, which may land on two different compute nodes.
    The Cyborg agent on each node will see the current tenant usage as 3
    and let its request through, resulting in a quota violation.
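
To make the race concrete, here is a rough sketch of that naive per-node 
check, assuming a hypothetical usage table keyed by project. None of these 
names exist in Cyborg today:

    # Hypothetical model and check -- illustration only, not Cyborg code.
    from sqlalchemy import Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class AccelQuotaUsage(Base):
        __tablename__ = 'accelerator_quota_usages'
        project_id = Column(String(36), primary_key=True)
        in_use = Column(Integer, nullable=False, default=0)
        quota_limit = Column(Integer, nullable=False)

    def naive_claim(session, project_id, requested=1):
        row = session.query(AccelQuotaUsage).get(project_id)  # plain read
        if row.in_use + requested > row.quota_limit:
            raise ValueError('accelerator quota exceeded')
        # Both agents can read in_use == 3 here before either one commits,
        # so both requests pass and the tenant ends up using 5 devices.
        row.in_use += requested
        session.commit()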

To prevent this, we need some kind of atomic update, like SQLAlchemy's 
with_lockmode():
https://wiki.openstack.org/wiki/OpenStack_and_SQLAlchemy#Pessimistic_Locking_-_SELECT_FOR_UPDATE 

That approach has its own issues, as documented in the link above. Also, 
since every compute node would be doing that, it would serialize the 
bringup of all instances with accelerators across the cluster.
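
For comparison, the locked version of the same check would look roughly 
like this, reusing the made-up AccelQuotaUsage model from the sketch above 
(with_for_update() is the newer SQLAlchemy spelling of 
with_lockmode('update')):

    # Same check made atomic: SELECT ... FOR UPDATE holds a row lock until
    # commit, so concurrent claims for the same project are serialized.
    def locked_claim(session, project_id, requested=1):
        row = (session.query(AccelQuotaUsage)
                      .filter_by(project_id=project_id)
                      .with_for_update()            # SELECT ... FOR UPDATE
                      .one())
        if row.in_use + requested > row.quota_limit:
            session.rollback()
            raise ValueError('accelerator quota exceeded')
        row.in_use += requested
        session.commit()

This illustrates the serialization concern: every compute node's claim 
contends on the same quota rows in one database, so instance bringups 
with accelerators queue up behind one another.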

If there is a better solution, I'll be happy to hear it.

Thanks,
Sundar



