[openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

Jay Pipes jaypipes at gmail.com
Fri Jun 9 21:35:35 UTC 2017


Sorry, been in a three-hour meeting. Comments inline...

On 06/06/2017 10:56 AM, Chris Dent wrote:
> On Mon, 5 Jun 2017, Ed Leafe wrote:
>
>> One proposal is to essentially use the same logic in placement
>> that was used to include that host in those matching the
>> requirements. In other words, when it tries to allocate the amount
>> of disk, it would determine that that host is in a shared storage
>> aggregate, and be smart enough to allocate against that provider.
>> This was referred to in our discussion as "Plan A".
>
> What would help me is a greater explanation of whether, and if so how
> and why, "Plan A" doesn't work for nested resource providers.

We'd have to add all the sorting/weighing logic from the existing 
scheduler into the Placement API. Otherwise, the Placement API won't 
understand which child provider to pick out of many providers that meet 
resource/trait requirements.

> We can declare that allocating for shared disk is fairly deterministic
> if we assume that any given compute node is only associated with one
> shared disk provider.

a) We can't assume that.
b) A compute node could very well have both local disk and shared disk.
How would the placement API know which one to pick? That is a
sorting/weighing decision, and thus something the scheduler is
responsible for.
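
To make that concrete, here's a contrived sketch (the names and data
shapes are invented, this is not placement or nova code) of the kind of
policy decision that belongs in the scheduler:

    # Both providers can satisfy a request for 100 DISK_GB on the same
    # compute node; picking between them is deployer policy, not math.
    candidates = [
        {"provider": "local-disk-uuid", "traits": {"CUSTOM_LOCAL_DISK"}},
        {"provider": "shared-pool-uuid", "traits": {"CUSTOM_SHARED_NFS"}},
    ]

    def prefer_local(candidate):
        # One deployer weighs local disk higher for IO-heavy flavors;
        # another prefers shared disk to keep live migration cheap.
        return 1.0 if "CUSTOM_LOCAL_DISK" in candidate["traits"] else 0.0

    chosen = max(candidates, key=prefer_local)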

> My understanding is this determinism is not the case with nested
> resource providers because there's some fairly late in the game
> choosing of which pci device or which numa cell is getting used.
> The existing resource tracking doesn't have this problem because the
> claim of those resources is made very late in the game. <- Is this
> correct?

No, it's not about determinism or how late in the game a claim decision 
is made. It's really just that the scheduler is the thing that does 
sorting/weighing, not the placement API. We made this decision based on 
operator feedback: operators were not willing to give up the ability to 
add custom weighers or to have scheduling policies that rely on 
transient data like thermal metrics collection.
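
As a sketch of what operators do with that ability today (again,
invented names, not the actual weigher plugin interface):

    def coolest_first(filtered_hosts, inlet_temps):
        # inlet_temps is whatever out-of-band metrics collection the
        # deployer runs. Placement never stores this transient data,
        # which is exactly why the weighing has to stay in the scheduler.
        return sorted(filtered_hosts,
                      key=lambda host: inlet_temps.get(host, float("inf")))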

> The problem comes into play when we want to claim from the scheduler
> (or conductor). Additional information is required to choose which
> child providers to use. <- Is this correct?

Correct.

> Plan B overcomes the information deficit by including more
> information in the response from placement (as straw-manned in the
> etherpad [1]) allowing code in the filter scheduler to make accurate
> claims. <- Is this correct?

Partly, yes. But, more than anything it's about the placement API 
returning resource provider UUIDs for child providers and sharing 
providers so that the scheduler, when it picks one of those SRIOV 
physical functions, or NUMA cells, or shared storage pools, has the 
identifier with which to tell the placement API "ok, claim *this* 
resource against *this* provider".
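
Purely as illustration (the real format is what the etherpad straw-man
is trying to pin down, so don't read this as a spec), the shape of the
information is something like:

    # One candidate for the request: not just "this host fits", but the
    # specific providers the scheduler could claim against.
    candidate = {
        "compute_node": "cn1-uuid",
        "allocations": [
            {"provider": "cn1-uuid",
             "resources": {"VCPU": 2, "MEMORY_MB": 2048}},
            {"provider": "shared-storage-pool-uuid",
             "resources": {"DISK_GB": 100}},
        ],
    }

With those provider UUIDs in hand, the scheduler can turn its
sorting/weighing choice into a concrete claim by writing exactly those
allocations for the instance's consumer UUID.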

> * We already have the information the filter scheduler needs now by
>   some other means, right?  What are the reasons we don't want to
>   use that anymore?

The filter scheduler has most of the information, yes. What it doesn't 
have is the *identifier* (UUID) for things like SRIOV PFs or NUMA cells 
that the Placement API will use to distinguish between things. In other 
words, the filter scheduler currently does things like unpack a 
NUMATopology object into memory and determine a NUMA cell on which to 
place an instance. However, it has no concept that that NUMA cell is (or 
soon will be, once nested-resource-providers is done) a resource provider in 
the placement API. Same for SRIOV PFs. Same for VGPUs. Same for FPGAs, 
etc. That's why we need to return information to the scheduler from the 
placement API that will allow the scheduler to understand "hey, this 
NUMA cell on compute node X is resource provider $UUID".
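
That lookup is impossible today and becomes trivial once placement hands
back identifiers. A hypothetical sketch (field names made up, just to
show the shape of it):

    def provider_for_numa_cell(provider_summaries, cell_id):
        # provider_summaries: child provider UUID -> summary, as returned
        # by placement for one compute node. Today the scheduler picks
        # "cell 1" out of an unpacked NUMATopology object but has no UUID
        # to claim that choice against.
        for uuid, summary in provider_summaries.items():
            if summary.get("numa_cell") == cell_id:
                return uuid
        return None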

> * Part of the reason for having nested resource providers is because
>   it can allow affinity/anti-affinity below the compute node (e.g.,
>   workloads on the same host but different numa cells).

Mmm, kinda, yeah.

>   If I
>   remember correctly, the modelling and tracking of this kind of
>   information in this way comes out of the time when we imagined the
>   placement service would be doing considerably more filtering than
>   is planned now. Plan B appears to be an acknowledgement of "on
>   some of this stuff, we can't actually do anything but provide you
>   some info, you need to decide".

Not really. Filtering is still going to be done in the placement API. 
It's the thing that says "hey, these providers (or trees of providers) 
meet these resource and trait requirements". The scheduler however is 
what takes that set of filtered providers and does its sorting/weighing 
magic and selects one.
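
Sketched in code (all of these names are invented for illustration):

    def placement_filter(providers, requested):
        # Placement's job: return every provider that can fit the
        # requested amounts per resource class (required traits elided).
        return [p for p in providers
                if all(p["free"].get(rc, 0) >= amount
                       for rc, amount in requested.items())]

    def scheduler_pick(candidates, weighers):
        # The scheduler's job: apply deployer-defined policy to that
        # filtered set and select exactly one.
        return max(candidates,
                   key=lambda c: sum(w(c) for w in weighers))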

> If that's the case, is the
>   topological modelling on the placement DB side of things solely a
>   convenient place to store information? If there were some other
>   way to model that topology could things currently being considered
>   for modelling as nested providers be instead simply modelled as
>   inventories of a particular class of resource?
>   (I'm not suggesting we do this, rather that the answer that says
>   why we don't want to do this is useful for understanding the
>   picture.)

The modeling of the topologies of providers in the placement API/DB is 
strictly to ensure consistency and correctness of representation. We're 
modeling the actual relationship between resource providers in a generic 
way and not embedding that topology information in a variety of JSON 
blobs and other structs in the cell database.
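
Roughly, the relationship being modeled looks like this (illustrative
records, not the actual schema):

    # An explicit parent/child tree of providers...
    compute_node = {"uuid": "cn1-uuid", "parent": None}
    numa_cell_0 = {"uuid": "numa0-uuid", "parent": "cn1-uuid",
                   "inventory": {"VCPU": 16, "MEMORY_MB": 65536}}
    sriov_pf_1 = {"uuid": "pf1-uuid", "parent": "numa0-uuid",
                  "inventory": {"SRIOV_NET_VF": 8}}
    # ...instead of a NUMATopology JSON blob stashed in a column of the
    # cell database that only nova knows how to unpack.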

> * Does a claim made in the scheduler need to be complete? Is there
>   value in making a partial claim from the scheduler that consumes a
>   vcpu and some ram, and then in the resource tracker is corrected
>   to consume a specific pci device, numa cell, gpu and/or fpga?
>   Would this be better or worse than what we have now? Why?

Good question. I think the answer to this is probably pretty theoretical 
at this point. My gut instinct is that we should treat the consumption 
of resources in an atomic fashion, and that the transactional nature of 
allocation will result in fewer race conditions and cleaner code. But, 
admittedly, this is just my gut reaction.
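
To illustrate what I mean by atomic (just a sketch, not a proposed API):

    # All-or-nothing: every allocation for the instance lands in one
    # request, in one transaction.
    atomic_claim = {
        "consumer": "instance-uuid",
        "allocations": [
            {"provider": "cn1-uuid",
             "resources": {"VCPU": 2, "MEMORY_MB": 2048}},
            {"provider": "pf1-uuid",
             "resources": {"SRIOV_NET_VF": 1}},
        ],
    }
    # The partial alternative would claim VCPU/MEMORY_MB from the
    # scheduler and let the resource tracker "correct" the allocation
    # later; that second write can lose a race for the last VF on the
    # chosen PF, and then we're back to retries.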

> * What is lacking in placement's representation of resource providers
>   that makes it difficult or impossible for an allocation against a
>   parent provider to be able to determine the correct child
>   providers to which to cascade some of the allocation? (And by
>   extension make the earlier scheduling decision.)

See above. The sorting/weighing logic, which is very much 
deployer-defined and reeks of customization, is what would need to be 
added to the placement API.

best,
-jay

> That's a start. With answers to at least some of these questions I
> think the straw man in the etherpad can be more effectively
> evaluated. As things stand right now it is a proposed solution
> without a clear problem statement. I feel like we could do with a
> more clear problem statement.
>
> Thanks.
>
> [1] https://etherpad.openstack.org/p/placement-allocations-straw-man