[openstack-dev] [nova][placement] Re: VMWare's resource pool / cluster and nested resource providers

Eric Fried openstack at fried.cc
Mon Jan 29 18:27:05 UTC 2018

We had some lively discussion in #openstack-nova today, which I'll try
to summarize here.

First of all, the hierarchy:

           controller (n-cond)
            /               \
         cluster/n-cpu     cluster/n-cpu
         /           \           /     \
     res. pool    res. pool     ...    ...
    /         \       /    \
 host       host     ...   ...
 /  \      /    \
... ...  inst  inst

Important points:

(1) Instances do indeed get deployed to individual hosts, BUT vCenter
can and does move them around within a cluster independent of nova-isms
like live migration.

(2) VMWare wants the ability to specify that an instance should be
deployed to a specific resource pool.

(3) VMWare accounts for resources at the level of the resource pool (not
the individual host).

(4) Hosts can move fluidly among resource pools.

(5) Conceptually, VMWare would like you not to see or think about the
'host' layer at all.

(6) It has been suggested that resource pools may be best represented
via aggregates.  But to satisfy (2), this would require support for
allocation requests that specify an aggregate (e.g. porting the GET
/resource_providers?member_of=<agg> queryparam to GET
/allocation_candidates, plus the corresponding flavor enhancements).  And
doing so would mean getting past our reluctance, up to this point, to
expose aggregates by name/ID to users.
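
To make (6) concrete, here is a minimal sketch of the two query strings
involved.  The helper names are made up for illustration, and the
member_of filter on GET /allocation_candidates is the hypothetical port
described above, not something assumed to exist today.

```python
from urllib.parse import urlencode


def providers_in_aggregate_url(agg_uuid):
    """Existing-style query: resource providers that belong to an aggregate."""
    return "/resource_providers?" + urlencode({"member_of": agg_uuid})


def candidates_in_aggregate_url(resources, agg_uuid):
    """HYPOTHETICAL: the same member_of filter on GET /allocation_candidates."""
    # Resource amounts are written as CLASS:AMOUNT pairs, comma separated.
    amounts = ",".join(f"{rc}:{amt}" for rc, amt in sorted(resources.items()))
    # Keep ':' and ',' unescaped so the query stays readable.
    return "/allocation_candidates?" + urlencode(
        {"resources": amounts, "member_of": agg_uuid}, safe=":,")
```

A flavor would still need some way to name the aggregate, which is the
"flavor enhancements" part of (6).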

Here are some possible models:

(A) Today's model, where the cluster/n-cpu is represented as a single
provider owning all resources.  This requires some creative finagling of
inventory fields to ensure that a resource request might actually be
satisfied by a single host under this broad umbrella.  (An example cited
was to set VCPU's max_unit to whatever one host could provide.)  It is
not clear to me if/how resource pools have been represented in this
model thus far, or if/how it is currently possible to (2) target an
instance to a specific one.  I also don't see how anything we've done
with traits or aggregates would help with that aspect in this model.
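
As a rough illustration of the "creative finagling" in (A), the sketch
below builds one cluster-wide VCPU inventory record from per-host
capacities.  The function names are hypothetical, and picking the
largest host for max_unit is my assumption about what "whatever one host
could provide" means; the min_unit/max_unit check is a simplification of
what placement actually validates.

```python
def cluster_vcpu_inventory(host_vcpus):
    """Collapse per-host VCPU counts into one cluster-level inventory."""
    if not host_vcpus:
        raise ValueError("cluster has no hosts")
    return {
        "total": sum(host_vcpus),      # the whole cluster's capacity
        "reserved": 0,
        "min_unit": 1,
        "max_unit": max(host_vcpus),   # ASSUMPTION: cap at the largest host
        "step_size": 1,
        "allocation_ratio": 1.0,
    }


def request_fits(inventory, amount):
    """Simplified check: a single request must respect min/max unit."""
    return inventory["min_unit"] <= amount <= inventory["max_unit"]
```

With hosts of 16, 16 and 32 VCPUs, the cluster advertises 64 in total
but rejects a single 48-VCPU request, since no one host could hold it.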

(B) Representing each host as a root provider, each owning its own
actual inventory, each possessing a CUSTOM_RESOURCE_POOL_X trait
indicating which pool it belongs to at the moment; or representing pools
via aggregates as in (6).  This model breaks because of (1), unless we
give virt drivers some mechanism to modify allocations (e.g. via POST
/allocations) without doing an actual migration.
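
Here is a toy sketch of the trait flavor of (B).  ToyProviderTree is a
stand-in written for this note; it only loosely mirrors the method names
mentioned elsewhere in this thread (new_root, update_traits,
update_inventory) and is not nova's ProviderTree.  The host data and the
candidates() filter are likewise invented for illustration.

```python
class ToyProviderTree:
    """Stand-in for a provider tree holding only root providers."""

    def __init__(self):
        self.providers = {}

    def new_root(self, name):
        self.providers[name] = {"traits": set(), "inventory": {}}

    def update_traits(self, name, traits):
        self.providers[name]["traits"] = set(traits)

    def update_inventory(self, name, inventory):
        self.providers[name]["inventory"] = dict(inventory)


def publish_hosts(tree, hosts):
    """Register each host as a root, tagged with its current pool."""
    for name, info in hosts.items():
        tree.new_root(name)
        tree.update_traits(name, ["CUSTOM_RESOURCE_POOL_" + info["pool"]])
        tree.update_inventory(name, info["inventory"])


def candidates(tree, resources, required=()):
    """Hosts that carry every required trait and can fit the request."""
    return sorted(
        name for name, prov in tree.providers.items()
        if set(required) <= prov["traits"]
        and all(prov["inventory"].get(rc, 0) >= amt
                for rc, amt in resources.items())
    )
```

When a host moves to another pool, only its trait changes.  What this
sketch does not address is (1): allocations stay recorded against the
original host even after vCenter moves the instance.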

(C) Representing each resource pool as a root provider which presents
the collective inventory of all its hosts.  Each could possess its own
unique CUSTOM_RESOURCE_POOL_X trait.  Or we could possibly adapt
whatever mechanism Ironic uses when it targets a particular baremetal
node.  Or we could use aggregates as in (6), where each aggregate is
associated with just one provider.  This one breaks down because we
don't currently have a way for nova to know that, when an instance's
resources were allocated from the provider corresponding to resource
pool X, that means we should schedule the instance to (nova, n-cpu) host
Y.  There may be some clever solution for this involving aggregates (NOT
sharing providers!), but it has not been thought through.  It also
entails the same "creative finagling of inventory" described in (A).

(D) Using actual nested resource providers: the "cluster" is the
(inventory-less) root provider, and each resource pool is a child of the
cluster.  This is closest to representing the real logical hierarchy,
and is desirable for that reason.  The drawback is that you then MUST
use some mechanism to ensure allocations are never spread across pools.
If your request *always* targets a specific resource pool, that works.
Otherwise, you would have to use a numbered request group, as described
below.  It also entails the same "creative finagling of inventory"
described in (A).
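
To illustrate the drawback in (D), the toy model below contrasts a plain
request, where each resource class may come from any provider in the
tree, with a single numbered request group, where one provider must
satisfy everything.  This is a simplified sketch of the semantics as I
understand them, not placement's actual algorithm; the pool names and
function names are invented.

```python
from itertools import product


def unnumbered_candidates(pools, resources):
    """Each resource class may be drawn from a different pool."""
    per_class = [
        [(rc, name) for name, inv in sorted(pools.items())
         if inv.get(rc, 0) >= amt]
        for rc, amt in sorted(resources.items())
    ]
    # Every combination of per-class choices is a candidate.
    return [dict(combo) for combo in product(*per_class)]


def numbered_group_candidates(pools, resources):
    """All resources in the group must come from one pool."""
    return [
        {rc: name for rc in sorted(resources)}
        for name, inv in sorted(pools.items())
        if all(inv.get(rc, 0) >= amt for rc, amt in resources.items())
    ]
```

With two pools that can each satisfy VCPU:1,MEMORY_MB:2048, the first
function returns four candidates, two of which split the request across
pools; the second returns only the two single-pool candidates.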

(E) Take (D) a step further by adding each 'host' as a child of its
respective resource pool.  No "creative finagling", but same "moving
allocations" issue as (B).

I'm sure I've missed/misrepresented things.  Please correct and refine
as necessary.


On 01/27/2018 12:23 PM, Eric Fried wrote:
> Rado-
>     [+dev ML.  We're getting pretty general here; maybe others will get
> some use out of this.]
>> is there a way to make the scheduler allocate only from one specific RP
>     "...one specific RP" - is that Resource Provider or Resource Pool?
>     And are we talking about scheduling an instance to a specific
> compute node, or are we talking about making sure that all the requested
> resources are pulled from the same compute node (but it could be any one
> of several compute nodes)?  Or just limiting the scheduler to any node in
> a specific resource pool?
>     To make sure I'm fully grasping the VMWare-specific
> ratios/relationships between resource pools and compute nodes, I have
> been assuming:
> controller 1:many compute "host" (where n-cpu runs)
> compute "host"  1:many resource pool
> resource pool 1:many compute "node" (where instances can be scheduled)
> compute "node" 1:many instance
>     (I don't know if this "host" vs "node" terminology is correct, but
> I'm going to keep pretending it is for the purposes of this note.)
>     In particular, if that last line is true, then you do *not* want
> multiple compute "nodes" in the same provider tree.
>> if no custom trait is specified in the request?
>     I am not aware of anything current or planned that will allow you to
> specify an aggregate you want to deploy from; so the only way I'm aware
> of that you could pin a request to a resource pool is to create a custom
> trait for that resource pool, tag all compute nodes in the pool with
> that trait, and specify that trait in your flavor.  This way you don't
> use nested-ness at all.  And in this model, there's also no need to
> create resource providers corresponding to resource pools - their
> sole manifestation is via traits.
>     (Bonus: this model will work with what we've got merged in Queens -
> we didn't quiiite finish the piece of NRP that makes them work for
> allocation candidates, but we did merge trait support.  We're also
> *mostly* there with aggregates, but I wouldn't want to rely on them
> working perfectly and we're not claiming full support for them.)
>     To be explicit, in the model I'm suggesting, your compute "host",
> within update_provider_tree, would create new_root()s for each compute
> "node".  So the "tree" isn't really a tree - it's a flat list of
> computes, of which one happens to correspond to the `nodename` and
> represents the compute "host".  (I assume deploys can happen to the
> compute "host" just like they can to a compute "node"?  If not, just
> give that guy no inventory and he'll be avoided.)  It would then
> update_traits(node, ['CUSTOM_RPOOL_X']) for each.  It would also
> update_inventory() for each as appropriate.
>     Now on your deploys, to get scheduled to a particular resource pool,
> you would have to specify required=CUSTOM_RPOOL_X in your flavor.
>     That's it.  You never use new_child().  There are no providers
> corresponding to pools.  There are no aggregates.
>     Are we making progress, or am I confused/confusing?
> Eric
> On 01/27/2018 01:50 AM, Radoslav Gerganov wrote:
>> +Chris
>> Hi Eric,
>> Thanks a lot for sending this.  I must admit that I am still trying to
>> catch up with how the scheduler (will) work when there are nested RPs,
>> traits, etc.  I thought mostly about the case when we use a custom
>> trait to force allocations only from one resource pool.  However, if
>> no trait is specified then we can end up in the situation that you
>> describe (allocating different resources from different resource
>> pools) and this is not what we want.  If we go with the model that you
>> propose, is there a way to make the scheduler allocate only from one
>> specific RP if no custom trait is specified in the request?
>> Thanks,
>> Rado
>> ------------------------------------------------------------------------
>> *From:* Eric Fried <openstack at fried.cc>
>> *Sent:* Friday, January 26, 2018 10:20 PM
>> *To:* Radoslav Gerganov
>> *Cc:* Jay Pipes
>> *Subject:* VMWare's resource pool / cluster and nested resource providers
>> Rado-
>>         It occurred to me just now that the model you described to me [1]
>> isn't going to work, unless there's something I really misunderstood.
>>         The problem is that the placement API will think it can allocate
>> resources from anywhere in the tree for a given allocation request
>> (unless you always use a single numbered request group [2] in your
>> flavors, which doesn't sound like a clean plan).
>>         So if you have *any* model where multiple compute nodes reside in
>> the same provider tree, and I come along with a request for say
>> VCPU:1,MEMORY_MB:2048,DISK_GB:512, placement will happily give you a
>> candidate with the VCPU from compute10, the memory from compute5, and
>> the disk from compute7.  I'm only guessing that this isn't a viable way
>> to boot an instance.
>>         I go back to my earlier suggestion: I think you need to create the
>> compute nodes as root providers in your ProviderTree, and find some
>> other way to mark the resource pool associations.  You could do it with
>> custom traits (CUSTOM_RESOURCE_POOL_X, ..._Y, etc.); or you could do it
>> with aggregates (an aggregate maps to a resource pool; associate all the
>> compute providers in a given pool with its aggregate uuid).
>>                         Thanks,
>>                         Eric
>> [1]
>> http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-01-26.log.html#t2018-01-26T14:40:44
>> [2]
>> https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/granular-resource-requests.html#numbered-request-groups