[openstack-dev] [nova][placement] Re: VMWare's resource pool / cluster and nested resource providers

Giridhar Jayavelu gjayavelu at vmware.com
Mon Jan 29 18:56:04 UTC 2018

Response inline.

On 1/29/18, 10:27 AM, "Eric Fried" <openstack at fried.cc> wrote:

>We had some lively discussion in #openstack-nova today, which I'll try
>to summarize here.
>First of all, the hierarchy:
>           controller (n-cond)
>            /               \
>         cluster/n-cpu     cluster/n-cpu
>         /           \           /     \
>     res. pool    res. pool     ...    ...
>    /         \       /    \
> host       host     ...   ...
> /  \      /    \
>... ...  inst  inst
>Important points:
>(1) Instances do indeed get deployed to individual hosts, BUT vCenter
>can and does move them around within a cluster independent of nova-isms
>like live migration.
>(2) VMWare wants the ability to specify that an instance should be
>deployed to a specific resource pool.
>(3) VMWare accounts for resources at the level of the resource pool (not
>host).
>(4) Hosts can move fluidly among resource pools.
>(5) Conceptually, VMWare would like you not to see or think about the
>'host' layer at all.
>(6) It has been suggested that resource pools may be best represented
>via aggregates.  But to satisfy (2), this would require support for
>doing allocation requests that specify one (e.g. porting the GET
>/resource_providers ?member_of=<agg> queryparam to GET
>/allocation_candidates, and the corresponding flavor enhancements).  And
>doing so would mean getting past our reluctance up to this point of
>exposing aggregates by name/ID to users.
>Here are some possible models:
>(A) Today's model, where the cluster/n-cpu is represented as a single
>provider owning all resources.  This requires some creative finagling of
>inventory fields to ensure that a resource request might actually be
>satisfied by a single host under this broad umbrella.  (An example cited
>was to set VCPU's max_unit to whatever one host could provide.)  It is
>not clear to me if/how resource pools have been represented in this
>model thus far, or if/how it is currently possible to (2) target an
>instance to a specific one.  I also don't see how anything we've done
>with traits or aggregates would help with that aspect in this model.
>(B) Representing each host as a root provider, each owning its own
>actual inventory, each possessing a CUSTOM_RESOURCE_POOL_X trait
>indicating which pool it belongs to at the moment; or representing pools
>via aggregates as in (6).  This model breaks because of (1), unless we
>give virt drivers some mechanism to modify allocations (e.g. via POST
>/allocations) without doing an actual migration.
>(C) Representing each resource pool as a root provider which presents
>the collective inventory of all its hosts.  Each could possess its own
>unique CUSTOM_RESOURCE_POOL_X trait.  Or we could possibly adapt
>whatever mechanism Ironic uses when it targets a particular baremetal
>node.  Or we could use aggregates as in (6), where each aggregate is
>associated with just one provider.  This one breaks down because we
>don't currently have a way for nova to know that, when an instance's
>resources were allocated from the provider corresponding to resource
>pool X, that means we should schedule the instance to (nova, n-cpu) host
>Y.  There may be some clever solution for this involving aggregates (NOT
>sharing providers!), but it has not been thought through.  It also
>entails the same "creative finagling of inventory" described in (A).
>(D) Using actual nested resource providers: the "cluster" is the
>(inventory-less) root provider, and each resource pool is a child of the
>cluster.  This is closest to representing the real logical hierarchy,
>and is desirable for that reason.  The drawback is that you then MUST
>use some mechanism to ensure allocations are never spread across pools.
>If your request *always* targets a specific resource pool, that works.
>Otherwise, you would have to use a numbered request group, as described
>below.  It also entails the same "creative finagling of inventory"
>described in (A).
I think nested resource providers are the better option for another reason:
every resource pool can have its own limits. So it is important to track
allocations/usage and ensure that the scheduler can raise an error if there
are insufficient resources in the vCenter resource pool. NOTE: a vCenter
cluster, which is the compute node, might have more capacity left, but the
resource pool limit could still prevent placing a VM in that pool. And yes,
the request would always target a specific resource pool.
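
A minimal sketch of why the per-pool limit matters, assuming placement's usual
capacity rule of (total - reserved) * allocation_ratio. The Inventory class and
all of the numbers are hypothetical and only illustrate the point above:

```python
from dataclasses import dataclass


@dataclass
class Inventory:
    """One resource class on one provider (illustrative, not placement's model)."""
    total: int
    reserved: int = 0
    allocation_ratio: float = 1.0
    used: int = 0

    def can_fit(self, amount: int) -> bool:
        # Usable capacity is (total - reserved) * allocation_ratio.
        capacity = (self.total - self.reserved) * self.allocation_ratio
        return self.used + amount <= capacity


# Hypothetical numbers: the cluster as a whole has plenty of memory left ...
cluster_memory = Inventory(total=262144, used=65536)
# ... but resource pool X is limited to 32 GiB and is nearly full.
pool_x_memory = Inventory(total=32768, used=30720)

request_mb = 4096
print(cluster_memory.can_fit(request_mb))  # True
print(pool_x_memory.can_fit(request_mb))   # False
```

If only the cluster were tracked as a provider, the request would be accepted;
tracking the pool as its own provider lets the limit be enforced at scheduling
time.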

>(E) Take (D) a step further by adding each 'host' as a child of its
>respective resource pool.  No "creative finagling", but same "moving
>allocations" issue as (B).

This might not work because a resource pool is a logical construct. A vCenter
cluster may also have no resource pools at all: VMs can be placed on a vCenter
cluster with or without a resource pool.

>I'm sure I've missed/misrepresented things.  Please correct and refine
>as necessary.
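
For illustration, the "creative finagling of inventory" mentioned in (A) could
look roughly like the sketch below. The function name and host sizes are
hypothetical; the field names (total, min_unit, max_unit, step_size) are the
standard placement inventory fields:

```python
def cluster_vcpu_inventory(host_vcpus):
    """Collapse per-host VCPU counts into one cluster-wide inventory record."""
    return {
        "VCPU": {
            "total": sum(host_vcpus),     # the cluster provider owns everything
            "min_unit": 1,
            "max_unit": max(host_vcpus),  # one request may not exceed the largest host
            "step_size": 1,
        }
    }


inv = cluster_vcpu_inventory([16, 32, 24])
print(inv["VCPU"]["total"], inv["VCPU"]["max_unit"])  # 72 32
```

Note that max_unit only bounds the size of a single request. It does not
guarantee that any one host still has that much free at scheduling time.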


>On 01/27/2018 12:23 PM, Eric Fried wrote:
>> Rado-
>>     [+dev ML.  We're getting pretty general here; maybe others will get
>> some use out of this.]
>>> is there a way to make the scheduler allocate only from one specific RP
>>     "...one specific RP" - is that Resource Provider or Resource Pool?
>>     And are we talking about scheduling an instance to a specific
>> compute node, or are we talking about making sure that all the requested
>> resources are pulled from the same compute node (but it could be any one
>> of several compute nodes)?  Or just limiting the scheduler to any node in
>> a specific resource pool?
>>     To make sure I'm fully grasping the VMWare-specific
>> ratios/relationships between resource pools and compute nodes, I have
>> been assuming:
>> controller 1:many compute "host" (where n-cpu runs)
>> compute "host"  1:many resource pool
>> resource pool 1:many compute "node" (where instances can be scheduled)
>> compute "node" 1:many instance
>>     (I don't know if this "host" vs. "node" terminology is correct, but
>> I'm going to keep pretending it is for the purposes of this note.)
>>     In particular, if that last line is true, then you do *not* want
>> multiple compute "nodes" in the same provider tree.
>>> if no custom trait is specified in the request?
>>     I am not aware of anything current or planned that will allow you to
>> specify an aggregate you want to deploy from; so the only way I'm aware
>> of that you could pin a request to a resource pool is to create a custom
>> trait for that resource pool, tag all compute nodes in the pool with
>> that trait, and specify that trait in your flavor.  This way you don't
>> use nested-ness at all.  And in this model, there's also no need to
>> create resource providers corresponding to resource pools - their
>> sole manifestation is via traits.
>>     (Bonus: this model will work with what we've got merged in Queens -
>> we didn't quiiite finish the piece of NRP that makes them work for
>> allocation candidates, but we did merge trait support.  We're also
>> *mostly* there with aggregates, but I wouldn't want to rely on them
>> working perfectly and we're not claiming full support for them.)
>>     To be explicit, in the model I'm suggesting, your compute "host",
>> within update_provider_tree, would create new_root()s for each compute
>> "node".  So the "tree" isn't really a tree - it's a flat list of
>> computes, of which one happens to correspond to the `nodename` and
>> represents the compute "host".  (I assume deploys can happen to the
>> compute "host" just like they can to a compute "node"?  If not, just
>> give that guy no inventory and he'll be avoided.)  It would then
>> update_traits(node, ['CUSTOM_RPOOL_X']) for each.  It would also
>> update_inventory() for each as appropriate.
>>     Now on your deploys, to get scheduled to a particular resource pool,
>> you would have to specify required=CUSTOM_RPOOL_X in your flavor.
>>     That's it.  You never use new_child().  There are no providers
>> corresponding to pools.  There are no aggregates.
>>     Are we making progress, or am I confused/confusing?
>> Eric
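
The flat, trait-based model described in the quoted message above can be
sketched as follows. SketchProviderTree is a toy stand-in written for this
example. It borrows the method names mentioned in the message (new_root,
update_traits, update_inventory) but is NOT nova's ProviderTree, and its
signatures, the node names and the inventory figures are all assumptions:

```python
import uuid


class SketchProviderTree:
    """Toy stand-in for a provider tree; not the real nova class or signatures."""

    def __init__(self):
        self.providers = {}

    def new_root(self, name):
        # Every compute "node" becomes its own root: a flat list, no children.
        self.providers[name] = {
            "uuid": str(uuid.uuid4()),
            "traits": set(),
            "inventory": {},
        }

    def update_traits(self, name, traits):
        self.providers[name]["traits"] = set(traits)

    def update_inventory(self, name, inventory):
        self.providers[name]["inventory"] = dict(inventory)

    def with_required_trait(self, trait):
        # Mimics filtering providers by a required trait.
        return sorted(n for n, p in self.providers.items() if trait in p["traits"])


def update_provider_tree(tree, pool_membership):
    """Register each node as a root and tag it with its resource pool trait."""
    for node, pool in pool_membership.items():
        tree.new_root(node)
        tree.update_traits(node, ["CUSTOM_RPOOL_" + pool])
        tree.update_inventory(
            node, {"VCPU": {"total": 16}, "MEMORY_MB": {"total": 65536}}
        )


tree = SketchProviderTree()
update_provider_tree(tree, {"node1": "X", "node2": "X", "node3": "Y"})
print(tree.with_required_trait("CUSTOM_RPOOL_X"))  # ['node1', 'node2']
```

When a host moves to another pool, only its trait changes; there is no
provider that represents the pool itself.
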
>> On 01/27/2018 01:50 AM, Radoslav Gerganov wrote:
>>> +Chris
>>> Hi Eric,
>>> Thanks a lot for sending this.  I must admit that I am still trying to
>>> catch up with how the scheduler (will) work when there are nested RPs,
>>> traits, etc.  I thought mostly about the case when we use a custom
>>> trait to force allocations only from one resource pool.  However, if
>>> no trait is specified then we can end up in the situation that you
>>> describe (allocating different resources from different resource
>>> pools) and this is not what we want.  If we go with the model that you
>>> propose, is there a way to make the scheduler allocate only from one
>>> specific RP if no custom trait is specified in the request?
>>> Thanks,
>>> Rado
>>> ------------------------------------------------------------------------
>>> *From:* Eric Fried <openstack at fried.cc>
>>> *Sent:* Friday, January 26, 2018 10:20 PM
>>> *To:* Radoslav Gerganov
>>> *Cc:* Jay Pipes
>>> *Subject:* VMWare's resource pool / cluster and nested resource providers
>>> Rado-
>>>         It occurred to me just now that the model you described to me [1]
>>> isn't going to work, unless there's something I really misunderstood.
>>>         The problem is that the placement API will think it can allocate
>>> resources from anywhere in the tree for a given allocation request
>>> (unless you always use a single numbered request group [2] in your
>>> flavors, which doesn't sound like a clean plan).
>>>         So if you have *any* model where multiple compute nodes reside in
>>> the same provider tree, and I come along with a request for say
>>> VCPU:1,MEMORY_MB:2048,DISK_GB:512, placement will happily give you a
>>> candidate with the VCPU from compute10, the memory from compute5, and
>>> the disk from compute7.  I'm only guessing that this isn't a viable way
>>> to boot an instance.
>>>         I go back to my earlier suggestion: I think you need to create the
>>> compute nodes as root providers in your ProviderTree, and find some
>>> other way to mark the resource pool associations.  You could do it with
>>> custom traits (CUSTOM_RESOURCE_POOL_X, ..._Y, etc.); or you could do it
>>> with aggregates (an aggregate maps to a resource pool; associate all the
>>> compute providers in a given pool with its aggregate uuid).
>>>                         Thanks,
>>>                         Eric
>>> [1]
>>> http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-01-26.log.html#t2018-01-26T14:40:44
>>> [2]
>>> https://specs.openstack.org/openstack/nova-specs/specs/queens/approved/granular-resource-requests.html#numbered-request-groups
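
The spreading problem from the quoted message above (VCPU from compute10,
memory from compute5, disk from compute7) can be reproduced with a small
sketch. This is not placement's actual algorithm. The free-capacity figures
and both helper functions are hypothetical and only contrast "any provider in
the tree" with "one provider must satisfy everything":

```python
# Hypothetical free capacity per compute node, all inside ONE provider tree.
providers = {
    "compute5": {"VCPU": 0, "MEMORY_MB": 4096, "DISK_GB": 100},
    "compute7": {"VCPU": 0, "MEMORY_MB": 1024, "DISK_GB": 1000},
    "compute10": {"VCPU": 8, "MEMORY_MB": 512, "DISK_GB": 50},
}
request = {"VCPU": 1, "MEMORY_MB": 2048, "DISK_GB": 512}


def spread_candidate(providers, request):
    """Pick, per resource class, any provider in the tree that can supply it."""
    picks = {}
    for rc, amount in request.items():
        for name in sorted(providers):
            if providers[name].get(rc, 0) >= amount:
                picks[rc] = name
                break
        else:
            return None  # nobody in the tree can supply this resource class
    return picks


def single_provider_candidates(providers, request):
    """Only providers that can satisfy the whole request on their own."""
    return sorted(
        name
        for name, free in providers.items()
        if all(free.get(rc, 0) >= amount for rc, amount in request.items())
    )


print(spread_candidate(providers, request))
# {'VCPU': 'compute10', 'MEMORY_MB': 'compute5', 'DISK_GB': 'compute7'}
print(single_provider_candidates(providers, request))  # []
```

The tree as a whole "fits" the request even though no single compute node
does, which is why the message suggests making each compute node its own root
provider.
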
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev