[openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

Edward Leafe ed at leafe.com
Mon Jun 19 17:59:56 UTC 2017

On Jun 19, 2017, at 9:17 AM, Jay Pipes <jaypipes at gmail.com> wrote:

As Matt pointed out, I mis-wrote when I said “current flow”. I meant “current agreed-to design flow”. So no need to rehash that.

>> * Placement returns a number of these data structures as JSON blobs. Due to the size of the data, a page size will have to be determined, and placement will have to either maintain that list of structured data for subsequent requests, or re-run the query and only calculate the data structures for the hosts that fit in the requested page.
> "of these data structures as JSON blobs" is kind of redundant... all our REST APIs return data structures as JSON blobs.

Well, I was trying to be specific. I didn’t mean to imply that this was a radical departure or anything.

> While we discussed the fact that there may be a lot of entries, we did not say we'd immediately support a paging mechanism.

OK, thanks for clarifying that. When we discussed returning 1.5K per compute host instead of a couple of hundred bytes, there was discussion that paging would be necessary.

>> * Scheduler continues to request the paged results until it has them all.
> See above. Was discussed briefly as a concern but not work to do for first patches.
>> * Scheduler then runs this data through the filters and weighers. No HostState objects are required, as the data structures will contain all the information that scheduler will need.
> No, this isn't correct. The scheduler will have *some* of the information it requires for weighing from the returned data from the GET /allocation_candidates call, but not all of it.
> Again, operators have insisted on keeping the flexibility currently in the Nova scheduler to weigh/sort compute nodes by things like thermal metrics and kinds of data that the Placement API will never be responsible for.
> The scheduler will need to merge information from the "provider_summaries" part of the HTTP response with information it has already in its HostState objects (gotten from ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).

OK, that’s informative, too. Is there anything decided on how much host info will be in the response from placement, and how much will be in HostState? Or how the reporting of resources by the compute nodes will have to change to feed this information to placement? Or how the two sources of information will be combined so that the filters and weighers can process it? Or is that still to be worked out?
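To make the merging question concrete, here is a minimal sketch of combining placement's "provider_summaries" with the scheduler's HostState objects before filtering and weighing. The merge helper and attribute names are illustrative assumptions, not nova code; only "provider_summaries" and HostState come from the thread.

```python
# Hypothetical sketch: attach placement resource data to the matching
# HostState so filters/weighers see one combined object per host.
# merge_provider_summaries() and the placement_summary attribute are
# invented for illustration.

def merge_provider_summaries(provider_summaries, host_states_by_uuid):
    """Combine placement data with scheduler-side host data.

    provider_summaries: dict mapping resource provider UUID to a dict of
        resource usage/capacity, as returned by GET /allocation_candidates.
    host_states_by_uuid: dict mapping compute node UUID to HostState-like
        objects holding scheduler-only data (metrics, aggregates, etc.).
    """
    merged = []
    for rp_uuid, summary in provider_summaries.items():
        host_state = host_states_by_uuid.get(rp_uuid)
        if host_state is None:
            # Provider known to placement but not in the scheduler's
            # cache; skip rather than weigh incomplete data.
            continue
        host_state.placement_summary = summary
        merged.append(host_state)
    return merged
```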

>> * Scheduler then selects the data structure at the top of the ranked list. Inside that structure is a dict of the allocation data that scheduler will need to claim the resources on the selected host. If the claim fails, the next data structure in the list is chosen, and repeated until a claim succeeds.
> Kind of, yes. The scheduler will select a *host* that meets its needs.
> There may be more than one allocation request that includes that host resource provider, because of shared providers and (soon) nested providers. The scheduler will choose one of these allocation requests and attempt a claim of resources by simply PUT /allocations/{instance_uuid} with the serialized body of that allocation request. If 202 returned, cool. If not, repeat for the next allocation request.

Ah, yes, good point. A host with multiple nested providers, or with shared and local storage, will have to have multiple copies of the data structure returned to reflect those permutations. 
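The claim-retry loop described above could be sketched as follows. The put_allocations callable stands in for a real placement client doing PUT /allocations/{instance_uuid}; the function name and signature are assumptions for clarity, while the 202-or-try-next behavior is what the thread describes.

```python
# Illustrative sketch of the claim loop: try each allocation request in
# ranked order until placement accepts one with HTTP 202. The
# put_allocations callable is a stand-in for a real placement client.

def claim_resources(put_allocations, instance_uuid, allocation_requests):
    """Return the allocation request that was successfully claimed, or None."""
    for alloc_req in allocation_requests:
        status = put_allocations(
            "/allocations/%s" % instance_uuid, body=alloc_req)
        if status == 202:
            return alloc_req  # claim succeeded
        # Claim failed (e.g. 409: another scheduler claimed those
        # resources first); fall through to the next candidate.
    return None
```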

>> * Scheduler then creates a list of N of these data structures, with the first being the data for the selected host, and the rest being data structures representing alternates consisting of the next hosts in the ranked list that are in the same cell as the selected host.
> Yes, this is the proposed solution for allowing retries within a cell.
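The list-building step above could be sketched like this: take the top-ranked host, then fill out the list with the next-best hosts from the same cell. The tuple shape and the num_alternates parameter are assumptions for illustration only.

```python
# Hedged sketch of building the selected-host-plus-alternates list for
# in-cell retries. ranked_hosts is assumed to be (host, cell) tuples,
# best first; none of these names come from nova itself.

def build_selection_list(ranked_hosts, num_alternates):
    """Return [selected_host, alternate1, alternate2, ...]."""
    if not ranked_hosts:
        return []
    selected, cell = ranked_hosts[0]
    selections = [selected]
    for host, host_cell in ranked_hosts[1:]:
        if len(selections) > num_alternates:
            break  # we have the selected host plus num_alternates
        if host_cell == cell:
            # Only hosts in the same cell as the selected host qualify.
            selections.append(host)
    return selections
```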


>> * Scheduler returns that list to conductor.
>> * Conductor determines the cell of the selected host, and sends that list to the target cell.
>> * Target cell tries to build the instance on the selected host. If it fails, it uses the allocation data in the data structure to unclaim the resources for the selected host, and tries to claim the resources for the next host in the list using its allocation data. It then tries to build the instance on the next host in the list of alternates. Only when all alternates fail does the build request fail.
> I'll let Dan discuss this last part.

Well, that’s not substantially different than the original plan, so no additional explanation is required.
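For reference, that original plan for in-cell retries might be sketched as below: on a build failure, unclaim the failed host's resources, claim the next alternate, and try again. The build, claim, and unclaim callables are injected stand-ins for the real conductor and placement operations; none of these names are from nova.

```python
# Hedged sketch of the target-cell retry loop. The first host's
# resources are assumed already claimed by the scheduler; alternates
# are claimed only when they are actually attempted.

def build_with_alternates(selections, build, claim, unclaim):
    """selections: (host, allocation) pairs, selected host first.

    Returns the host the instance was built on, or None if every
    alternate failed (at which point the build request fails).
    """
    for i, (host, allocation) in enumerate(selections):
        if i > 0 and not claim(host, allocation):
            continue  # alternate could not be claimed; try the next one
        if build(host):
            return host
        unclaim(host, allocation)  # free resources before retrying
    return None
```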

One other thing: since this new functionality is exposed via a new API call, is the existing method of filtering resource providers by passing in resources going to be deprecated? And is the code for adding trait filtering to that method also no longer useful?

-- Ed Leafe
