[openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction

sfinucan at redhat.com
Tue Jun 20 14:09:29 UTC 2017


On Mon, 2017-06-19 at 09:36 -0500, Matt Riedemann wrote:
> On 6/19/2017 9:17 AM, Jay Pipes wrote:
> > On 06/19/2017 09:04 AM, Edward Leafe wrote:
> > > Current flow:
> 
> As noted in the nova-scheduler meeting this morning, this should have 
> been called "original plan" rather than "current flow", as Jay pointed 
> out inline.
> 
> > > * Scheduler gets a req spec from conductor, containing resource 
> > > requirements
> > > * Scheduler sends those requirements to placement
> > > * Placement runs a query to determine the root RPs that can satisfy 
> > > those requirements
> > 
> > Not root RPs. Non-sharing resource providers, which currently 
> > effectively means compute node providers. Nested resource providers 
> > isn't yet merged, so there is currently no concept of a hierarchy of 
> > providers.
> > 
> > > * Placement returns a list of the UUIDs for those root providers to 
> > > scheduler
> > 
> > It returns the provider names and UUIDs, yes.
> > 
> > > * Scheduler uses those UUIDs to create HostState objects for each host
> > 
> > Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing 
> > in a list of the provider UUIDs it got back from the placement service. 
> > The scheduler then builds a set of HostState objects from the results of 
> > ComputeNodeList.get_all_by_uuid().
> > 
> > The scheduler also keeps a set of AggregateMetadata objects in memory, 
> > including the association of aggregate to host (note: this is the 
> > compute node's *service*, not the compute node object itself, thus the 
> > reason aggregates don't work properly for Ironic nodes).
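
For anyone following along at home, that lookup step boils down to something
like the sketch below. Plain dicts stand in for the real ComputeNode and
HostState objects, so treat this as illustrative only:

    # Illustrative sketch of building host state from placement results.
    # The dict-based structures are stand-ins for nova's real objects.

    def build_host_states(provider_uuids, compute_nodes_by_uuid,
                          aggregates_by_service_host):
        """Build per-host state for the providers placement returned."""
        host_states = []
        for uuid in provider_uuids:
            node = compute_nodes_by_uuid.get(uuid)
            if node is None:
                # Known to placement but not (yet) to this scheduler.
                continue
            host_states.append({
                'uuid': uuid,
                'host': node['host'],
                'free_ram_mb': node['free_ram_mb'],
                'free_disk_gb': node['free_disk_gb'],
                # Aggregates are keyed on the compute *service* host,
                # which is why they misbehave for Ironic nodes.
                'aggregates': aggregates_by_service_host.get(
                    node['host'], []),
            })
        return host_states
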
> > 
> > > * Scheduler runs those HostState objects through filters to remove 
> > > those that don't meet requirements not selected for by placement
> > 
> > Yep.
> > 
> > > * Scheduler runs the remaining HostState objects through weighers to 
> > > order them in terms of best fit.
> > 
> > Yep.
> > 
> > > * Scheduler takes the host at the top of that ranked list, and tries 
> > > to claim the resources in placement. If that fails, there is a race, 
> > > so that HostState is discarded, and the next is selected. This is 
> > > repeated until the claim succeeds.
> > 
> > No, this is not how things work currently. The scheduler does not claim 
> > resources. It selects the top (or random host depending on the selection 
> > strategy) and sends the launch request to the target compute node. The 
> > target compute node then attempts to claim the resources and in doing so 
> > writes records to the compute_nodes table in the Nova cell database as 
> > well as the Placement API for the compute node resource provider.
> 
> Not to nit pick, but today the scheduler sends the selected destinations 
> to the conductor. Conductor looks up the cell that a selected host is 
> in, creates the instance record and friends (bdms) in that cell and then 
> sends the build request to the compute host in that cell.
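
In other words, today's post-scheduling flow is roughly the following. The
scheduler_client/db/compute_rpc objects and their method names are
illustrative stand-ins, not the actual nova interfaces:

    # Rough sketch of today's flow once the scheduler has picked hosts.

    def build_instance(context, request_spec, scheduler_client, db,
                       compute_rpc):
        hosts = scheduler_client.select_destinations(context, request_spec)
        selected = hosts[0]
        # Conductor looks up the cell the selected host lives in ...
        cell = db.get_cell_for_host(selected['host'])
        # ... creates the instance record and friends (BDMs) in that cell ...
        instance = db.create_instance_in_cell(cell, request_spec)
        # ... and sends the build request to the compute host in that cell.
        # The *compute node* then does the claim, writing to the cell's
        # compute_nodes table and to the placement API.
        compute_rpc.build_and_run_instance(selected['host'], instance)
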
> 
> > 
> > > * Scheduler then creates a list of N UUIDs, with the first being the 
> > > selected host, and the rest being alternates consisting of the 
> > > next hosts in the ranked list that are in the same cell as the 
> > > selected host.
> > 
> > This isn't currently how things work, no. This has been discussed, however.
> > 
> > > * Scheduler returns that list to conductor.
> > > * Conductor determines the cell of the selected host, and sends that 
> > > list to the target cell.
> > > * Target cell tries to build the instance on the selected host. If it 
> > > fails, it unclaims the resources for the selected host, and tries to 
> > > claim the resources for the next host in the list. It then tries to 
> > > build the instance on the next host in the list of alternates. Only 
> > > when all alternates fail does the build request fail.
> > 
> > This isn't currently how things work, no. There has been discussion of 
> > having the compute node retry alternatives locally, but nothing more 
> > than discussion.
> 
> Correct that this isn't how things currently work, but it was/is the 
> original plan. And the retry happens within the cell conductor, not on 
> the compute node itself. The top-level conductor is what's getting 
> selected hosts from the scheduler. The cell-level conductor is what's 
> getting a retry request from the compute. The cell-level conductor would 
> deallocate from placement for the currently claimed providers, and then 
> pick one of the alternatives passed down from the top and then make 
> allocations (a claim) against those, then send to an alternative compute 
> host for another build attempt.
> 
> So with this plan, there are two places to make allocations - the 
> scheduler first, and then the cell conductors for retries. This 
> duplication is why some people were originally pushing to have all 
> allocation-related work happen in the conductor service.
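
To make that retry concrete, the cell-level conductor piece of the original
plan would look something like the sketch below. The placement/compute_rpc
method names here are made-up stand-ins, not real nova or placement client
calls:

    # Sketch of the cell-level conductor handling a retry request from a
    # compute host after a failed build.

    def handle_build_retry(instance_uuid, alternates, placement, compute_rpc):
        """Move the build to the next alternate host after a failure."""
        if not alternates:
            raise RuntimeError('all alternates exhausted; the build fails')
        host, allocation_request = alternates.pop(0)
        # Deallocate from placement for the currently claimed providers ...
        placement.delete_allocations(instance_uuid)
        # ... make allocations (a claim) against the alternate's providers ...
        placement.put_allocations(instance_uuid, allocation_request)
        # ... and send the build to the alternate compute host. Another
        # failure there lands us back here with the remaining alternates.
        compute_rpc.build_and_run_instance(host, instance_uuid, alternates)
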
> 
> > > Proposed flow:
> > > * Scheduler gets a req spec from conductor, containing resource 
> > > requirements
> > > * Scheduler sends those requirements to placement
> > > * Placement runs a query to determine the root RPs that can satisfy 
> > > those requirements
> > 
> > Yes.
> > 
> > > * Placement then constructs a data structure for each root provider as 
> > > documented in the spec. [0]
> > 
> > Yes.
> > 
> > > * Placement returns a number of these data structures as JSON blobs. 
> > > Due to the size of the data, a page size will have to be determined, 
> > > and placement will have to either maintain that list of structured 
> > > data for subsequent requests, or re-run the query and only calculate 
> > > the data structures for the hosts that fit in the requested page.
> > 
> > "of these data structures as JSON blobs" is kind of redundant... all our 
> > REST APIs return data structures as JSON blobs.
> > 
> > While we discussed the fact that there may be a lot of entries, we did 
> > not say we'd immediately support a paging mechanism.
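
For anyone who hasn't read the spec yet, the structures being discussed are
roughly of the shape below. This is a paraphrase of [0], not the
authoritative format:

    # Roughly the shape of a GET /allocation_candidates response as
    # proposed; field names and values here are illustrative.
    candidates = {
        'allocation_requests': [
            {'allocations': [
                {'resource_provider': {'uuid': 'compute-node-uuid'},
                 'resources': {'VCPU': 2, 'MEMORY_MB': 2048}},
                {'resource_provider': {'uuid': 'shared-storage-uuid'},
                 'resources': {'DISK_GB': 20}},
            ]},
        ],
        'provider_summaries': {
            'compute-node-uuid': {
                'resources': {
                    'VCPU': {'capacity': 64, 'used': 10},
                    'MEMORY_MB': {'capacity': 131072, 'used': 20480},
                },
            },
        },
    }
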
> 
> I believe we said in the initial version we'd have the configurable 
> limit in the DB API queries, like we have today - the default limit is 
> 1000. There was agreement to eventually build paging support into the API.
> 
> This does make me wonder though what happens when you have 100K or more 
> compute nodes reporting into placement and we limit on the first 1000. 
> Aren't we going to be imposing a packing strategy then just because of 
> how we pull things out of the database for Placement? Although I don't 
> see how that would be any different from before we had Placement, when 
> the nova-scheduler service just did a ComputeNode.get_all() against the 
> nova DB and then filtered/weighed those objects.
> 
> > > * Scheduler continues to request the paged results until it has them all.
> > 
> > See above. Was discussed briefly as a concern but not work to do for 
> > first patches.
> > 
> > > * Scheduler then runs this data through the filters and weighers. No 
> > > HostState objects are required, as the data structures will contain 
> > > all the information that scheduler will need.
> > 
> > No, this isn't correct. The scheduler will have *some* of the 
> > information it requires for weighing from the returned data from the GET 
> > /allocation_candidates call, but not all of it.
> > 
> > Again, operators have insisted on keeping the flexibility currently in 
> > the Nova scheduler to weigh/sort compute nodes by things like thermal 
> > metrics and kinds of data that the Placement API will never be 
> > responsible for.
> > 
> > The scheduler will need to merge information from the 
> > "provider_summaries" part of the HTTP response with information it has 
> > already in its HostState objects (gotten from 
> > ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).
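
So the input to weighing ends up being a merge of the two sources. A
minimal sketch, with plain dicts standing in for the real provider
summaries and HostState objects:

    # Illustrative merge of placement's provider_summaries with data the
    # scheduler already holds in its HostState objects.

    def merge_for_weighing(provider_summaries, host_states_by_uuid):
        merged = []
        for uuid, summary in provider_summaries.items():
            host_state = host_states_by_uuid.get(uuid)
            if host_state is None:
                continue
            merged.append({
                # Inventory/usage comes from placement ...
                'placement_resources': summary['resources'],
                # ... while things placement will never track (thermal
                # metrics, aggregates, etc.) come from the HostState.
                'metrics': host_state.get('metrics', {}),
                'aggregates': host_state.get('aggregates', []),
                'host': host_state['host'],
            })
        return merged
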
> > 
> > > * Scheduler then selects the data structure at the top of the ranked 
> > > list. Inside that structure is a dict of the allocation data that 
> > > scheduler will need to claim the resources on the selected host. If 
> > > the claim fails, the next data structure in the list is chosen, and 
> > > repeated until a claim succeeds.
> > 
> > Kind of, yes. The scheduler will select a *host* that meets its needs.
> > 
> > There may be more than one allocation request that includes that host 
> > resource provider, because of shared providers and (soon) nested 
> > providers. The scheduler will choose one of these allocation requests 
> > and attempt a claim of resources by simply PUT 
> > /allocations/{instance_uuid} with the serialized body of that allocation 
> > request. If 202 returned, cool. If not, repeat for the next allocation 
> > request.
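
That claim loop is easy enough to sketch. 'requests' is used purely for
illustration of the PUT here; the scheduler would of course go through its
own placement client:

    # Sketch of the scheduler-side claim: try each allocation request that
    # includes the chosen host until one PUT succeeds.
    import requests

    def claim(placement_url, instance_uuid, allocation_requests, headers):
        for alloc_req in allocation_requests:
            resp = requests.put(
                '%s/allocations/%s' % (placement_url, instance_uuid),
                json=alloc_req, headers=headers)
            if resp.status_code == 202:
                return alloc_req  # claim succeeded
            # Anything else means we lost a race; try the next candidate.
        return None
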
> > 
> > > * Scheduler then creates a list of N of these data structures, with 
> > > the first being the data for the selected host, and the rest being 
> > > data structures representing alternates consisting of the next hosts 
> > > in the ranked list that are in the same cell as the selected host.
> > 
> > Yes, this is the proposed solution for allowing retries within a cell.
> > 
> > > * Scheduler returns that list to conductor.
> > > * Conductor determines the cell of the selected host, and sends that 
> > > list to the target cell.
> > > * Target cell tries to build the instance on the selected host. If it 
> > > fails, it uses the allocation data in the data structure to unclaim 
> > > the resources for the selected host, and tries to claim the resources 
> > > for the next host in the list using its allocation data. It then tries 
> > > to build the instance on the next host in the list of alternates. Only 
> > > when all alternates fail does the build request fail.
> > 
> > I'll let Dan discuss this last part.
> > 
> > Best,
> > -jay
> > 
> > > [0] https://review.openstack.org/#/c/471927/

I have a document (with a nifty activity diagram in tow) for all the above
available here:

  https://review.openstack.org/475810 

Should be more Google'able than mailing list posts for future us :)

Stephen


