[openstack-dev] [nova][scheduler][placement] Trying to understand the proposed direction
sfinucan at redhat.com
Tue Jun 20 14:09:29 UTC 2017
On Mon, 2017-06-19 at 09:36 -0500, Matt Riedemann wrote:
> On 6/19/2017 9:17 AM, Jay Pipes wrote:
> > On 06/19/2017 09:04 AM, Edward Leafe wrote:
> > > Current flow:
>
> As noted in the nova-scheduler meeting this morning, this should have
> been called "original plan" rather than "current flow", as Jay pointed
> out inline.
>
> > > * Scheduler gets a req spec from conductor, containing resource
> > > requirements
> > > * Scheduler sends those requirements to placement
> > > * Placement runs a query to determine the root RPs that can satisfy
> > > those requirements
> >
> > Not root RPs. Non-sharing resource providers, which currently
> > effectively means compute node providers. Nested resource providers
> > isn't yet merged, so there is currently no concept of a hierarchy of
> > providers.
> >
> > > * Placement returns a list of the UUIDs for those root providers to
> > > scheduler
> >
> > It returns the provider names and UUIDs, yes.
> >
> > > * Scheduler uses those UUIDs to create HostState objects for each
> >
> > Kind of. The scheduler calls ComputeNodeList.get_all_by_uuid(), passing
> > in a list of the provider UUIDs it got back from the placement service.
> > The scheduler then builds a set of HostState objects from the results of
> > ComputeNodeList.get_all_by_uuid().
> >
> > The scheduler also keeps a set of AggregateMetadata objects in memory,
> > including the association of aggregate to host (note: this is the
> > compute node's *service*, not the compute node object itself, thus the
> > reason aggregates don't work properly for Ironic nodes).
> >
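For anyone trying to picture that step, here's a rough sketch of how the
host states get built from the placement results. Method and attribute
names are approximate (the real logic lives in
nova/scheduler/host_manager.py), so treat this as the shape of it rather
than the actual code:

    # Approximate sketch only -- simplified from what HostManager does;
    # names may not match the real code exactly.
    def build_host_states(ctxt, placement_results, aggregates_by_host):
        provider_uuids = [rp['uuid'] for rp in placement_results]
        compute_nodes = ComputeNodeList.get_all_by_uuid(ctxt, provider_uuids)

        host_states = []
        for cn in compute_nodes:
            state = HostState(cn.host, cn.hypervisor_hostname)
            state.update_from_compute_node(cn)
            # Aggregates hang off the compute node's *service* host, not
            # the node itself -- hence the Ironic oddity noted above.
            state.aggregates = aggregates_by_host.get(cn.host, [])
            host_states.append(state)
        return host_states
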
> > > * Scheduler runs those HostState objects through filters to remove
> > > those that don't meet requirements not selected for by placement
> >
> > Yep.
> >
> > > * Scheduler runs the remaining HostState objects through weighers to
> > > order them in terms of best fit.
> >
> > Yep.
> >
> > > * Scheduler takes the host at the top of that ranked list, and tries
> > > to claim the resources in placement. If that fails, there is a race,
> > > so that HostState is discarded, and the next is selected. This is
> > > repeated until the claim succeeds.
> >
> > No, this is not how things work currently. The scheduler does not claim
> > resources. It selects the top host (or a random one, depending on the
> > selection strategy) and sends the launch request to the target compute
> > node. The
> > target compute node then attempts to claim the resources and in doing so
> > writes records to the compute_nodes table in the Nova cell database as
> > well as the Placement API for the compute node resource provider.
>
> Not to nit pick, but today the scheduler sends the selected destinations
> to the conductor. Conductor looks up the cell that a selected host is
> in, creates the instance record and friends (bdms) in that cell and then
> sends the build request to the compute host in that cell.
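In other words, roughly this shape in the conductor (names are from
memory, so don't hold me to the exact signatures):

    # Rough, from-memory sketch of the conductor step described above;
    # not a copy of the real code.
    def create_in_cell_and_build(ctxt, instance, bdms, selected_host):
        mapping = objects.HostMapping.get_by_host(ctxt, selected_host)

        with nova_context.target_cell(ctxt, mapping.cell_mapping) as cctxt:
            # The instance record and friends (BDMs etc.) are created in
            # the target cell's database...
            instance.create()
            for bdm in bdms:
                bdm.create()

            # ...and the build request goes over RPC to the compute host
            # in that cell.
            compute_rpcapi.build_and_run_instance(
                cctxt, instance, host=selected_host,
                block_device_mapping=bdms)
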
>
> >
> > > * Scheduler then creates a list of N UUIDs, with the first being the
> > > selected host, and the rest being alternates consisting of the
> > > next hosts in the ranked list that are in the same cell as the
> > > selected host.
> >
> > This isn't currently how things work, no. This has been discussed, however.
> >
> > > * Scheduler returns that list to conductor.
> > > * Conductor determines the cell of the selected host, and sends that
> > > list to the target cell.
> > > * Target cell tries to build the instance on the selected host. If it
> > > fails, it unclaims the resources for the selected host, and tries to
> > > claim the resources for the next host in the list. It then tries to
> > > build the instance on the next host in the list of alternates. Only
> > > when all alternates fail does the build request fail.
> >
> > This isn't currently how things work, no. There has been discussion of
> > having the compute node retry alternatives locally, but nothing more
> > than discussion.
>
> Correct that this isn't how things currently work, but it was/is the
> original plan. And the retry happens within the cell conductor, not on
> the compute node itself. The top-level conductor is what's getting
> selected hosts from the scheduler. The cell-level conductor is what's
> getting a retry request from the compute. The cell-level conductor would
> deallocate from placement for the currently claimed providers, and then
> pick one of the alternatives passed down from the top and then make
> allocations (a claim) against those, then send to an alternative compute
> host for another build attempt.
>
> So with this plan, there are two places to make allocations - the
> scheduler first, and then the cell conductors for retries. This
> duplication is why some people were originally pushing to have all
> allocation-related work happen in the conductor service.
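So the planned retry path in the cell conductor would presumably look
something like this -- pure pseudo-Python, since none of it exists yet
and every helper name here is made up:

    # Pseudo-code for the planned cell-conductor retry path; nothing
    # below exists yet and the helper names are invented.
    def retry_on_alternate(cctxt, instance, alternates, placement):
        # Drop the allocations (the claim) held for the host that just
        # failed the build.
        placement.delete_allocations(instance.uuid)

        for alt in alternates:
            # Try to claim the alternate host's resources in placement.
            if not placement.put_allocations(instance.uuid,
                                             alt.allocation_request):
                # Lost a race for this host -- move on to the next one.
                continue

            # Claim succeeded: send the build to the alternate host.
            compute_rpcapi.build_and_run_instance(
                cctxt, instance, host=alt.host)
            return

        # Only when all alternates are exhausted does the build fail.
        raise exception.MaxRetriesExceeded(reason='no alternates left')
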
>
> > > Proposed flow:
> > > * Scheduler gets a req spec from conductor, containing resource
> > > requirements
> > > * Scheduler sends those requirements to placement
> > > * Placement runs a query to determine the root RPs that can satisfy
> > > those requirements
> >
> > Yes.
> >
> > > * Placement then constructs a data structure for each root provider as
> > > documented in the spec. [0]
> >
> > Yes.
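From my reading of the spec, each response would come back looking
something like the below (simplified, and obviously subject to change
while the spec is still under review):

    {
        "allocation_requests": [
            {
                "allocations": [
                    {
                        "resource_provider": {"uuid": "<compute-rp-uuid>"},
                        "resources": {
                            "VCPU": 1,
                            "MEMORY_MB": 512,
                            "DISK_GB": 20
                        }
                    }
                ]
            }
        ],
        "provider_summaries": {
            "<compute-rp-uuid>": {
                "resources": {
                    "VCPU": {"capacity": 64, "used": 4},
                    "MEMORY_MB": {"capacity": 32768, "used": 2048},
                    "DISK_GB": {"capacity": 1000, "used": 40}
                }
            }
        }
    }
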
> >
> > > * Placement returns a number of these data structures as JSON blobs.
> > > Due to the size of the data, a page size will have to be determined,
> > > and placement will have to either maintain that list of structured
> > > data for subsequent requests, or re-run the query and only calculate
> > > the data structures for the hosts that fit in the requested page.
> >
> > "of these data structures as JSON blobs" is kind of redundant... all our
> > REST APIs return data structures as JSON blobs.
> >
> > While we discussed the fact that there may be a lot of entries, we did
> > not say we'd immediately support a paging mechanism.
>
> I believe we said in the initial version we'd have the configurable
> limit in the DB API queries, like we have today - the default limit is
> 1000. There was agreement to eventually build paging support into the API.
>
> This does make me wonder though what happens when you have 100K or more
> compute nodes reporting into placement and we limit on the first 1000.
> Aren't we going to be imposing a packing strategy then just because of
> how we pull things out of the database for Placement? Although I don't
> see how that would be any different from before we had Placement and the
> nova-scheduler service just did a ComputeNode.get_all() to the nova DB
> and then filtered/weighed those objects.
>
> > > * Scheduler continues to request the paged results until it has them all.
> >
> > See above. Was discussed briefly as a concern but not work to do for
> > first patches.
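If/when paging does happen, I'd expect the scheduler side to be a dumb
loop over a marker, something like the below -- completely hypothetical,
since neither a 'limit' nor a 'marker' query parameter exists today:

    # Hypothetical paging loop; no such query parameters exist in the
    # placement API today.
    candidates = []
    marker = None
    while True:
        params = {'resources': resources_qs, 'limit': 1000}
        if marker:
            params['marker'] = marker
        body = placement_client.get('/allocation_candidates',
                                    params=params).json()
        candidates.extend(body['allocation_requests'])
        marker = body.get('next_marker')
        if not marker:
            break
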
> >
> > > * Scheduler then runs this data through the filters and weighers. No
> > > HostState objects are required, as the data structures will contain
> > > all the information that the scheduler will need.
> >
> > No, this isn't correct. The scheduler will have *some* of the
> > information it requires for weighing from the returned data from the GET
> > /allocation_candidates call, but not all of it.
> >
> > Again, operators have insisted on keeping the flexibility currently in
> > the Nova scheduler to weigh/sort compute nodes by things like thermal
> > metrics and kinds of data that the Placement API will never be
> > responsible for.
> >
> > The scheduler will need to merge information from the
> > "provider_summaries" part of the HTTP response with information it has
> > already in its HostState objects (gotten from
> > ComputeNodeList.get_all_by_uuid() and AggregateMetadataList).
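i.e. the merge step would be something like this (invented attribute
names; the point is just that the two views get stitched together by
provider UUID):

    # Sketch of merging placement's provider_summaries into the
    # scheduler's own HostState objects; attribute names are invented.
    summaries = body['provider_summaries']

    for host_state in host_states:
        summary = summaries.get(host_state.uuid)
        if summary is None:
            continue
        # Capacity/usage figures come from placement...
        host_state.placement_resources = summary['resources']
        # ...while data placement will never track (thermal metrics and
        # the like, used by some weighers) stays on the HostState,
        # populated from the compute node and aggregate records.
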
> >
> > > * Scheduler then selects the data structure at the top of the ranked
> > > list. Inside that structure is a dict of the allocation data that
> > > scheduler will need to claim the resources on the selected host. If
> > > the claim fails, the next data structure in the list is chosen, and
> > > repeated until a claim succeeds.
> >
> > Kind of, yes. The scheduler will select a *host* that meets its needs.
> >
> > There may be more than one allocation request that includes that host
> > resource provider, because of shared providers and (soon) nested
> > providers. The scheduler will choose one of these allocation requests
> > and attempt a claim of resources by simply PUT
> > /allocations/{instance_uuid} with the serialized body of that allocation
> > request. If 202 returned, cool. If not, repeat for the next allocation
> > request.
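On the scheduler side, then, the claim attempt is basically this
(simplified; 'placement_client' stands in for whatever the report client
ends up exposing):

    # Simplified claim loop against PUT /allocations/{instance_uuid};
    # 'placement_client' is a stand-in, not an existing helper.
    def claim_resources(placement_client, instance_uuid, alloc_requests):
        for alloc_req in alloc_requests:
            resp = placement_client.put(
                '/allocations/%s' % instance_uuid, json=alloc_req)
            if resp.status_code < 300:
                # 2xx: the claim landed against this set of providers.
                return alloc_req
            # Otherwise we lost a race (or the providers no longer fit);
            # fall through and try the next allocation request.
        return None
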
> >
> > > * Scheduler then creates a list of N of these data structures, with
> > > the first being the data for the selected host, and the rest being
> > > data structures representing alternates consisting of the next hosts
> > > in the ranked list that are in the same cell as the selected host.
> >
> > Yes, this is the proposed solution for allowing retries within a cell.
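Which, as I understand it, boils down to something like this (invented
names again):

    # Invented names; the idea is "selected host first, then up to N - 1
    # further candidates from the same cell, in weighed order".
    def with_alternates(weighed_candidates, selected, max_alternates):
        same_cell = [c for c in weighed_candidates
                     if c.cell_uuid == selected.cell_uuid
                     and c is not selected]
        return [selected] + same_cell[:max_alternates]
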
> >
> > > * Scheduler returns that list to conductor.
> > > * Conductor determines the cell of the selected host, and sends that
> > > list to the target cell.
> > > * Target cell tries to build the instance on the selected host. If it
> > > fails, it uses the allocation data in the data structure to unclaim
> > > the resources for the selected host, and tries to claim the resources
> > > for the next host in the list using its allocation data. It then tries
> > > to build the instance on the next host in the list of alternates. Only
> > > when all alternates fail does the build request fail.
> >
> > I'll let Dan discuss this last part.
> >
> > Best,
> > -jay
> >
> > > [0] https://review.openstack.org/#/c/471927/
I have a document (with a nifty activity diagram in tow) for all the above
available here:
https://review.openstack.org/475810
Should be more Google'able than mailing list posts for future us :)
Stephen