[openstack-dev] [nova] Update on scheduler and resource tracker progress

Clint Byrum clint at fewbar.com
Fri Feb 19 00:16:28 UTC 2016

Excerpts from Jay Pipes's message of 2016-02-18 11:33:04 -0800:
> On 02/12/2016 01:47 PM, Clint Byrum wrote:
> > Excerpts from Jay Pipes's message of 2016-02-11 12:24:04 -0800:
> >> Hello all,
> >>
> >> Performance working group, please pay attention to Chapter 2 in the
> >> details section.
> >>
> >
> > <snipped the part you let us not pay attention to. ;)>
> >
> >> Chapter 2 - Addressing performance and scale
> >> ============================================
> >>
> >> One of the significant performance problems with the Nova scheduler is
> >> the fact that for every call to the select_destinations() RPC API method
> >> -- which itself is called at least once every time a launch or migration
> >> request is made -- the scheduler grabs all records for all compute nodes
> >> in the deployment. Once retrieving all these compute node records, the
> >> scheduler runs each through a set of filters to determine which compute
> >> nodes have the required capacity to service the instance's requested
> >> resources. Having the scheduler continually retrieve every compute node
> >> record on each request to select_destinations() is extremely
> >> inefficient. The greater the number of compute nodes, the bigger the
> >> performance and scale problem this becomes.
> >>
> >> On a loaded cloud deployment -- say there are 1000 compute nodes and 900
> >> of them are fully loaded with active virtual machines -- the scheduler
> >> is still going to retrieve all 1000 compute node records on every
> >> request to select_destinations() and process each one of those records
> >> through all scheduler filters. Clearly, if we could filter the amount of
> >> compute node records that are returned by removing those nodes that do
> >> not have available capacity, we could dramatically reduce the amount of
> >> work that each call to select_destinations() would need to perform.
> >>
> >> The resource-providers-scheduler blueprint attempts to address the above
> >> problem by replacing a number of the scheduler filters that currently
> >> run *after* the database has returned all compute node records with
> >> instead a series of WHERE clauses and join conditions on the database
> >> query. The idea here is to winnow the number of returned compute node
> >> results as much as possible. The fewer records the scheduler must
> >> post-process, the faster the performance of each individual call to
> >> select_destinations().
> >
> > This is great, and I think it is the way to go. However, I'm not sure how
> > dramatic the overall benefit will be, since it also shifts some load from
> >  reads to writes.
> No, the above is *only* talking about the destination host selection 
> process, not the claim process. There are no writes here at all.
>  From my benchmarking, I see a 7.0% to 38.6% increase in the average 
> time to perform the destination selection operation when doing the 
> resource filtering on the Python side as opposed to in the DB side.

I'm talking about the destination host selection process too, but I was
just assuming you'd need compound indexes to make this really efficient,
and I assumed that would mean more indexes than exist today.

So, I guess what I may have missed was that these indexes already exist.

> As you would expect, the larger the size of the deployment, the greater 
> the performance benefit you see using the DB for querying instead of 
> Python (lower numbers are better here):
> DB or Python   # Compute Nodes   Avg Time to Select    Delta
> ------------------------------------------------------------
> DB             100               0.021035
> Python         100               0.022517              +7.0%
> DB             200               0.023370
> Python         200               0.026526             +13.5%
> DB             400               0.027638
> Python         400               0.034666             +25.4%
> DB             800               0.034814
> Python         800               0.048271             +38.6%
> The above was for a serialized scenario (1 scheduler process). Parallel 
> operations at 2, 4 and 8 scheduler processes were virtually identical as 
> can be expected since this is testing the read operation performance, 
> not the write operations.

I am not surprised at these results at all. However, I am still a little
wary of anything that happens faster in a central resource. Faster
is great, but it also means we now have to scale _up_ that central
resource. Hopefully it is so much more efficient to read indexes from
that DB instead of filter lists in python that we get a very large margin
between what lots of slow python processes could have done and what one
very fast mysqld can do.

>  > With 1000 active compute nodes updating their status,
> > each index added will be 1000 more index writes per update period. Still
> > a net win, but I'm always cautious about shifting things to more writes
> > on the database server. That said, I do think it will be a win and should
> > be done.
> Again, this isn't what the "move the filtering to the database query" 
> proposal is about :) You are describing the *claim* operation above, not 
> the select-destination operation.
> The *current* scheduler design is what has each distributed compute node 
> sending updates to the scheduler^Wdatabase each time a claim occurs. 
> What the second part of my proposal does is move the claim from the 
> distributed compute nodes and into the scheduler, which should allow the 
> scheduler to operate on non-stale data (which will reduce the number of 
> long retry operations). More below.
> >> The second major scale problem with the current Nova scheduler design
> >> has to do with the fact that the scheduler does *not* actually claim
> >> resources on a provider. Instead, the scheduler selects a destination
> >> host to place the instance on and the Nova conductor then sends a
> >> message to that target host which attempts to spawn the instance on its
> >> hypervisor. If the spawn succeeds, the target compute host updates the
> >> Nova database and decrements its count of available resources. These
> >> steps (from nova-scheduler to nova-conductor to nova-compute to
> >> database) all take some not insignificant amount of time. During this
> >> time window, a different scheduler process may pick the exact same
> >> target host for a like-sized launch request. If there is only room on
> >> the target host for one of those size requests [5], one of those spawn
> >> requests will fail and trigger a retry operation. This retry operation
> >> will attempt to repeat the scheduler placement decisions (by calling
> >> select_destinations()).
> >>
> >> This retry operation is relatively expensive and needlessly so: if the
> >> scheduler claimed the resources on the target host before sending its
> >> pick back to the scheduler, then the chances of producing a retry will
> >> be almost eliminated [6]. The resource-providers-scheduler blueprint
> >> attempts to remedy this second scaling design problem by having the
> >> scheduler write records to the allocations table before sending the
> >> selected target host back to the Nova conductor.
> >
> > *This*, to me, is the thing that makes the scheduler dramatically more
> > scalable. The ability to run as many schedulers as I expect to need to
> > respond to user requests in a reasonable amount of time, is the key to
> > victory here.
> >
> > However, I wonder how you will avoid serialization or getting into
> > a much tighter retry race for the claiming operations. There's talk
> > in the spec of inserting allocations in a table atomically. However,
> > with multiple schedulers, you'll still have the problem where one will
> > claim and the others will need to know that they cannot.
> This is handled in my proposal with a single database transaction that 
> looks at a "generation" column on each resource provider and rolls back 
> the transaction if the generation is not the same as what was read 
> during the select-destination process.
>  > We can talk
> > about nuts and bolts, but there's really only two ways this can work:
> > exclusive locking, or compare and swap retry loops.
> Yup. Compare and swap is what I propose and have implemented in the 
> placement-bench project here:
> https://github.com/jaypipes/placement-bench/blob/master/placement.py#L123-L129
> triggering a retry here:
> https://github.com/jaypipes/placement-bench/blob/master/placement.py#L212-L217
> Exclusive locking -- i.e. SELECT FOR UPDATE -- won't work on Galera 
> systems in multi-writer mode, as you already know :)

I would actually disagree here. It can totally work, and in fact, Galera
is basically doing _exactly_ what you describe with the generation
column, inside its own mechanisms, it's just using a rather obtuse way
of signalling to you that the generations got of out sync, by saying
you had a deadlock and automatically rolling back when you thought you
wanted to commit.

They're both very similar in mechanism, but one is buried deep in
Galera, and one is easier to read and has the benefit of being an
explicit approach.

> In my initial benchmarks, I have found that this compare and swap 
> approach works OK at scale (higher numbers are better here):
> # Compute Nodes   Successful claims per second
> 100               54.1
> 200               68.9
> 400               51.3
> 800               34.3
> All of the above numbers are for 8 scheduler processes, using a 
> pack-first placement strategy and using no partitioning strategy (so, 
> pretty much worst-case scenario).
> Using a simple modulo partitioning strategy but staying with the 
> pack-first placement strategy, I got much better results:
> # Compute Nodes   Successful claims per second
> 100               97.1
> 200               124.5
> 400               115.1
> 800               89.4

That's about 50 times better than what I saw on Kilo with 2 schedulers
and 1000 simulated nodes, so huzzah!

More information about the OpenStack-dev mailing list