[openstack-dev] Scheduler proposal

Clint Byrum clint at fewbar.com
Sun Oct 11 06:47:29 UTC 2015

Excerpts from Ian Wells's message of 2015-10-09 19:14:17 -0700:
> On 9 October 2015 at 18:29, Clint Byrum <clint at fewbar.com> wrote:
> > Instead of having the scheduler do all of the compute node inspection
> > and querying though, you have the nodes push their stats into something
> > like Zookeeper or consul, and then have schedulers watch those stats
> > for changes to keep their in-memory version of the data up to date. So
> > when you bring a new one online, you don't have to query all the nodes,
> > you just scrape the data store, which all of these stores (etcd, consul,
> > ZK) are built to support atomically querying and watching at the same
> > time, so you can have a reasonable expectation of correctness.
> >
> We have to be careful about our definition of 'correctness' here.  In
> practice, the data is never going to be perfect because compute hosts
> update periodically and the information is therefore always dated.  With
> ZK, it's going to be strictly consistent with regard to the updates from
> the compute hosts, but again that doesn't really matter too much because
> the scheduler is going to have to make a best effort job with a mixed bag
> of information anyway.

I was actually thinking nodes would update ZK _when they are changed
themselves_. As in, the scheduler would reduce the available resources
upon allocating them, and the nodes would only update them after
reclaiming those resources or when they start fresh.
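
A minimal sketch of that division of labour, assuming a plain dict as a
stand-in for the shared store. The class and method names here are
hypothetical and are not Nova, ZooKeeper or Consul APIs:

```python
class ResourceStore:
    """In-memory stand-in for a shared store (hypothetical, not a real API)."""

    def __init__(self):
        self.free_mb = {}  # node name -> free RAM in MB

    def node_start(self, node, total_mb):
        # A node publishes its full capacity when it starts fresh.
        self.free_mb[node] = total_mb

    def scheduler_allocate(self, node, mb):
        # The scheduler, not the node, reduces the counter on placement.
        if self.free_mb.get(node, 0) < mb:
            return False
        self.free_mb[node] -= mb
        return True

    def node_reclaim(self, node, mb):
        # The node writes again only after it has reclaimed the resources.
        self.free_mb[node] += mb


store = ResourceStore()
store.node_start("compute-1", 4096)
placed = store.scheduler_allocate("compute-1", 1024)    # fits
rejected = store.scheduler_allocate("compute-1", 8192)  # does not fit
store.node_reclaim("compute-1", 1024)                   # instance deleted
```

Between the allocation and the reclaim the node writes nothing, so there
is no periodic update traffic in this model.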

> In fact, putting ZK in the middle basically means that your compute hosts
> now synchronously update a majority of nodes in a minimum 3 node quorum -
> not the fastest form of update - and then the quorum will see to notifying
> the schedulers.  In practice this is just a store-and-fanout again. Once
> more it's not clear to me whether the store serves much use, and as for the
> fanout, I wonder if we'll need >>3 schedulers running so that this is
> reducing communication overhead.

This is indeed store and fanout. Except unlike MySQL+RabbitMQ, we're
using a service optimized for store and fanout. :)

All of the DLM-ish primitive things we've talked about can handle a
ton of churn in what turns out to be very small amounts of data. The
difference here is that instead of a scheduler querying for the data,
it has already received it because it was watching for changes. And
if some of it hasn't changed, there's no query, and there's no fanout,
and the local cache is just used.

So yes, if we did things the same as now, this would be terrible. But we
wouldn't. We'd let ZK or Consul do this for us, because they are better
than anything we can build to do this.

> > Even if you figured out how to make the in-memory scheduler crazy fast,
> > there's still value in concurrency for other reasons. No matter how
> > fast you make the scheduler, you'll be slave to the response time of
> > a single scheduling request. If you take 1ms to schedule each node
> > (including just reading the request and pushing out your scheduling
> > result!) you will never achieve greater than 1000/s. 1ms is way lower
> > than it's going to take just to shove a tiny message into RabbitMQ or
> > even 0mq. So I'm pretty sure this is o-k for small clouds, but would be
> > a disaster for a large, busy cloud.
> >
> Per before, my suggestion was that every scheduler tries to maintain a copy
> of the cloud's state in memory (in much the same way, per the previous
> example, as every router on the internet tries to make a route table out of
> what it learns from BGP).  They don't have to be perfect.  They don't have
> to be in sync.  As long as there's some variability in the decision making,
> they don't have to update when another scheduler schedules something (and
> you can make the compute node send an immediate update when a new VM is
> run, anyway).  They all stand a good chance of scheduling VMs well
> simultaneously.

I'm quite in favor of eventual consistency and retries. Even if we had
a system of perfect updating of all state records everywhere, it would
break sometimes and I'd still want to not trust any record of state as
being correct for the entire distributed system. However, there is an
efficiency win gained by staying _close_ to correct. It is actually a
function of the expected entropy. The more concurrent schedulers, the
more entropy there will be to deal with.
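
To put a rough shape on that, here is a deliberately pessimistic toy,
assuming every scheduler works from the same stale snapshot and makes the
same deterministic pick (so none of the variability Ian describes). The
function name and the one-slot-per-node model are mine, purely for
illustration:

```python
def rejected_placements(num_schedulers, num_nodes):
    """Count rejections when all schedulers share one stale view per round."""
    free = [True] * num_nodes   # true state: one slot per node
    pending = num_schedulers    # each scheduler has one request to place
    rejected = 0
    while pending and any(free):
        snapshot = list(free)          # everyone sees the same view
        target = snapshot.index(True)  # and picks the same node
        # The first claim wins; the others are rejected and must retry.
        free[target] = False
        rejected += pending - 1
        pending -= 1
    return rejected


one = rejected_placements(1, 100)
four = rejected_placements(4, 100)
twenty = rejected_placements(20, 100)
```

In this worst case the wasted attempts grow as n*(n-1)/2 with the number
of schedulers, which is why staying close to correct pays off.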

> > If, however, you can have 20 schedulers that all take 10ms on average,
> > and have the occasional lock contention for a resource counter resulting
> > in 100ms, now you're at 2000/s minus the lock contention rate. This
> > strategy would scale better with the number of compute nodes, since
> > more nodes means more distinct locks, so you can scale out the number
> > of running servers separate from the number of scheduling requests.
> >
> If you have 20 schedulers that take 1ms on average, and there's absolutely
> no lock contention, then you're at 20,000/s.  (Unfair, granted, since what
> I'm suggesting is more likely to make rejected scheduling decisions, but
> they could be rare.)
> But to be fair, we're throwing made up numbers around at this point.  Maybe
> it's time to work out how to test this for scale in a harness - which is
> the bit of work we all really need to do this properly, or there's no proof
> we've actually helped - and leave people to code their ideas up?
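
For reference, the back-of-envelope figures quoted above work out as
follows. This is only arithmetic on the made-up numbers from the thread;
the contention fraction in the last call is an extra assumption of mine
and is not a figure anyone measured.

```python
def max_rate(schedulers, ms_per_request):
    """Upper bound on requests/second if schedulers never wait on each other."""
    return schedulers * 1000 / ms_per_request


def avg_latency_ms(base_ms, contended_ms, contended_fraction):
    """Mean latency when some fraction of requests hit lock contention."""
    return (1 - contended_fraction) * base_ms + contended_fraction * contended_ms


single = max_rate(1, 1)        # one scheduler at 1ms per request
with_locks = max_rate(20, 10)  # 20 schedulers at 10ms each
lock_free = max_rate(20, 1)    # 20 schedulers at 1ms each

# Assumption for illustration: 10% of requests wait 100ms on a lock.
contended = max_rate(20, avg_latency_ms(10, 100, 0.10))
```

The first three values are 1000, 2000 and 20000 requests per second,
matching the numbers in the thread. The last shows how quickly the
"minus the lock contention rate" term can eat into the 2000/s figure.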

I'm working on adding meters for the rates and volumes of messages and
queries that the system generates right now, for performance purposes.
Rally, though, is the place where I'd go to ask "how fast can we schedule
things right now?".
