[openstack-dev] Scheduler proposal

Ian Wells ijw.ubuntu at cack.org.uk
Sat Oct 10 02:14:17 UTC 2015

On 9 October 2015 at 18:29, Clint Byrum <clint at fewbar.com> wrote:

> Instead of having the scheduler do all of the compute node inspection
> and querying though, you have the nodes push their stats into something
> like Zookeeper or consul, and then have schedulers watch those stats
> for changes to keep their in-memory version of the data up to date. So
> when you bring a new one online, you don't have to query all the nodes,
> you just scrape the data store, which all of these stores (etcd, consul,
> ZK) are built to support atomically querying and watching at the same
> time, so you can have a reasonable expectation of correctness.

We have to be careful about our definition of 'correctness' here.  In
practice, the data is never going to be perfect because compute hosts
update periodically and the information is therefore always dated.  With
ZK, it's going to be strictly consistent with regard to the updates from
the compute hosts, but again that doesn't really matter too much because
the scheduler is going to have to make a best effort job with a mixed bag
of information anyway.

In fact, putting ZK in the middle basically means that your compute hosts
now synchronously update a majority of nodes in a minimum 3 node quorum -
not the fastest form of update - and then the quorum will see to notifying
the schedulers.  In practice this is just a store-and-fanout again. Once
more it's not clear to me whether the store serves much use, and as for the
fanout, I wonder if we'll need >>3 schedulers running so that this is
reducing communication overhead.

Even if you figured out how to make the in-memory scheduler crazy fast,
> There's still value in concurrency for other reasons. No matter how
> fast you make the scheduler, you'll be slave to the response time of
> a single scheduling request. If you take 1ms to schedule each node
> (including just reading the request and pushing out your scheduling
> result!) you will never achieve greater than 1000/s. 1ms is way lower
> than it's going to take just to shove a tiny message into RabbitMQ or
> even 0mq. So I'm pretty sure this is o-k for small clouds, but would be
> a disaster for a large, busy cloud.

Per before, my suggestion was that every scheduler tries to maintain a copy
of the cloud's state in memory (in much the same way, per the previous
example, as every router on the internet tries to make a route table out of
what it learns from BGP).  They don't have to be perfect.  They don't have
to be in sync.  As long as there's some variability in the decision making,
they don't have to update when another scheduler schedules something (and
you can make the compute node send an immediate update when a new VM is
run, anyway).  They all stand a good chance of scheduling VMs well

If, however, you can have 20 schedulers that all take 10ms on average,
> and have the occasional lock contention for a resource counter resulting
> in 100ms, now you're at 2000/s minus the lock contention rate. This
> strategy would scale better with the number of compute nodes, since
> more nodes means more distinct locks, so you can scale out the number
> of running servers separate from the number of scheduling requests.

If you have 20 schedulers that take 1ms on average, and there's absolutely
no lock contention, then you're at 20,000/s.  (Unfair, granted, since what
I'm suggesting is more likely to make rejected scheduling decisions, but
they could be rare.)

But to be fair, we're throwing made up numbers around at this point.  Maybe
it's time to work out how to test this for scale in a harness - which is
the bit of work we all really need to do this properly, or there's no proof
we've actually helped - and leave people to code their ideas up?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20151009/1ed97101/attachment.html>

More information about the OpenStack-dev mailing list