[openstack-dev] A simple way to improve nova scheduler

Joe Gordon joe.gordon0 at gmail.com
Wed Jul 24 18:43:46 UTC 2013


On Wed, Jul 24, 2013 at 12:24 PM, Russell Bryant <rbryant at redhat.com> wrote:

> On 07/23/2013 06:00 PM, Clint Byrum wrote:
> > This is really interesting work, thanks for sharing it with us. The
> > discussion that has followed has brought up some thoughts I've had for
> > a while about this choke point in what is supposed to be an extremely
> > scalable cloud platform (OpenStack).
> >
> > I feel like the discussions have all been centered around making "the"
> > scheduler(s) intelligent.  There seems to be a commonly held belief that
> > scheduling is a single step, and should be done with as much knowledge
> > of the system as possible by a well informed entity.
> >
> > Can you name for me one large scale system that has a single entity,
> > human or computer, that knows everything about the system and can make
> > good decisions quickly?
> >
> > This problem is screaming to be broken up, de-coupled, and distributed.
> >
> > I keep asking myself these questions:
> >
> > Why are all of the compute nodes informing all of the schedulers?
> >
> > Why are all of the schedulers expecting to know about all of the compute
> > nodes?
>

So the scheduler can try to find the globally optimal solution; see below.
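
To make that concrete, here is a very rough sketch (made-up names, not
nova's actual code) of what a global filter-and-weigh pass looks like; it
only works because the scheduler sees every compute node:

    # Rough sketch (not nova's actual code) of the "global" approach: one
    # scheduler looks at every compute node, filters out the ones that
    # cannot satisfy the request, and weighs the rest to pick the single
    # best host.
    def pick_globally_best_host(all_hosts, request, filters, weighers):
        # Keep only hosts that can satisfy the request at all.
        candidates = [h for h in all_hosts
                      if all(f(h, request) for f in filters)]
        if not candidates:
            raise Exception("No valid host found")
        # Score every remaining host and take the best one -- this is what
        # requires the scheduler to know about *all* compute nodes.
        return max(candidates,
                   key=lambda h: sum(w(h, request) for w in weighers))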


> >
> > Can we break this problem up into simpler problems and distribute the
> > load to the entire system?
> >
> > This has been bouncing around in my head for a while now, but as a
> > shallow observer of nova dev, I feel like there are some well known
> > scaling techniques which have not been brought up. Here is my idea,
> > forgive me if I have glossed over something or missed a huge hole:
> >
> > * Schedulers break up compute nodes by hash table, only caring about
> >   those in their hash table.
> > * Schedulers, upon claiming a compute node by hash table, poll compute
> >   node directly for its information.
>

For people who want to schedule on information that is constantly changing
(such as CPU load, memory usage, etc.), how often would you poll?
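
For reference, a purely hypothetical sketch of what that hash-table claiming
plus direct polling could look like (none of these names are existing nova
interfaces, and how small the poll interval can be is exactly the open
question above):

    import hashlib
    import time

    # Hypothetical: each scheduler claims the compute nodes whose hash
    # falls in its bucket, and then polls only those nodes directly.
    def my_compute_nodes(all_node_names, num_schedulers, my_index):
        """Return the subset of compute nodes this scheduler claims."""
        def bucket(name):
            digest = hashlib.md5(name.encode("utf-8")).hexdigest()
            return int(digest, 16) % num_schedulers
        return [n for n in all_node_names if bucket(n) == my_index]

    def poll_loop(node_names, poll_fn, interval=60):
        """Periodically ask only our own nodes for their current state.

        Fast-changing data (CPU load, free memory) goes stale quickly,
        so the safe value of `interval` is the open question.
        """
        state = {}
        while True:
            for name in node_names:
                state[name] = poll_fn(name)  # e.g. an RPC call to the node
            time.sleep(interval)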


> > * Requests to boot go into fanout.
> > * Schedulers get request and try to satisfy using only their own compute
> >   nodes.
> > * Failure to boot results in re-insertion in the fanout.
>

With this model we lose the ability to find the globally optimal host to
schedule on, and can only find a locally optimal solution.  That sounds like
a reasonable trade-off at scale.  Going forward I can imagine nova having
several different schedulers for different requirements.  Someone deploying
at massive scale will probably accept a locally optimal solution (and a
scheduler that scales better), but someone with a smaller cloud will want
the globally optimal solution.
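
To illustrate the trade-off, here is a hypothetical sketch of that
fanout/retry loop.  A plain queue stands in for the real fanout exchange,
and `can_host` / `boot_on` are assumed helpers, not real nova code:

    import queue

    boot_requests = queue.Queue()

    def scheduler_loop(my_hosts, can_host, boot_on, max_attempts=10):
        """Each scheduler only looks at its own partition of compute nodes."""
        while True:
            request = boot_requests.get()  # request is a plain dict here
            candidates = [h for h in my_hosts if can_host(h, request)]
            if candidates:
                # Best host *within this partition* -- optimal, not the
                # global optimum the current scheduler aims for.
                boot_on(candidates[0], request)
            elif request.get("attempts", 0) < max_attempts:
                request["attempts"] = request.get("attempts", 0) + 1
                boot_requests.put(request)  # re-insert for another try
            else:
                print("No valid host found for request %s" % request)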


> >
> > This gives up the certainty that the scheduler will find a compute node
> > for a boot request on the first try. It is also possible that a request
> > gets unlucky and takes a long time to find the one scheduler that has
> > the one last "X" resource that it is looking for. There are some further
> > optimization strategies that can be employed (like queues based on hashes
> > already tried.. etc).
> >
> > Anyway, I don't see any point in trying to hot-rod the intelligent
> > scheduler to go super fast, when we can just optimize for having many
> > many schedulers doing the same body of work without blocking and without
> > pounding a database.
>
> These are some *very* good observations.  I'd like all of the nova folks
> interested in this area to give some deep consideration of this type of
> approach.
>
>
I agree an approach like this is very interesting and is something worth
exploring, especially at the summit.  There are some clear pros and cons to
an approach like this.  For example, it will scale better, but cannot find
the optimal node to schedule on.  My question is, at what scale does it make
sense to adopt an approach like this?  And how can we improve our current
scheduler to scale better, even though it will never scale as well as the
idea proposed here?

While talking about scale there are some other big issues, such as RPC, that
need to be sorted out as well.


>  --
> Russell Bryant
>