[openstack-dev] A simple way to improve nova scheduler

Clint Byrum clint at fewbar.com
Tue Jul 23 22:00:10 UTC 2013

Excerpts from Boris Pavlovic's message of 2013-07-19 07:52:55 -0700:
> Hi all,
> At Mirantis, Alexey Ovtchinnikov and I are working on nova scheduler
> improvements.
> As far as we can see, the scheduler currently has two major issues:
> 1) Scalability. Factors that contribute to bad scalability are these:
> *) Each compute node updates its resource state in the DB on every periodic
> task interval (60 sec by default).
> *) On every boot request the scheduler has to fetch information about all
> compute nodes from the DB.
> 2) Flexibility. Flexibility suffers due to problems with:
> *) Adding new complex resources (such as the big lists of complex objects
> required by PCI Passthrough,
> https://review.openstack.org/#/c/34644/5/nova/db/sqlalchemy/models.py)
> *) Using different sources of data in the scheduler, for example from cinder
> or ceilometer.
> (as required by Volume Affinity Filter
> https://review.openstack.org/#/c/29343/)
> We found a simple way to mitigate these issues by avoiding DB usage for
> host state storage.
> A more detailed discussion of the problem and one possible solution can be
> found here:
> https://docs.google.com/document/d/1_DRv7it_mwalEZzLy5WO92TJcummpmWL4NWsWf0UWiQ/edit#

This is really interesting work, thanks for sharing it with us. The
discussion that has followed has brought up some thoughts I've had for
a while about this choke point in what is supposed to be an extremely
scalable cloud platform (OpenStack).

I feel like the discussions have all been centered around making "the"
scheduler(s) intelligent.  There seems to be a commonly held belief that
scheduling is a single step, and should be done with as much knowledge
of the system as possible by a well-informed entity.

Can you name for me one large scale system that has a single entity,
human or computer, that knows everything about the system and can make
good decisions quickly?

This problem is screaming to be broken up, de-coupled, and distributed.

I keep asking myself these questions:

Why are all of the compute nodes informing all of the schedulers?

Why are all of the schedulers expecting to know about all of the compute nodes?

Can we break this problem up into simpler problems and distribute the load to
the entire system?

This has been bouncing around in my head for a while now, but as a
shallow observer of nova dev, I feel like there are some well known
scaling techniques which have not been brought up. Here is my idea, with a
rough sketch in code after the list; forgive me if I have glossed over
something or missed a huge hole:

* Schedulers break up compute nodes by hash table, only caring about
  those in their hash table.
* Schedulers, upon claiming a compute node via the hash table, poll that
  compute node directly for its information.
* Boot requests go into a fanout.
* Schedulers get a request and try to satisfy it using only their own
  compute nodes.
* Failure to find a fit results in re-insertion into the fanout.
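To make the first two bullets concrete, here is a minimal sketch of how a
scheduler could claim its slice of compute nodes with a consistent-hash
ring. All of the names (HashRing, my_compute_nodes, the scheduler ids) are
mine, purely for illustration; this is not actual or proposed Nova code:

import bisect
import hashlib


class HashRing(object):
    """Maps compute node names onto scheduler identities (illustrative)."""

    def __init__(self, scheduler_ids, replicas=128):
        # Each scheduler gets `replicas` points on the ring so the split
        # of compute nodes stays roughly even.
        self._ring = sorted(
            (self._hash('%s-%d' % (sched, i)), sched)
            for sched in scheduler_ids
            for i in range(replicas))

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)

    def owner(self, compute_node):
        """Return the scheduler responsible for this compute node."""
        key = self._hash(compute_node)
        idx = bisect.bisect(self._ring, (key,)) % len(self._ring)
        return self._ring[idx][1]


def my_compute_nodes(ring, my_id, all_compute_nodes):
    """The subset of compute nodes this scheduler polls directly."""
    return [node for node in all_compute_nodes if ring.owner(node) == my_id]


# Example: three schedulers splitting a hundred compute nodes.
ring = HashRing(['scheduler-1', 'scheduler-2', 'scheduler-3'])
mine = my_compute_nodes(ring, 'scheduler-2',
                        ['compute-%03d' % n for n in range(100)])

The nice property of a ring like this is that adding or removing a scheduler
only reassigns the compute nodes that hash near its points, so the polling
load stays spread out without any central coordination.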

This gives up the certainty that the scheduler will find a compute node
for a boot request on the first try. It is also possible that a request
gets unlucky and takes a long time to find the one scheduler that has
the one last "X" resource that it is looking for. There are some further
optimization strategies that can be employed (like queues based on the hashes
already tried, etc.).
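In the same hedged spirit, here is a sketch of the fanout/retry half of the
idea, including a placeholder for the "hashes already tried" optimization.
The Host and BootRequest classes and the fanout queue are stand-ins I made
up, not Nova's actual HostState or RPC layer:

class Host(object):
    """Stand-in for a compute node's directly polled state."""

    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb

    def fits(self, request):
        return self.free_ram_mb >= request.ram_mb

    def claim(self, request):
        self.free_ram_mb -= request.ram_mb


class BootRequest(object):
    def __init__(self, ram_mb):
        self.ram_mb = ram_mb
        self.tried_schedulers = set()   # the "hashes already tried" idea


def handle_boot_request(request, my_id, my_hosts, fanout):
    """Place the request on a host this scheduler owns, or re-insert it."""
    for host in my_hosts:
        if host.fits(request):
            host.claim(request)
            return host
    # None of this scheduler's slice fits the request: remember the failure
    # and push the request back onto the fanout so a scheduler owning a
    # different slice of compute nodes gets a chance.
    request.tried_schedulers.add(my_id)
    fanout.append(request)
    return None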

Anyway, I don't see any point in trying to hot-rod the intelligent
scheduler to go super fast, when we can just optimize for having many
many schedulers doing the same body of work without blocking and without
pounding a database.
