[openstack-dev] A simple way to improve nova scheduler

Joshua Harlow harlowja at yahoo-inc.com
Tue Jul 23 22:38:21 UTC 2013

I like the idea, Clint.

It appears to me that the scheduler 'buckets' being established allow for
different kinds of policies around how accurate and how 'global' the
deployer wants scheduling to be (and those policies might differ from
deployer to deployer). All of these concerns get even more problematic
once you start doing cross-resource scheduling (volumes near compute
nodes), which I think is why there were proposals for a kind of unified
scheduling 'framework' (its own project?) that focuses on this type of
work. Such a project still seems appropriate in my mind (and is
desperately needed to handle the cross-resource scheduling concerns).

- https://etherpad.openstack.org/UnifiedResourcePlacement

I'm unsure what the nova folks (and folks on other projects that have
similar scheduling concepts) think about such a thing existing, but at the
last summit there was talk about possibly figuring out how to do that. It
is of course a looooot of refactoring (and cross-project refactoring) to
get there, but it seems like it would be very beneficial if all projects
involved with resource scheduling could use a single 'thing' to update
resource information and to ask for scheduling decisions. That is, a
caller provides a list of desired resources and gets back where those
resources are, in the form of a reservation on them that is committed
later, so that the resources are freed if the process asking for them
fails.
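
To make the reserve-then-commit flow concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions: the class
ResourcePlacement and its methods (update_resources, reserve, commit,
release, reservation, free) are hypothetical names, not an existing
OpenStack API, and all state lives in a single in-memory dict rather than
in a real service.

```python
import contextlib
import uuid


class InsufficientResources(Exception):
    """Raised when no known host can satisfy a request (hypothetical)."""


class ResourcePlacement:
    """Hypothetical in-memory stand-in for a unified placement service."""

    def __init__(self):
        self._free = {}          # host -> {resource name: free amount}
        self._reservations = {}  # reservation id -> (host, {name: amount})

    def update_resources(self, host, resources):
        """Projects (nova, cinder, ...) report a host's free capacity here."""
        self._free[host] = dict(resources)

    def free(self, host):
        """Return a copy of the free capacity currently recorded for a host."""
        return dict(self._free[host])

    def reserve(self, wanted):
        """Hold `wanted` on the first host that fits; return (id, host)."""
        for host, free in sorted(self._free.items()):
            if all(free.get(name, 0) >= amount
                   for name, amount in wanted.items()):
                for name, amount in wanted.items():
                    free[name] -= amount
                res_id = str(uuid.uuid4())
                self._reservations[res_id] = (host, dict(wanted))
                return res_id, host
        raise InsufficientResources(wanted)

    def commit(self, res_id):
        """Make a reservation permanent: the resources stay consumed."""
        del self._reservations[res_id]

    def release(self, res_id):
        """Undo a reservation and hand the resources back to the host."""
        host, wanted = self._reservations.pop(res_id)
        for name, amount in wanted.items():
            self._free[host][name] += amount

    @contextlib.contextmanager
    def reservation(self, wanted):
        """Reserve on entry; commit on success, release if the body raises."""
        res_id, host = self.reserve(wanted)
        try:
            yield host
        except Exception:
            self.release(res_id)
            raise
        else:
            self.commit(res_id)
```

A caller would wrap its provisioning work in `with
placement.reservation({"vcpus": 2, "volume_gb": 50}) as host:`. If the
body raises, the held resources go back to the host; otherwise the
reservation is committed.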


On 7/23/13 3:00 PM, "Clint Byrum" <clint at fewbar.com> wrote:

>Excerpts from Boris Pavlovic's message of 2013-07-19 07:52:55 -0700:
>> Hi all,
>> At Mirantis, Alexey Ovtchinnikov and I are working on nova scheduler
>> improvements.
>> As far as we can see, the scheduler currently has two major issues:
>> 1) Scalability. Factors that contribute to bad scalability are these:
>> *) Each compute node updates its resource state in the DB every
>> periodic task interval (60 sec by default).
>> *) On every boot request the scheduler has to fetch information about
>> all compute nodes from the DB.
>> 2) Flexibility. Flexibility suffers due to problems with:
>> *) Adding new complex resources (such as big lists of complex objects
>> required by PCI Passthrough
>> https://review.openstack.org/#/c/34644/5/nova/db/sqlalchemy/models.py)
>> *) Using different sources of data in the scheduler, for example from
>> cinder or ceilometer
>> (as required by Volume Affinity Filter
>> https://review.openstack.org/#/c/29343/)
>> We found a simple way to mitigate these issues by avoiding use of the
>> DB for host state storage.
>> A more detailed discussion of the problem state and one possible
>> solution can be found here:
>This is really interesting work, thanks for sharing it with us. The
>discussion that has followed has brought up some thoughts I've had for
>a while about this choke point in what is supposed to be an extremely
>scalable cloud platform (OpenStack).
>I feel like the discussions have all been centered around making "the"
>scheduler(s) intelligent.  There seems to be a commonly held belief that
>scheduling is a single step, and should be done with as much knowledge
>of the system as possible by a well informed entity.
>Can you name for me one large scale system that has a single entity,
>human or computer, that knows everything about the system and can make
>good decisions quickly?
>This problem is screaming to be broken up, de-coupled, and distributed.
>I keep asking myself these questions:
>Why are all of the compute nodes informing all of the schedulers?
>Why are all of the schedulers expecting to know about all of the compute
>nodes?
>Can we break this problem up into simpler problems and distribute the
>load to the entire system?
>This has been bouncing around in my head for a while now, but as a
>shallow observer of nova dev, I feel like there are some well known
>scaling techniques which have not been brought up. Here is my idea,
>forgive me if I have glossed over something or missed a huge hole:
>* Schedulers break up compute nodes by hash table, only caring about
>  those in their hash table.
>* Schedulers, upon claiming a compute node by hash table, poll compute
>  node directly for its information.
>* Requests to boot go into fanout.
>* Schedulers get request and try to satisfy using only their own compute
>  nodes.
>* Failure to boot results in re-insertion in the fanout.
>This gives up the certainty that the scheduler will find a compute node
>for a boot request on the first try. It is also possible that a request
>gets unlucky and takes a long time to find the one scheduler that has
>the one last "X" resource that it is looking for. There are some further
>optimization strategies that can be employed (like queues based on hashes
>already tried.. etc).
>Anyway, I don't see any point in trying to hot-rod the intelligent
>scheduler to go super fast, when we can just optimize for having many
>many schedulers doing the same body of work without blocking and without
>pounding a database.
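
The hash-partitioned scheduling idea quoted above can be sketched in a
few lines of Python. This is a single-process simulation under loud
assumptions, not nova code: owner_of, PartitionScheduler, and schedule
are hypothetical names, a plain dict of free RAM stands in for polling
compute nodes directly, and a deque stands in for the message fanout.
Re-insertion on failure is modelled by offering the request to the next
scheduler while remembering which ones were already tried.

```python
import hashlib
from collections import deque


def owner_of(node, scheduler_count):
    """Map a compute node name to a scheduler index with a stable hash."""
    digest = hashlib.md5(node.encode("utf-8")).hexdigest()
    return int(digest, 16) % scheduler_count


class PartitionScheduler:
    """Hypothetical scheduler that only knows the nodes hashing to it."""

    def __init__(self, index, scheduler_count, free_ram_by_node):
        self.index = index
        # Keep only this scheduler's share of the nodes. In a real system
        # the values would come from polling those nodes directly.
        self.nodes = {
            node: ram
            for node, ram in free_ram_by_node.items()
            if owner_of(node, scheduler_count) == index
        }

    def try_boot(self, ram_mb):
        """Claim RAM on one of our own nodes, or return None if none fits."""
        for node in sorted(self.nodes):
            if self.nodes[node] >= ram_mb:
                self.nodes[node] -= ram_mb
                return node
        return None


def schedule(schedulers, ram_mb):
    """Offer a request to each scheduler at most once.

    Returns (node, tried), where `tried` lists the scheduler indexes that
    saw the request, in order. node is None if every scheduler failed.
    """
    queue = deque(schedulers)
    tried = []
    while queue:
        scheduler = queue.popleft()
        tried.append(scheduler.index)
        node = scheduler.try_boot(ram_mb)
        if node is not None:
            return node, tried
        # Failure: the request goes back out to schedulers not yet tried.
    return None, tried
```

The length of `tried` shows the cost being traded away: a request for a
scarce resource may visit several schedulers before reaching the one that
owns the only node able to satisfy it.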