[openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

Nikola Đipanov ndipanov at redhat.com
Thu Mar 5 12:00:16 UTC 2015

On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
> On 04/03/2015 04:51, Rui Chen wrote:
>> Hi all,
>> I want to make it easy to launch a bunch of scheduler processes on a
>> host. Multiple scheduler workers will make use of the host's multiple
>> processors and improve the performance of nova-scheduler.
>> I have registered a blueprint and committed a patch to implement it.
>> https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
>> This patch has been applied in our performance environment and passes
>> some test cases, such as booting multiple instances concurrently. So
>> far we have not found any inconsistency issues.
>> IMO, nova-scheduler should be easy to scale horizontally, and
>> multiple workers should be supported as an out-of-the-box feature.
>> Please feel free to discuss this feature, thanks.
> As I said when reviewing your patch, I think the problem is not just
> making sure that the scheduler is thread-safe, it's more about how the
> Scheduler is accounting resources and providing a retry if those
> consumed resources are higher than what's available.
> Here, the main problem is that two workers can actually consume two
> distinct resources on the same HostState object. In that case, the
> HostState object is decremented by the number of taken resources (modulo
> what that means for a resource which is not an Integer...) for both, but
> nowhere in that section does it check that it overrides the resource usage. As
> I said, it's not just about decorating a semaphore, it's more about
> rethinking how the Scheduler is managing its resources.
> That's why I'm -1 on your patch until [1] gets merged. Once this BP is
> implemented, we will have a set of classes for managing heterogeneous
> types of resources and consuming them, so it would be quite easy to
> provide a check against them in the consume_from_instance() method.

I feel that the above explanation does not give the full picture, and
it is factually incorrect in several places. I have come to realize
that the current behaviour of the scheduler is subtle enough that just
reading the code is not enough to understand all the edge cases that
can come up. The evidence is that it trips up even people who have
spent significant time working on the code.

It is also important to consider the design choices in terms of the
tradeoffs that they were trying to make.

So here are some facts about the way Nova does scheduling of instances
to compute hosts, considering the amount of resources requested by the
flavor (we will try to put the facts into a bigger picture later):

* The scheduler receives a request to choose hosts for one or more
instances.
* Upon every request (_not_ for every instance as there may be several
instances in a request) the scheduler learns the state of the resources
on all compute nodes from the central DB. This state may be inaccurate
(meaning out of date).
* Compute resources are updated by each compute host periodically. This
is done by updating the row in the DB.
* The wall-clock time difference between the scheduler deciding to
schedule an instance, and the resource consumption being reflected in
the data the scheduler learns from the DB can be arbitrarily long (due
to load on the compute nodes and latency of message arrival).
* To cope with the above, there is a concept of retrying a request
that fails on a certain compute node because the scheduling decision
was made with data that was stale at the moment of build. By default we
will retry 3 times before giving up.
* When a request is for multiple instances, decisions are made in a
loop, and an internal in-memory view of the resources gets updated (the
widely misunderstood consume_from_instance method is used for this), so
as to keep subsequent decisions as accurate as possible. As was
described above, this is all thrown away once the request is finished.
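
To make the last point concrete, here is a rough Python sketch of the
per-request loop. It is not the actual Nova code: the HostState class,
the RAM-only accounting and the "most free RAM wins" choice are all
simplifying assumptions made up for illustration. Only the method name
consume_from_instance is taken from the real scheduler.

```python
import copy


class HostState:
    """Hypothetical, simplified stand-in for the scheduler's per-host view."""

    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb

    def consume_from_instance(self, ram_mb):
        # Only the in-memory copy changes; nothing is written to the DB.
        self.free_ram_mb -= ram_mb


def schedule_request(db_hosts, ram_mb, num_instances):
    """Pick a host for each instance of a single request."""
    # One snapshot per request, not per instance. It may already be stale.
    view = [copy.copy(host) for host in db_hosts]
    chosen = []
    for _ in range(num_instances):
        candidates = [h for h in view if h.free_ram_mb >= ram_mb]
        if not candidates:
            break
        # Assumed policy for the sketch: prefer the host with most free RAM.
        best = max(candidates, key=lambda h: h.free_ram_mb)
        best.consume_from_instance(ram_mb)
        chosen.append(best.name)
    # The snapshot goes out of scope here, so the accounting is thrown away.
    return chosen


db_hosts = [HostState("node1", 4096), HostState("node2", 3072)]
placements = schedule_request(db_hosts, ram_mb=2048, num_instances=3)
```

The third instance lands back on node1 only because the first two
decisions were accounted for in the snapshot. The db_hosts entries are
untouched afterwards, which stands in for the DB row that only the
compute host itself updates.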

Now that we understand the above, we can start to consider what changes
when we introduce several concurrent scheduler processes.

Several cases come to mind:
* Concurrent requests will no longer be serialized on reading the state
of all hosts (due to how eventlet interacts with the MySQL driver).
* In the presence of a single request for a large number of instances
there is going to be a drift in accuracy of the decisions made by other
schedulers, as they will not have accounted for any of the instances
until they actually get claimed on their respective hosts.
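
A toy illustration of that drift, again as a hedged sketch rather than
real Nova code: ComputeNode, pick_host and build_with_retries are
hypothetical names, and the retry here simply excludes hosts that were
already tried, which is a simplification of the actual retry path. The
limit of three attempts loosely mirrors the default mentioned earlier.

```python
class ComputeNode:
    """Hypothetical compute host owning the authoritative resource count."""

    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb

    def claim(self, ram_mb):
        # The claim on the host is the only place where overcommit is caught.
        if ram_mb > self.free_ram_mb:
            return False
        self.free_ram_mb -= ram_mb
        return True


def pick_host(snapshot, ram_mb, excluded):
    """Choose the host with the most free RAM according to the snapshot."""
    candidates = {
        name: free
        for name, free in snapshot.items()
        if free >= ram_mb and name not in excluded
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def build_with_retries(nodes, snapshot, ram_mb, max_attempts=3):
    """Schedule from a possibly stale snapshot, retrying on failed claims."""
    tried = []
    for _ in range(max_attempts):
        name = pick_host(snapshot, ram_mb, tried)
        if name is None:
            break
        tried.append(name)
        if nodes[name].claim(ram_mb):
            return name, tried
    return None, tried


nodes = {
    "node1": ComputeNode("node1", 2048),
    "node2": ComputeNode("node2", 1024),
}
# Both workers read the state before either claim is reflected in it.
snapshot_a = {name: node.free_ram_mb for name, node in nodes.items()}
snapshot_b = dict(snapshot_a)

result_a = build_with_retries(nodes, snapshot_a, ram_mb=2048)
result_b = build_with_retries(nodes, snapshot_b, ram_mb=1024)
```

Worker B still believes node1 has 2048 MB free, so its first choice
fails at claim time and the instance only lands on node2 after a retry.
With nearly full hosts or many concurrent requests, such retries become
more frequent and can exhaust the attempt limit.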

All of the above limitations will likely not pose a problem under normal
load and usage, but can cause issues to start appearing when nodes are
close to full or when there is heavy load. This also changes drastically
based on how we actually choose to utilize hosts (see a very interesting
Ironic bug [1]).

Whether any of the above matters to users depends heavily on their
use-case though. This is why I feel we should be providing more information.

Finally - I think it is important to accept that the scheduler service
will always have to operate under the assumption of stale data, and
build for that. Based on that, I'd be happy to see real work go into
making multiple schedulers work well enough for the most common
use-cases while providing a way forward for people who need tighter
bounds on the feedback loop.


[1] https://bugs.launchpad.net/nova/+bug/1341420
