[openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

Nikola Đipanov ndipanov at redhat.com
Fri Mar 6 09:29:52 UTC 2015


On 03/06/2015 01:56 AM, Rui Chen wrote:
> Thank you very much for the in-depth discussion of this topic, @Nikola
> and @Sylvain.
> 
> I agree that we should pay down the technical debt first, and then make
> the scheduler better.
> 

That was not necessarily my point.

I would be happy to see work on how to make the scheduler less volatile
when run in parallel, but the solution must acknowledge the eventually
(or never really) consistent nature of the data the scheduler has to
operate on (in its current design; there is also the possibility of
offering an alternative design).
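
To make the stale-data point concrete, here is a rough sketch in Python.
It is not Nova code: HostView, ComputeHost, schedule() and claim() are
made-up names, and RAM is the only resource modelled. It only shows two
workers deciding on the same snapshot, with the compute host rejecting
the second claim (which is where a retry would kick in).

```python
from dataclasses import dataclass


@dataclass
class HostView:
    """One scheduler worker's private, possibly stale copy of a host's state."""
    free_ram_mb: int

    def consume(self, ram_mb: int) -> None:
        # Updates only this worker's in-memory view; nothing is shared.
        self.free_ram_mb -= ram_mb


class ComputeHost:
    """Hypothetical source of truth: the host accepts or rejects a claim."""

    def __init__(self, free_ram_mb: int) -> None:
        self.free_ram_mb = free_ram_mb

    def claim(self, ram_mb: int) -> bool:
        if ram_mb > self.free_ram_mb:
            return False  # the scheduling decision was made on stale data
        self.free_ram_mb -= ram_mb
        return True


def schedule(view: HostView, ram_mb: int) -> bool:
    """Select the host if the worker's own view says the instance fits."""
    if view.free_ram_mb < ram_mb:
        return False
    view.consume(ram_mb)
    return True


host = ComputeHost(free_ram_mb=4096)

# Both workers read the host state before either claim has landed.
worker_a = HostView(host.free_ram_mb)
worker_b = HostView(host.free_ram_mb)

# Each worker independently decides that a 3072 MB instance fits.
decisions = [schedule(worker_a, 3072), schedule(worker_b, 3072)]

# The claims then arrive at the host one after the other.
claims = [host.claim(3072), host.claim(3072)]
```

Both workers say yes, the host accepts only one of the two claims, and
the worker whose claim failed has no way of knowing that until the build
fails.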

I'd say that fixing the technical debt that is aimed at splitting the
scheduler out of Nova is a mostly orthogonal effort.

There have been several proposals in the past for how to make the
scheduler horizontally scalable and improve its performance. One that I
remember from the Atlanta summit time-frame was the work done by Boris
and his team [1] (they actually did some profiling and based their work
on the bottlenecks they found). There are also some nice ideas in the
bug lifeless filed [2] since this behaviour particularly impacts ironic.

N.

[1] https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
[2] https://bugs.launchpad.net/nova/+bug/1341420


> Best Regards.
> 
> 2015-03-05 21:12 GMT+08:00 Sylvain Bauza <sbauza at redhat.com>:
> 
> 
>     On 05/03/2015 13:00, Nikola Đipanov wrote:
> 
>         On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
> 
>             On 04/03/2015 04:51, Rui Chen wrote:
> 
>                 Hi all,
> 
>                 I want to make it easy to launch several scheduler
>                 processes on a host. Multiple scheduler workers would
>                 make use of the host's multiple processors and improve
>                 the performance of nova-scheduler.
> 
>                 I have registered a blueprint and committed a patch to
>                 implement it:
>                 https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
> 
>                 This patch has been applied in our performance
>                 environment and passed some test cases, such as booting
>                 multiple instances concurrently. So far we have not
>                 found any inconsistency issues.
> 
>                 IMO, nova-scheduler should be easy to scale
>                 horizontally, and multiple workers should be supported
>                 as an out-of-the-box feature.
> 
>                 Please feel free to discuss this feature, thanks.
> 
> 
>             As I said when reviewing your patch, I think the problem is
>             not just making sure that the scheduler is thread-safe. It
>             is more about how the Scheduler accounts for resources and
>             provides a retry if the consumed resources are higher than
>             what's available.
> 
>             Here, the main problem is that two workers can actually
>             consume two distinct resources on the same HostState object.
>             In that case, the HostState object is decremented by the
>             number of taken resources (modulo what that means for a
>             resource which is not an Integer...) for both, but nowhere
>             in that section does it check whether the resource usage is
>             exceeded. As I said, it's not just about adding a semaphore
>             decorator, it's more about rethinking how the Scheduler
>             manages its resources.
> 
> 
>             That's why I'm -1 on your patch until [1] gets merged. Once
>             this BP is implemented, we will have a set of classes for
>             managing heterogeneous types of resources and consuming
>             them, so it would be quite easy to provide a check against
>             them in the consume_from_instance() method.
> 
>         I feel that the above explanation does not give the full picture,
>         in addition to being factually incorrect in several places. I
>         have come to realize that the current behaviour of the scheduler
>         is subtle enough that just reading the code is not enough to
>         understand all the edge cases that can come up. The evidence is
>         that it trips up even people who have spent significant time
>         working on the code.
> 
>         It is also important to consider the design choices in terms of
>         the tradeoffs they were trying to make.
> 
>         So here are some facts about the way Nova schedules instances to
>         compute hosts, considering the amount of resources requested by
>         the flavor (we will try to put the facts into a bigger picture
>         later):
> 
>         * The scheduler receives a request to choose hosts for one or
>         more instances.
>         * Upon every request (_not_ for every instance, as there may be
>         several instances in a request) the scheduler learns the state
>         of the resources on all compute nodes from the central DB. This
>         state may be inaccurate (meaning out of date).
>         * Compute resources are updated by each compute host
>         periodically. This is done by updating the row in the DB.
>         * The wall-clock time difference between the scheduler deciding
>         to schedule an instance and the resource consumption being
>         reflected in the data the scheduler learns from the DB can be
>         arbitrarily long (due to load on the compute nodes and latency
>         of message arrival).
>         * To cope with the above, there is a concept of retrying a
>         request that fails on a certain compute node because the
>         scheduling decision was made with data that was stale at the
>         moment of build. By default we will retry 3 times before giving
>         up.
>         * When running multiple instances, decisions are made in a loop,
>         and an internal in-memory view of the resources gets updated
>         (the widely misunderstood consume_from_instance method is used
>         for this), so as to keep subsequent decisions as accurate as
>         possible. As was described above, this is all thrown away once
>         the request is finished.
> 
>         Now that we understand the above, we can start to consider what
>         changes when we introduce several concurrent scheduler processes.
> 
>         Several cases come to mind:
>         * Concurrent requests will no longer be serialized on reading the
>         state of all hosts (due to how eventlet interacts with the mysql
>         driver).
>         * In the presence of a single request for a large number of
>         instances there is going to be a drift in the accuracy of the
>         decisions made by other schedulers, as they will not have
>         accounted for any of the instances until they actually get
>         claimed on their respective hosts.
> 
>         All of the above limitations will likely not pose a problem under
>         normal load and usage, but can cause issues to start appearing
>         when nodes are close to full or when there is heavy load. Also,
>         this changes drastically based on how we actually choose to
>         utilize hosts (see a very interesting Ironic bug [1]).
> 
>         Whether any of the above matters to users depends heavily on
>         their use-case though. This is why I feel we should be providing
>         more information.
> 
>         Finally - I think it is important to accept that the scheduler
>         service will always have to operate under the assumption of stale
>         data, and build for that. Based on that I'd be happy to see real
>         work go into making multiple schedulers work well enough for most
>         common use-cases while providing a way forward for people who
>         need tighter bounds on the feedback loop.
> 
>         N.
> 
> 
>     Agreed 100% with everything in your email above. Thanks Nikola for
>     taking the time to explain how the Scheduler works, that's (btw.)
>     something I hope to present at the Vancouver Summit if my proposal
>     is accepted.
> 
>     That said, I hope my reviewers will understand that I would first
>     want to see the Scheduler split out into a separate repo before
>     working on fixing the race conditions you mentioned above. Yes, I
>     know, it's difficult to accept some limitations in the Nova
>     scheduler while many customers would want them fixed, but we have
>     so many technical debt issues that I think we should really work on
>     the split itself (like we did for Kilo and what we'll hopefully
>     work on for Liberty) and then discuss the new design after that.
> 
>     -Sylvain
> 
> 
> 
>         [1] https://bugs.launchpad.net/nova/+bug/1341420
> 
>         ______________________________________________________________________________
>         OpenStack Development Mailing List (not for usage questions)
>         Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>         http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 
> 
> 
