[openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

Attila Fazekas afazekas at redhat.com
Fri Mar 6 15:19:18 UTC 2015


It looks like we need some kind of _per compute node_ mutex in the critical section:
multiple schedulers MAY schedule to two different compute nodes at the same time,
but not to the same compute node.

If we don't want to introduce another required component or
reinvent the wheel, there are some possible tricks with the existing globally visible
components, such as the RDBMS.

`Randomized` destination choice is recommended in most of the possible solutions;
the alternatives are much more complex.
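
A minimal sketch of what a randomized choice could look like, assuming the
scheduler already has a list of (host, weight) pairs. The function name
`pick_host` and the `subset_size` parameter are made up for illustration and
are not taken from the Nova code:

```python
import random


def pick_host(weighed_hosts, subset_size=3, rng=random):
    """Pick one host at random among the `subset_size` best candidates.

    weighed_hosts: list of (host_name, weight) pairs, higher weight is better.
    Choosing randomly inside the top group makes it less likely that two
    schedulers working from the same stale data pick the same host.
    """
    if not weighed_hosts:
        raise ValueError("no candidate hosts")
    ranked = sorted(weighed_hosts, key=lambda hw: hw[1], reverse=True)
    top = ranked[:max(1, subset_size)]
    return rng.choice(top)[0]
```

With `subset_size=1` this degenerates to "always take the best host", which is
exactly the case where concurrent schedulers collide most often.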

One SQL example:

* Add a `sched_cnt` Integer field, default=0, to a hypervisor-related table.

When the scheduler picks one (or multiple) nodes, it needs to verify that the node(s) are
still good before sending the message to the n-cpu.

This can be done by re-reading ONLY the picked hypervisor(s) related data
with `LOCK IN SHARE MODE`.
If the destination hypervisors are still OK:

Increase the sched_cnt value by exactly 1, and
test whether the UPDATE really updated the required number of rows;
the WHERE part needs to contain the previous value.

You also need to update the resource usage on the hypervisor
 by the expected cost of the new VMs.

If at least one selected node was OK, the transaction can be COMMITted.
If you were able to COMMIT the transaction, the relevant messages
 can be sent.

The whole process needs to be repeated with the items which did not pass the
post-verification.
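
The steps above can be sketched roughly as follows. This is only an
illustration under several assumptions: the table name `hypervisors` and its
columns are hypothetical, and it uses SQLite through Python's `sqlite3` module,
which has no `LOCK IN SHARE MODE`, so the "is the host still good" re-check is
folded into the WHERE clause of the UPDATE instead of a separate locking read:

```python
import sqlite3

# Hypothetical schema, only for this sketch.
SCHEMA = """
CREATE TABLE hypervisors (
    id          INTEGER PRIMARY KEY,
    vcpus       INTEGER NOT NULL,
    ram_mb      INTEGER NOT NULL,
    vcpus_used  INTEGER NOT NULL DEFAULT 0,
    ram_used_mb INTEGER NOT NULL DEFAULT 0,
    sched_cnt   INTEGER NOT NULL DEFAULT 0
)
"""


def try_claim(conn, host_id, seen_cnt, vcpus, ram_mb):
    """Bump sched_cnt by exactly 1 and add the expected VM cost.

    The WHERE part contains the previously read sched_cnt, so the UPDATE
    touches zero rows if another scheduler got there first or if the host
    no longer has room.
    """
    cur = conn.execute(
        "UPDATE hypervisors "
        "SET sched_cnt = sched_cnt + 1, "
        "    vcpus_used = vcpus_used + ?, "
        "    ram_used_mb = ram_used_mb + ? "
        "WHERE id = ? AND sched_cnt = ? "
        "  AND vcpus_used + ? <= vcpus "
        "  AND ram_used_mb + ? <= ram_mb",
        (vcpus, ram_mb, host_id, seen_cnt, vcpus, ram_mb),
    )
    return cur.rowcount == 1


def schedule(conn, picks, vcpus, ram_mb):
    """picks maps host_id -> sched_cnt seen when the host was chosen.

    Returns (claimed, rejected). Messages may be sent for `claimed` once
    this returns; `rejected` has to go through the whole process again.
    """
    claimed, rejected = [], []
    with conn:  # one transaction: COMMIT on success, ROLLBACK on exception
        for host_id, seen_cnt in sorted(picks.items()):
            if try_claim(conn, host_id, seen_cnt, vcpus, ram_mb):
                claimed.append(host_id)
            else:
                rejected.append(host_id)
    return claimed, rejected
```

The hosts are visited in sorted id order here; that does not matter for SQLite
but it is the ordering trick that matters for the deadlock case with row locks.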

If sending a message failed, `act like` the VM is being migrated to another host.

If multiple schedulers try to pick multiple different hosts in different orders,
it can lead to a DEADLOCK situation.
Solution: have all schedulers acquire the shared RW locks in the same order,
at the end.
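
To illustrate the ordering idea only: in the sketch below, in-process
`threading.Lock` objects stand in for the per-row database locks, and the
helper names (`lock_for`, `with_hosts_locked`) are hypothetical. Whatever order
a scheduler picked its hosts in, the locks are always taken in sorted order, so
two workers can never each hold a lock the other one is waiting for:

```python
import threading
from contextlib import ExitStack

# Stand-in for per-row DB locks: one in-process lock per host id.
_host_locks = {}
_registry_guard = threading.Lock()


def lock_for(host_id):
    """Return the lock for a host, creating it on first use."""
    with _registry_guard:
        return _host_locks.setdefault(host_id, threading.Lock())


def with_hosts_locked(host_ids, action):
    """Acquire all host locks in sorted order, run action(), release all."""
    with ExitStack() as stack:
        for host_id in sorted(set(host_ids)):
            stack.enter_context(lock_for(host_id))
        return action()
```

Two workers calling `with_hosts_locked(["a", "b"], ...)` and
`with_hosts_locked(["b", "a"], ...)` both take "a" first and then "b".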

Galera multi-writer (Active-Active) implication:
As always, retry on deadlock. 
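
A retry wrapper could look roughly like this. It is a sketch, not what Nova
does: `DeadlockError` is a hypothetical stand-in for whatever exception the DB
layer raises on a deadlock, and the backoff numbers are arbitrary:

```python
import functools
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the DB layer's deadlock exception."""


def retry_on_deadlock(attempts=5, base_delay=0.05, sleep=time.sleep):
    """Re-run the whole transaction function when it hits a deadlock."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except DeadlockError:
                    if attempt == attempts:
                        raise  # out of attempts, let the caller see it
                    # Randomized backoff so the colliding writers do not
                    # retry in lockstep.
                    sleep(random.uniform(0, base_delay * 2 ** attempt))
        return wrapper
    return decorator
```

The decorated function has to contain the complete transaction (re-read,
UPDATE, COMMIT), because after a deadlock the previously read values are no
longer trustworthy.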

n-sch + n-cpu crash at the same time:
* If the scheduling did not finish properly, it might be fixed manually,
or we need to work out which still-alive scheduler instance is
responsible for fixing that particular scheduling.


----- Original Message -----
> From: "Nikola Đipanov" <ndipanov at redhat.com>
> To: openstack-dev at lists.openstack.org
> Sent: Friday, March 6, 2015 10:29:52 AM
> Subject: Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
> 
> On 03/06/2015 01:56 AM, Rui Chen wrote:
> > Thank you very much for in-depth discussion about this topic, @Nikola
> > and @Sylvain.
> > 
> > I agree that we should solve the technical debt firstly, and then make
> > the scheduler better.
> > 
> 
> That was not necessarily my point.
> 
> I would be happy to see work on how to make the scheduler less volatile
> when run in parallel, but the solution must acknowledge the eventually
> (or never really) consistent nature of the data the scheduler has to operate
> on (in its current design - there is also the possibility of offering
> an alternative design).
> 
> I'd say that fixing the technical debt that is aimed at splitting the
> scheduler out of Nova is a mostly orthogonal effort.
> 
> There have been several proposals in the past for how to make the
> scheduler horizontally scalable and improve its performance. One that I
> remember from the Atlanta summit time-frame was the work done by Boris
> and his team [1] (they actually did some profiling and based their work
> on the bottlenecks they found). There are also some nice ideas in the
> bug lifeless filed [2] since this behaviour particularly impacts ironic.
> 
> N.
> 
> [1] https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
> [2] https://bugs.launchpad.net/nova/+bug/1341420
> 
> 
> > Best Regards.
> > 
> > 2015-03-05 21:12 GMT+08:00 Sylvain Bauza <sbauza at redhat.com>:
> > 
> > 
> > >     On 05/03/2015 13:00, Nikola Đipanov wrote:
> > 
> >         On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
> > 
> > >             On 04/03/2015 04:51, Rui Chen wrote:
> > 
> >                 Hi all,
> > 
> >                 I want to make it easy to launch a bunch of scheduler
> >                 processes on a
> >                 host, multiple scheduler workers will make use of
> >                 multiple processors
> >                 of host and enhance the performance of nova-scheduler.
> > 
> > >                 I have registered a blueprint and committed a patch to
> > >                 implement it.
> > >                 https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
> > 
> > >                 This patch has been applied in our performance environment
> > >                 and passes some
> > >                 test cases, such as booting multiple instances concurrently;
> > >                 so far we
> > >                 haven't found any inconsistency issues.
> > > 
> > >                 IMO, nova-scheduler should be easy to scale
> > >                 horizontally, and
> > >                 multiple workers should be supported as an out-of-the-box
> > >                 feature.
> > 
> >                 Please feel free to discuss this feature, thanks.
> > 
> > 
> >             As I said when reviewing your patch, I think the problem is
> >             not just
> >             making sure that the scheduler is thread-safe, it's more
> >             about how the
> >             Scheduler is accounting resources and providing a retry if
> >             those
> >             consumed resources are higher than what's available.
> > 
> >             Here, the main problem is that two workers can actually
> >             consume two
> >             distinct resources on the same HostState object. In that
> >             case, the
> >             HostState object is decremented by the number of taken
> >             resources (modulo
> >             what means a resource which is not an Integer...) for both,
> >             but nowhere
> >             in that section, it does check that it overrides the
> >             resource usage. As
> >             I said, it's not just about decorating a semaphore, it's
> >             more about
> >             rethinking how the Scheduler is managing its resources.
> > 
> > 
> >             That's why I'm -1 on your patch until [1] gets merged. Once
> >             this BP will
> >             be implemented, we will have a set of classes for managing
> >             heterogeneous
> > >             types of resources and consume them, so it would be quite
> >             easy to provide
> >             a check against them in the consume_from_instance() method.
> > 
> >         I feel that the above explanation does not give the full picture in
> >         addition to being factually incorrect in several places. I have
> >         come to
> >         realize that the current behaviour of the scheduler is subtle
> >         enough
> >         that just reading the code is not enough to understand all the edge
> >         cases that can come up. The evidence being that it trips up even
> >         people
> >         that have spent significant time working on the code.
> > 
> >         It is also important to consider the design choices in terms of
> >         tradeoffs that they were trying to make.
> > 
> >         So here are some facts about the way Nova does scheduling of
> >         instances
> >         to compute hosts, considering the amount of resources requested
> >         by the
> >         flavor (we will try to put the facts into a bigger picture later):
> > 
> > >         * Scheduler receives request to choose hosts for one or more
> >         instances.
> >         * Upon every request (_not_ for every instance as there may be
> >         several
> >         instances in a request) the scheduler learns the state of the
> >         resources
> >         on all compute nodes from the central DB. This state may be
> >         inaccurate
> >         (meaning out of date).
> > >         * Compute resources are updated by each compute host
> >         periodically. This
> >         is done by updating the row in the DB.
> >         * The wall-clock time difference between the scheduler deciding to
> >         schedule an instance, and the resource consumption being
> >         reflected in
> >         the data the scheduler learns from the DB can be arbitrarily
> >         long (due
> >         to load on the compute nodes and latency of message arrival).
> >         * To cope with the above, there is a concept of retrying the
> >         request
> >         that fails on a certain compute node due to the scheduling decision
> >         being made with data stale at the moment of build, by default we
> >         will
> >         retry 3 times before giving up.
> >         * When running multiple instances, decisions are made in a loop,
> >         and
> >         internal in-memory view of the resources gets updated (the widely
> >         misunderstood consume_from_instance method is used for this), so
> >         as to
> >         keep subsequent decisions as accurate as possible. As was described
> >         above, this is all thrown away once the request is finished.
> > 
> >         Now that we understand the above, we can start to consider what
> >         changes
> >         when we introduce several concurrent scheduler processes.
> > 
> >         Several cases come to mind:
> >         * Concurrent requests will no longer be serialized on reading
> >         the state
> >         of all hosts (due to how eventlet interacts with mysql driver).
> >         * In the presence of a single request for a large number of
> >         instances
> >         there is going to be a drift in accuracy of the decisions made
> >         by other
> > >         schedulers as they will not have accounted for any of the
> >         instances
> >         until they actually get claimed on their respective hosts.
> > 
> >         All of the above limitations will likely not pose a problem
> >         under normal
> > >         load and usage but can cause issues to start appearing when
> >         nodes are
> >         close to full or when there is heavy load. Also this changes
> >         drastically
> > >         based on how we actually choose to utilize hosts (see a very
> >         interesting
> >         Ironic bug [1])
> > 
> > >         Whether any of the above matters to users depends heavily
> >         on their
> >         use-case though. This is why I feel we should be providing more
> >         information.
> > 
> >         Finally - I think it is important to accept that the scheduler
> >         service
> >         will always have to operate under the assumptions of stale data,
> >         and
> >         build for that. Based on that I'd be happy to see real work go into
> >         making multiple schedulers work well enough for most common
> >         use-cases
> >         while providing a way forward for people who need tighter bounds
> >         on the
> >         feedback loop.
> > 
> >         N.
> > 
> > 
> >     Agreed 100% with all your above email. Thanks Nikola for giving time
> >     on explaining how the Scheduler is working, that's (btw.) something
> >     I hope to be presenting for the Vancouver Summit if my proposal is
> >     accepted.
> > 
> >     That said, I hope my reviewers will understand that I would want to
> > >     first see the Scheduler split out into a separate repo
> >     before working on fixing the race conditions you mentioned above.
> >     Yes, I know, it's difficult to accept some limitations on the Nova
> >     scheduler while many customers would want them to be fixed, but here
> >     we have so many technical debt issues that I think we should really
> > >     work on the split itself (like we did for Kilo and as we'll
> > >     hopefully do for Liberty) and then discuss the new design
> > >     after that.
> > 
> >     -Sylvain
> > 
> > 
> > 
> > >         [1] https://bugs.launchpad.net/nova/+bug/1341420
> > 
> >         ______________________________________________________________________________
> >         OpenStack Development Mailing List (not for usage questions)
> > >         Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> > >         http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 
> 
