[openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
afazekas at redhat.com
Tue Mar 10 11:48:00 UTC 2015
----- Original Message -----
> From: "Nikola Đipanov" <ndipanov at redhat.com>
> To: openstack-dev at lists.openstack.org
> Sent: Tuesday, March 10, 2015 10:53:01 AM
> Subject: Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
> On 03/06/2015 03:19 PM, Attila Fazekas wrote:
> > Looks like we need some kind of _per compute node_ mutex in the critical
> > section,
> > multiple scheduler MAY be able to schedule to two compute node at same
> > time,
> > but not for scheduling to the same compute node.
> > If we don't want to introduce another required component or
> > reinvent the wheel there are some possible trick with the existing globally
> > visible
> > components like with the RDMS.
> > `Randomized` destination choose is recommended in most of the possible
> > solutions,
> > alternatives are much more complex.
> > One SQL example:
> > * Add `sched_cnt`, defaul=0, Integer field; to a hypervisors related table.
> > When the scheduler picks one (or multiple) node, he needs to verify is the
> > node(s) are
> > still good before sending the message to the n-cpu.
> > It can be done by re-reading the ONLY the picked hypervisor(s) related
> > data.
> > with `LOCK IN SHARE MODE`.
> > If the destination hyper-visors still OK:
> > Increase the sched_cnt value exactly by 1,
> > test is the UPDATE really update the required number of rows,
> > the WHERE part needs to contain the previous value.
> > You also need to update the resource usage on the hypervisor,
> > by the expected cost of the new vms.
> > If at least one selected node was ok, the transaction can be COMMITed.
> > If you were able to COMMIT the transaction, the relevant messages
> > can be sent.
> > The whole process needs to be repeated with the items which did not passed
> > the
> > post verification.
> > If a message sending failed, `act like` migrating the vm to another host.
> > If multiple scheduler tries to pick multiple different host in different
> > order,
> > it can lead to a DEADLOCK situation.
> > Solution: Try to have all scheduler to acquire to Shared RW locks in the
> > same order,
> > at the end.
> > Galera multi-writer (Active-Active) implication:
> > As always, retry on deadlock.
> > n-sch + n-cpu crash at the same time:
> > * If the scheduling is not finished properly, it might be fixed manually,
> > or we need to solve which still alive scheduler instance is
> > responsible for fixing the particular scheduling..
> So if I am reading the above correctly - you are basically proposing to
> move claims to the scheduler (we would atomically check if there were
> changes since the time we picked the host with the UPDATE .. WHERE using
> LOCK IN SHARE MODE (assuming REPEATABLE READS is the used isolation
> level) and then updating the usage, a.k.a doing the claim in the same
> The issue here is that we still have a window between sending the
> message, and the message getting picked up by the compute host (or
> timing out) or the instance outright failing, so for sure we will need
> to ack/nack the claim in some way on the compute side.
> I believe something like this has come up before under the umbrella term
> of "moving claims to the scheduler", and was discussed in some detail on
> the latest Nova mid-cycle meetup, but only artifacts I could find were a
> few lines on this etherpad Sylvain pointed me to  that I am copying here:
> * White board the scheduler service interface
> ** note: this design won't change the existing way/logic of reconciling
> nova db != hypervisor view
> ** gantt should just return claim ids, not entire claim objects
> ** claims are acked as being in use via the resource tracker updates
> from nova-compute
> ** we still need scheduler retries for exceptional situations (admins
> doing things outside openstack, hardware changes / failures)
> ** retry logic in conductor? probably a separate item/spec
> As you can see - not much to go on (but that is material for a separate
> thread that I may start soon).
In my example, the resource needs to be considered as used before we get
anything back from the compute.
The resource can be `freed` at error handling,
hopefully be migrating to another node.
> The problem I have with this particular approach is that while it claims
> to fix some of the races (and probably does) it does so by 1) turning
> the current scheduling mechanism on it's head 2) and not providing any
> thought into the trade-offs that it will make. For example, we may get
> more correct scheduling in the general case and the correctness will not
> be affected by the number of workers, but how does the fact that we now
> do locking DB access on every request fare against the retry mechanism
> for some of the more common usage patterns. What is the increased
> overhead of calling back to he scheduler to confirm the claim? In the
> end - how do we even measure that we are going in the right direction
> with the new design.
> I personally think that different workloads will have different needs
> from the scheduler in terms of response times and tolerance to failure,
> and that we need to design for that. So as an example a cloud operator
> with very simple scheduling requirements may want to go for the no
> locking approach and optimize for response times allowing for a small
> number of instances to fail under high load/utilization due to retries,
> while some others with more complicated scheduling requirements, or less
> tolerance for data inconsistency might want to trade in response times
> by doing locking claims in the scheduler. Some similar trade-offs and
> how to deal with them are discussed in 
The proposed locking is fine grained and optimistic.
Fine grained: just does the locking on the picked nodes,
so if scheduler A picks node1, and B picks node42 they will not interfere with
Optimistic because it does not acquire the locks at the beginning (for the time consuming selection),
it just gets them at the end for very short time.
The example did locked all ~ 10000 node at the beginning.
The simple way to avoiding to multiple scheduler picks the same node,
is adding some randomness.
Usually multiple destination can be considered as good decision.
Also possible to have the scheduler to pick multiple candidates at the first part,
and the final part just pick from like 10 candidates.
Very likely other possible solutions will not be liked, because
they need to do much more against BasicDesignTenets ,
in-order to be efficient.
Please remember the original problem was we cannot,
run something on 2 process safely.
Creating only one very efficient active process also something,
what can improve the throughput in the current situation.
>  https://etherpad.openstack.org/p/kilo-nova-midcycle
>  http://research.google.com/pubs/pub41684.html
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
More information about the OpenStack-dev