[openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

Nikola Đipanov ndipanov at redhat.com
Tue Mar 10 09:53:01 UTC 2015

On 03/06/2015 03:19 PM, Attila Fazekas wrote:
> Looks like we need some kind of _per compute node_ mutex in the critical section,
> multiple scheduler MAY be able to schedule to two compute node at same time,
> but not for scheduling to the same compute node.
> If we don't want to introduce another required component or
> reinvent the wheel there are some possible trick with the existing globally visible
> components like with the RDMS.
> `Randomized` destination choose is recommended in most of the possible solutions,
> alternatives are much more complex.
> One SQL example:
> * Add `sched_cnt`, defaul=0, Integer field; to a hypervisors related table.
> When the scheduler picks one (or multiple) node, he needs to verify is the node(s) are 
> still good before sending the message to the n-cpu.
> It can be done by re-reading the ONLY the picked hypervisor(s) related data.
> If the destination hyper-visors still OK:
> Increase the sched_cnt value exactly by 1,
> test is the UPDATE really update the required number of rows,
> the WHERE part needs to contain the previous value.
> You also need to update the resource usage on the hypervisor,
>  by the expected cost of the new vms.
> If at least one selected node was ok, the transaction can be COMMITed.
> If you were able to COMMIT the transaction, the relevant messages 
>  can be sent.
> The whole process needs to be repeated with the items which did not passed the
> post verification.
> If a message sending failed, `act like` migrating the vm to another host.
> If multiple scheduler tries to pick multiple different host in different order,
> it can lead to a DEADLOCK situation.
> Solution: Try to have all scheduler to acquire to Shared RW locks in the same order,
> at the end.
> Galera multi-writer (Active-Active) implication:
> As always, retry on deadlock. 
> n-sch + n-cpu crash at the same time:
> * If the scheduling is not finished properly, it might be fixed manually,
> or we need to solve which still alive scheduler instance is 
> responsible for fixing the particular scheduling..

So if I am reading the above correctly - you are basically proposing to
move claims to the scheduler (we would atomically check if there were
changes since the time we picked the host with the UPDATE .. WHERE using
LOCK IN SHARE MODE (assuming REPEATABLE READS is the used isolation
level) and then updating the usage, a.k.a doing the claim in the same

The issue here is that we still have a window between sending the
message, and the message getting picked up by the compute host (or
timing out) or the instance outright failing, so for sure we will need
to ack/nack the claim in some way on the compute side.

I believe something like this has come up before under the umbrella term
of "moving claims to the scheduler", and was discussed in some detail on
the latest Nova mid-cycle meetup, but only artifacts I could find were a
few lines on this etherpad Sylvain pointed me to [1] that I am copying here:

* White board the scheduler service interface
 ** note: this design won't change the existing way/logic of reconciling
nova db != hypervisor view
 ** gantt should just return claim ids, not entire claim objects
 ** claims are acked as being in use via the resource tracker updates
from nova-compute
 ** we still need scheduler retries for exceptional situations (admins
doing things outside openstack, hardware changes / failures)
 ** retry logic in conductor? probably a separate item/spec

As you can see - not much to go on (but that is material for a separate
thread that I may start soon).

The problem I have with this particular approach is that while it claims
to fix some of the races (and probably does) it does so by 1) turning
the current scheduling mechanism on it's head 2) and not providing any
thought into the trade-offs that it will make. For example, we may get
more correct scheduling in the general case and the correctness will not
be affected by the number of workers, but how does the fact that we now
do locking DB access on every request fare against the retry mechanism
for some of the more common usage patterns. What is the increased
overhead of calling back to he scheduler to confirm the claim? In the
end - how do we even measure that we are going in the right direction
with the new design.

I personally think that different workloads will have different needs
from the scheduler in terms of response times and tolerance to failure,
and that we need to design for that. So as an example a cloud operator
with very simple scheduling requirements may want to go for the no
locking approach and optimize for response times allowing for a small
number of instances to fail under high load/utilization due to retries,
while some others with more complicated scheduling requirements, or less
tolerance for data inconsistency might want to trade in response times
by doing locking claims in the scheduler. Some similar trade-offs and
how to deal with them are discussed in [2]


[1] https://etherpad.openstack.org/p/kilo-nova-midcycle
[2] http://research.google.com/pubs/pub41684.html

More information about the OpenStack-dev mailing list