[openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

Attila Fazekas afazekas at redhat.com
Tue Mar 10 13:07:06 UTC 2015





----- Original Message -----
> From: "Attila Fazekas" <afazekas at redhat.com>
> To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
> Sent: Tuesday, March 10, 2015 12:48:00 PM
> Subject: Re: [openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler
> 
> 
> 
> 
> 
> ----- Original Message -----
> > From: "Nikola Đipanov" <ndipanov at redhat.com>
> > To: openstack-dev at lists.openstack.org
> > Sent: Tuesday, March 10, 2015 10:53:01 AM
> > Subject: Re: [openstack-dev] [nova] blueprint about multiple workers
> > supported in nova-scheduler
> > 
> > On 03/06/2015 03:19 PM, Attila Fazekas wrote:
> > > Looks like we need some kind of _per compute node_ mutex in the critical
> > > section: multiple schedulers MAY schedule to two different compute nodes
> > > at the same time, but not to the same compute node.
> > > 
> > > If we don't want to introduce another required component or
> > > reinvent the wheel, there are some possible tricks with the existing
> > > globally visible components, such as the RDBMS.
> > > 
> > > `Randomized` destination choice is recommended in most of the possible
> > > solutions; the alternatives are much more complex.
> > > 
> > > One SQL example:
> > > 
> > > * Add a `sched_cnt` Integer field (default=0) to the hypervisor-related
> > > table.
> > > 
> > > When the scheduler picks one (or more) nodes, it needs to verify that the
> > > node(s) are still good before sending the message to n-cpu.
> > > 
> > > This can be done by re-reading ONLY the picked hypervisor(s)' related
> > > data with `LOCK IN SHARE MODE`.
> > > If the destination hypervisors are still OK:
> > > 
> > > Increase the sched_cnt value by exactly 1 and
> > > test whether the UPDATE really updated the required number of rows;
> > > the WHERE clause needs to contain the previously read value.
> > > 
> > > You also need to update the resource usage on the hypervisor
> > > by the expected cost of the new VMs.
> > > 
> > > If at least one selected node was OK, the transaction can be COMMITted.
> > > If you were able to COMMIT the transaction, the relevant messages
> > > can be sent.
> > > 
> > > The whole process needs to be repeated with the items which did not pass
> > > the post-verification.
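
To make the transaction described above more concrete, here is a rough
sketch in Python with raw SQL. The table and column names (compute_nodes,
sched_cnt, free_ram_mb) are illustrative only, not the actual Nova schema,
and the DB-API connection is assumed to run with autocommit off:

    def try_claim(conn, node_id, ram_mb):
        cur = conn.cursor()
        # Re-read ONLY the picked node's row, locking it in share mode.
        cur.execute("SELECT sched_cnt, free_ram_mb FROM compute_nodes "
                    "WHERE id = %s LOCK IN SHARE MODE", (node_id,))
        sched_cnt, free_ram_mb = cur.fetchone()
        if free_ram_mb < ram_mb:
            conn.rollback()
            return False      # node is no longer good, try another one
        # Compare-and-swap as described above: the WHERE clause carries the
        # previously read sched_cnt, so the UPDATE touches 0 rows if the
        # counter was bumped in the meantime.
        cur.execute("UPDATE compute_nodes "
                    "SET sched_cnt = sched_cnt + 1, "
                    "    free_ram_mb = free_ram_mb - %s "
                    "WHERE id = %s AND sched_cnt = %s",
                    (ram_mb, node_id, sched_cnt))
        if cur.rowcount != 1:
            conn.rollback()
            return False      # lost the race, pick another node
        conn.commit()
        return True           # safe to send the message to n-cpu
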
> > > 
> > > If sending a message failed, `act like` migrating the VM to another host.
> > > 
> > > If multiple schedulers try to pick multiple different hosts in different
> > > orders, it can lead to a DEADLOCK situation.
> > > Solution: have all schedulers acquire the shared RW locks in the same
> > > order at the end.
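
For the multi-node case that just means sorting the picked node ids before
taking the share locks, e.g. (illustrative only, reusing the table name from
the sketch above):

    def lock_picked_nodes(cur, picked_node_ids):
        # Lock the rows of all picked nodes in a globally consistent order
        # (ascending id) so two schedulers cannot end up in an ABBA deadlock.
        for node_id in sorted(picked_node_ids):
            cur.execute("SELECT sched_cnt, free_ram_mb FROM compute_nodes "
                        "WHERE id = %s LOCK IN SHARE MODE", (node_id,))
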
> > > 
> > > Galera multi-writer (Active-Active) implication:
> > > As always, retry on deadlock.
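
With Galera, one of the conflicting writers is aborted with a deadlock
error, so the whole claim transaction can simply be wrapped in a retry loop,
roughly like this (DeadlockError is a stand-in for whatever exception the DB
driver raises for MySQL error 1213 / a wsrep abort):

    class DeadlockError(Exception):
        """Stand-in for the DB driver's deadlock/wsrep-abort exception."""

    def claim_with_retry(conn, node_id, ram_mb, attempts=5):
        # Retry the whole optimistic claim a few times on deadlock.
        for _ in range(attempts):
            try:
                return try_claim(conn, node_id, ram_mb)
            except DeadlockError:
                conn.rollback()
        return False
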
> > > 
> > > n-sch + n-cpu crash at the same time:
> > > * If the scheduling does not finish properly, it might be fixed manually,
> > > or we need to work out which still-alive scheduler instance is
> > > responsible for fixing that particular scheduling.
> > > 
> > 
> > So if I am reading the above correctly - you are basically proposing to
> > move claims to the scheduler: we would atomically check whether there were
> > changes since the time we picked the host, with the UPDATE .. WHERE using
> > LOCK IN SHARE MODE (assuming REPEATABLE READ is the isolation level used),
> > and then update the usage, a.k.a. do the claim in the same transaction.
> > 
> > The issue here is that we still have a window between sending the
> > message, and the message getting picked up by the compute host (or
> > timing out) or the instance outright failing, so for sure we will need
> > to ack/nack the claim in some way on the compute side.
> > 
> > I believe something like this has come up before under the umbrella term
> > of "moving claims to the scheduler", and was discussed in some detail at
> > the latest Nova mid-cycle meetup, but the only artifacts I could find were
> > a few lines on the etherpad Sylvain pointed me to [1], which I am copying
> > here:
> > 
> >
> > """
> > * White board the scheduler service interface
> >  ** note: this design won't change the existing way/logic of reconciling
> > nova db != hypervisor view
> >  ** gantt should just return claim ids, not entire claim objects
> >  ** claims are acked as being in use via the resource tracker updates
> > from nova-compute
> >  ** we still need scheduler retries for exceptional situations (admins
> > doing things outside openstack, hardware changes / failures)
> >  ** retry logic in conductor? probably a separate item/spec
> > """
> > 
> > As you can see - not much to go on (but that is material for a separate
> > thread that I may start soon).
> >
> In my example, the resource needs to be considered used before we get
> anything back from the compute.
> The resource can be `freed` during error handling,
> hopefully by migrating to another node.
>  
> > The problem I have with this particular approach is that while it claims
> > to fix some of the races (and probably does), it does so by 1) turning
> > the current scheduling mechanism on its head and 2) not giving any
> > thought to the trade-offs that it makes. For example, we may get
> > more correct scheduling in the general case, and the correctness will not
> > be affected by the number of workers, but how does the fact that we now
> > do locking DB access on every request fare against the retry mechanism
> > for some of the more common usage patterns? What is the increased
> > overhead of calling back to the scheduler to confirm the claim? In the
> > end - how do we even measure that we are going in the right direction
> > with the new design?
> > 
> > I personally think that different workloads will have different needs
> > from the scheduler in terms of response times and tolerance to failure,
> > and that we need to design for that. So, as an example, a cloud operator
> > with very simple scheduling requirements may want to go for the no-locking
> > approach and optimize for response times, allowing a small number of
> > instances to fail under high load/utilization due to retries, while others
> > with more complicated scheduling requirements, or less tolerance for data
> > inconsistency, might want to trade away some response time by doing
> > locking claims in the scheduler. Some similar trade-offs and how to deal
> > with them are discussed in [2].
> > 
> 
> The proposed locking is fine-grained and optimistic.
> Fine-grained: it only locks the picked nodes,
> so if scheduler A picks node1 and B picks node42, they will not interfere
> with each other.
> Optimistic: it does not acquire the locks at the beginning (for the
> time-consuming selection); it just takes them at the end, for a very short
> time.
> 
> The example did lock all ~10000 nodes at the beginning.
Correction: the example did NOT lock all ~10000 nodes at the beginning.
> 
> The simple way to avoid multiple schedulers picking the same node
> is to add some randomness.
> Usually multiple destinations can be considered a good decision.
> 
> It is also possible to have the scheduler pick multiple candidates in the
> first part, and in the final part just pick from, say, 10 candidates.
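
Roughly (illustrative only; weighed_hosts is assumed to be the already
filtered and weighed candidate list, best first):

    import random

    def pick_host(weighed_hosts, top_n=10):
        # Instead of every scheduler deterministically taking the single
        # "best" host (and colliding on it), pick randomly among the top
        # few candidates.
        return random.choice(weighed_hosts[:top_n])
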
> 
> Very likely the other possible solutions will not be liked, because
> they need to go much more against the BasicDesignTenets [3]
> in order to be efficient.
> 
> Please remember, the original problem was that we cannot
> run something in 2 processes safely.
> 
> Creating only one, very efficient, active process is also something
> that can improve the throughput in the current situation.
> 
> 
> [3] https://wiki.openstack.org/wiki/BasicDesignTenets
> > N.
> > 
> > [1] https://etherpad.openstack.org/p/kilo-nova-midcycle
> > [2] http://research.google.com/pubs/pub41684.html
> > 
> 


