[openstack-dev] Scheduler proposal

Ian Wells ijw.ubuntu at cack.org.uk
Thu Oct 8 01:23:43 UTC 2015

On 7 October 2015 at 16:00, Chris Friesen <chris.friesen at windriver.com> wrote:

> 1) Some resources (RAM) only require tracking amounts.  Other resources
> (CPUs, PCI devices) require tracking allocation of specific individual host
> resources (for CPU pinning, PCI device allocation, etc.).  Presumably for
> the latter we would have to actually do the allocation of resources at the
> time of the scheduling operation in order to update the database with the
> claimed resources in a race-free way.

The whole process is inherently racy (and this is inevitable, and correct),
which is why the scheduler works the way it does:

- scheduler guesses at a host based on (guaranteed - hello distributed
systems!) outdated information
- VM is scheduled to a host that looks like it might work, and host
attempts to run it
- VM run may fail (because the information was outdated or has become
outdated), in which case we retry the schedule
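To make the loop above concrete, here is a minimal sketch of the guess / attempt / retry pattern. It is not Nova code: the Host class, try_claim(), schedule() and the stale_view dict are all names invented for illustration, and RAM stands in for any resource. The point is only that the authoritative check happens on the host, and the scheduler's view is allowed to be wrong.

```python
import random


class Host:
    """Hypothetical compute host that owns the truth about its own resources."""

    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb  # real, current state

    def try_claim(self, ram_mb):
        # Authoritative check, made locally on the host at run time.
        if ram_mb > self.free_ram_mb:
            return False
        self.free_ram_mb -= ram_mb
        return True


def schedule(ram_mb, hosts, stale_view, max_attempts=3):
    """Pick a host from a possibly outdated view; retry elsewhere on failure.

    stale_view maps host name -> free RAM as the scheduler last heard it.
    Returns the name of the host that accepted the VM, or None.
    """
    tried = set()
    for _ in range(max_attempts):
        candidates = [
            h for h in hosts
            if h.name not in tried and stale_view.get(h.name, 0) >= ram_mb
        ]
        if not candidates:
            return None
        host = random.choice(candidates)  # the guess
        tried.add(host.name)
        if host.try_claim(ram_mb):        # the host decides for real
            return host.name
        # The guess was wrong (view was outdated): loop and retry.
    return None
```

If the view claims a host has room and it does not, the claim fails on that host and the request simply moves on to another candidate.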

In fact, with PCI devices the code has been written rather carefully to
make sure that they fit into this model.  There is central per-device
tracking (which, fwiw, I argued against back in the day) but that's not how
allocation works (or, considering how long it is since I looked, worked).

PCI devices are actually allocated from pools of equivalent devices, and
allocation works in the same manner as other scheduling: you work out from
the nova boot call what constraints a host must satisfy (in this case, the
number of PCI devices needed from specific pools), you check your best guess at
global host state against those constraints, and you pick one of the hosts
that meets the constraints to schedule on.
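As a rough illustration of the pool check described above (a sketch, not the actual Nova filter): assume each host reports a dict of free device counts keyed by some pool identifier, and the request is a dict of needed counts with the same keys. The function names and the (vendor, product) key shape are my assumptions here.

```python
def pools_satisfy(requests, pools):
    """True if every requested pool has at least the requested free count.

    requests: {pool_key: devices needed}
    pools:    {pool_key: devices believed free on one host}
    """
    return all(pools.get(key, 0) >= count for key, count in requests.items())


def filter_hosts(requests, host_states):
    """Return the hosts whose (best-guess) pool counts meet the request."""
    return sorted(
        name for name, pools in host_states.items()
        if pools_satisfy(requests, pools)
    )


# Illustrative values only; the keys are made-up (vendor, product) pairs.
host_states = {
    "host1": {("8086", "10fb"): 2},
    "host2": {("8086", "10fb"): 0, ("15b3", "1004"): 4},
    "host3": {},
}
wanted = {("8086", "10fb"): 1}
eligible = filter_hosts(wanted, host_states)
```

Note that nothing here names an individual device: the scheduler only compares counts, and picking the actual device is left to the host.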

So: yes, there is a central registry of devices, which we try to keep up to
date - but this is for admins to refer to, it's not a necessity of
scheduling.  The scheduler input is the pool counts, which are scheduled
against and updated in largely the same way as available memory.

No idea on CPUs, sorry, but again I'm not sure why the behaviour would be
any different: compare suspected host state against needs, schedule if it
fits, hope you got it right and tolerate it if you didn't.

That being the case, it's worth noting that the database can be eventually
consistent and doesn't need to be transactional.  It's also worth
considering that the database can have multiple (mutually inconsistent)
copies.  There's no need to use a central datastore if you don't want to -
one theoretical example is to run multiple schedulers and let each
scheduler attempt to collate cloud state from unreliable messages from the
compute hosts.  This is not quite what happens today, because messages we
send over Rabbit are reliable and therefore costly.
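
A sketch of what that theoretical example might look like, under assumptions of my own: each compute host stamps its reports with a per-host sequence number, and each scheduler keeps only the newest report it has seen. SchedulerView and on_report() are hypothetical names, not anything in Nova. Lost, duplicated or reordered messages are tolerated because a later report supersedes everything before it.

```python
class SchedulerView:
    """One scheduler's private, eventually consistent picture of the cloud."""

    def __init__(self):
        self._state = {}  # host name -> (sequence number, resources dict)

    def on_report(self, host, seq, resources):
        """Apply a report unless we already hold one at least as new.

        Returns True if the view changed, False if the report was stale
        or a duplicate and was ignored.
        """
        current = self._state.get(host)
        if current is not None and current[0] >= seq:
            return False
        self._state[host] = (seq, dict(resources))
        return True

    def resources(self, host):
        """Best current guess for a host, or None if never heard from."""
        entry = self._state.get(host)
        return entry[1] if entry else None
```

Two schedulers fed from the same hosts can disagree for a while (one missed a message, say), and that is fine: each schedules on its own guess, and the host has the final word when the VM actually lands.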