[openstack-dev] Scheduler proposal

Ed Leafe ed at leafe.com
Thu Oct 8 02:53:55 UTC 2015

On Oct 7, 2015, at 6:00 PM, Chris Friesen <chris.friesen at windriver.com> wrote:

> I've wondered for a while (ever since I looked at the scheduler code, really) why we couldn't implement more of the scheduler as database transactions.
> I haven't used Cassandra, so maybe you can clarify something about updates across a distributed DB.  I just read up on lightweight transactions, and it says that they're restricted to a single partition.  Is that an acceptable limitation for this usage?

That's an implementation detail. A partition is defined by the partition key, not by any physical arrangement of nodes. The partition key would have to depend on the resource type, plus whatever other columns are needed to make such a query unique.
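
As a rough illustration of that point, here is a minimal sketch that groups rows by a partition key. The column names (`resource_type`, `host`) and the sample rows are hypothetical and not taken from any real schema. It only shows that rows sharing the same key values form one partition, so a claim that touches rows from a single partition would stay within the single-partition restriction on lightweight transactions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical partition key: resource type plus the host that owns the resource.
PARTITION_KEY = ("resource_type", "host")


def partition_of(row: Dict[str, object]) -> Tuple[object, ...]:
    """Return the partition key values for a row."""
    return tuple(row[col] for col in PARTITION_KEY)


def single_partition(rows: List[Dict[str, object]]) -> bool:
    """True if every row a claim would touch lives in the same partition."""
    return len({partition_of(row) for row in rows}) == 1


# Made-up sample data, for illustration only.
rows = [
    {"resource_type": "ram", "host": "host-a", "free": 8192},
    {"resource_type": "ram", "host": "host-b", "free": 6144},
    {"resource_type": "pci", "host": "host-a", "device": "0000:81:00.0"},
    {"resource_type": "pci", "host": "host-a", "device": "0000:81:00.1"},
]

partitions = defaultdict(list)
for row in rows:
    partitions[partition_of(row)].append(row)

for key, members in partitions.items():
    print(key, len(members))
```

Both PCI rows for host-a land in the same partition here, while the RAM rows for host-a and host-b land in different ones.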

> Some points that might warrant further discussion:
> 1) Some resources (RAM) only require tracking amounts.  Other resources (CPUs, PCI devices) require tracking allocation of specific individual host resources (for CPU pinning, PCI device allocation, etc.).  Presumably for the latter we would have to actually do the allocation of resources at the time of the scheduling operation in order to update the database with the claimed resources in a race-free way.

Yes, that's correct. A lot of thought would have to go into how best to represent these different types of resources. I have ideas about that, but I would feel a whole lot better defining them only after talking them over with others who understand the underlying concepts better than I do.

> 2) Are you suggesting that all of nova switch to Cassandra, or just the scheduler and resource tracking portions?  If the latter, how would we handle things like pinned CPUs and PCI devices that are currently associated with specific instances in the nova DB?

I am only thinking of the scheduler as a separate service. Perhaps Nova as a whole might benefit from switching to Cassandra for its database needs, but I haven't really thought about that at all.

> 3) The concept of the compute node updating the DB when things change is really orthogonal to the new scheduling model.  The current scheduling model would benefit from that as well.

Actually, it isn't that different. Compute nodes already send updates to the scheduler when instances are created, deleted, resized, etc., so this isn't much of a stretch.

> 4) It seems to me that to avoid races we need to do one of the following.  Which are you proposing?
> a) Serialize the entire scheduling operation so that only one instance can schedule at once.
> b) Make the evaluation of filters and claiming of resources a single atomic DB transaction.
> c) Do a loop where we evaluate the filters, pick a destination, try to claim the resources in the DB, and retry the whole thing if the resources have already been claimed.

Probably a combination of b) and c). Filters would, for lack of a better term, add CQL WHERE clauses to the query, which would return a set of acceptable hosts. Weighers would order these hosts in terms of desirability, and then the claim would be attempted. If the claim failed because the host had changed, the next acceptable host would be selected, and so on. I don't imagine that "retrying the whole thing" would be an efficient option, unless there were no other acceptable hosts returned from the original filtering query.
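
The idea of filters contributing WHERE clauses could be sketched roughly as follows. This is only an illustration: the `Filter` class, the `host_resources` table, and its columns are hypothetical, and the code just builds a query string and parameter list without talking to a database. Whether Cassandra would accept such predicates as written depends on the schema, since CQL generally restricts WHERE clauses to key or indexed columns unless ALLOW FILTERING is used.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass(frozen=True)
class Filter:
    """One scheduler filter expressed as a single predicate (hypothetical)."""
    clause: str  # predicate with one placeholder, e.g. "free_ram_mb >= %s"
    value: Any   # parameter bound to that placeholder


def build_host_query(table: str, filters: List[Filter]) -> Tuple[str, list]:
    """Combine every filter's predicate into one SELECT statement."""
    query = f"SELECT host, free_ram_mb, free_disk_gb FROM {table}"
    if filters:
        query += " WHERE " + " AND ".join(f.clause for f in filters)
    return query, [f.value for f in filters]


# Hypothetical filters for a request needing 2 GB of RAM and 20 GB of disk.
filters = [
    Filter("resource_type = %s", "compute"),
    Filter("free_ram_mb >= %s", 2048),
    Filter("free_disk_gb >= %s", 20),
]

query, params = build_host_query("host_resources", filters)
print(query)
print(params)
```

Keeping each filter as a clause plus a bound parameter means adding a filter narrows the same single query instead of adding another pass over the host list in Python.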

Put another way: if we are in a racy situation, and two scheduler processes are trying to place a similar instance, both processes would most likely come up with the same set of hosts ordered in the same way. One of those processes would "win", and claim the first choice. The other would fail the transaction, and would then claim the second choice on the list. IMO, this is how you best deal with race conditions.
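
The race described above can be simulated with a small in-memory sketch. Everything here is an assumption made for illustration: `HostStore` is a stand-in for the database (not a Cassandra client), the claim is modeled as a compare-and-set on free RAM, and the host names and sizes are made up. Two schedulers read the same state, produce the same ordered list, and the loser falls through to its second choice rather than re-running the filters.

```python
from typing import Dict, List, Optional


class HostStore:
    """In-memory stand-in for the database; not a real Cassandra client."""

    def __init__(self, free_ram: Dict[str, int]) -> None:
        self.free_ram = dict(free_ram)

    def claim(self, host: str, expected_free: int, amount: int) -> bool:
        """Compare-and-set: succeed only if the host is unchanged since it was read."""
        if self.free_ram[host] != expected_free or expected_free < amount:
            return False
        self.free_ram[host] = expected_free - amount
        return True


def filter_and_weigh(snapshot: Dict[str, int], amount: int) -> List[str]:
    """Keep hosts with enough RAM, most free RAM first (ties broken by name)."""
    acceptable = [h for h, free in snapshot.items() if free >= amount]
    return sorted(acceptable, key=lambda h: (-snapshot[h], h))


def schedule(store: HostStore, snapshot: Dict[str, int], amount: int) -> Optional[str]:
    """Try each acceptable host in order; return the first one claimed."""
    for host in filter_and_weigh(snapshot, amount):
        if store.claim(host, snapshot[host], amount):
            return host
    return None  # only now would a full retry of the filtering be needed


store = HostStore({"host-a": 8192, "host-b": 6144, "host-c": 1024})

# Both schedulers read the same state before either one claims.
snap1 = dict(store.free_ram)
snap2 = dict(store.free_ram)

first = schedule(store, snap1, 4096)
second = schedule(store, snap2, 4096)
print(first, second)  # host-a host-b
```

The second scheduler's claim on host-a fails because the host changed after it was read, so it moves on to host-b from its original list.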

-- Ed Leafe
