[openstack-dev] [nova][scheduler][placement] Allocating Complex Resources

Ed Leafe ed at leafe.com
Mon Jun 5 21:22:06 UTC 2017

We had a very lively discussion this morning during the Scheduler subteam meeting, which continued afterwards in a Google hangout. The subject was how to handle claiming resources when the Resource Provider is not "simple". By "simple", I mean a compute node that provides all of the resources itself, as contrasted with a compute node that uses shared storage for disk space, or one that has complex nested relationships with things such as PCI devices or NUMA nodes. The current situation is as follows:

a) scheduler gets a request with certain resource requirements (RAM, disk, CPU, etc.)
b) scheduler passes these resource requirements to placement, which returns a list of hosts (compute nodes) that can satisfy the request
c) scheduler runs these hosts through some filters and weighers to get a list ordered by best "fit"
d) scheduler then tries to claim the resources by posting allocations for them to placement against the selected host
e) once the allocation succeeds, scheduler returns that host to conductor so that the VM can be built

(some details for edge cases left out for clarity of the overall process)
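
To make steps a) through e) concrete, here is a minimal sketch of the flow using an in-memory stand-in for placement. All of the names (FakePlacement, get_candidates, claim, schedule) are made up for illustration and are not the real nova or placement APIs; the "weigher" is just a key function.

```python
from dataclasses import dataclass, field


@dataclass
class FakePlacement:
    """Hypothetical in-memory stand-in for the placement service."""
    # provider UUID -> {resource class: free amount}
    inventory: dict
    # consumer UUID -> {provider UUID: {resource class: amount}}
    allocations: dict = field(default_factory=dict)

    def get_candidates(self, requested):
        """Step b: hosts whose own free inventory covers the request."""
        return [
            uuid for uuid, free in self.inventory.items()
            if all(free.get(rc, 0) >= amt for rc, amt in requested.items())
        ]

    def claim(self, consumer, host, requested):
        """Step d: record allocations against the host, or refuse."""
        free = self.inventory[host]
        if any(free.get(rc, 0) < amt for rc, amt in requested.items()):
            return False
        for rc, amt in requested.items():
            free[rc] -= amt
        self.allocations[consumer] = {host: dict(requested)}
        return True


def schedule(placement, consumer, requested, weigher):
    """Steps a-e: query, order by fit, claim, return the chosen host."""
    candidates = placement.get_candidates(requested)         # step b
    ordered = sorted(candidates, key=weigher, reverse=True)  # step c
    for host in ordered:
        if placement.claim(consumer, host, requested):       # step d
            return host                                      # step e
    return None
```

Note that this sketch only works for "simple" providers: both the query and the claim assume that the host itself holds every requested resource.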

The problem we discussed comes into play when the compute node isn't the actual provider of the resources. The easiest example to consider is when the computes are associated with a shared storage provider. The placement query is smart enough to know that even if the compute node doesn't have enough local disk, it can get the disk from the shared storage, so it will return that host in step b) above. If the scheduler then chooses that host, it will try to claim the resources by passing them, along with the compute node UUID, back to placement to make the allocations. This is the point where the current code would fall short: somehow, placement needs to know to allocate the requested disk against the shared storage provider, and not against the compute node.
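
A small sketch of that mismatch, under made-up assumptions: the provider names, the `shared_with` mapping (standing in for the aggregate association), and both helper functions are hypothetical and are not placement code.

```python
# provider UUID -> free inventory (hypothetical data)
inventory = {
    "cn1": {"VCPU": 8, "MEMORY_MB": 8192},  # compute node, no local disk
    "shared-nfs": {"DISK_GB": 1000},        # shared storage provider
}
# compute node -> sharing providers it is associated with
shared_with = {"cn1": ["shared-nfs"]}


def can_satisfy(host, requested):
    """Query side: the host qualifies if it or a sharing provider has each resource."""
    providers = [host] + shared_with.get(host, [])
    return all(
        any(inventory[p].get(rc, 0) >= amt for p in providers)
        for rc, amt in requested.items()
    )


def missing_on_host(host, requested):
    """Claim side: resource classes the host cannot cover from its own inventory."""
    return sorted(
        rc for rc, amt in requested.items()
        if inventory[host].get(rc, 0) < amt
    )


request = {"VCPU": 2, "MEMORY_MB": 2048, "DISK_GB": 50}
print(can_satisfy("cn1", request))      # the query returns cn1 as a match
print(missing_on_host("cn1", request))  # but a claim against cn1 alone lacks disk
```

The query side knows about the sharing relationship; a claim that names only the compute node UUID does not.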

One proposal is essentially to have placement reuse, at allocation time, the same logic it used to decide that the host matched the requirements in the first place. In other words, when it tries to allocate the requested amount of disk, it would determine that the host is in a shared storage aggregate, and be smart enough to allocate against that provider. This was referred to in our discussion as "Plan A".
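
Roughly, Plan A might look like the following sketch, in which the scheduler still passes only the host and the requested resources, and the allocation code works out the right provider for each resource. The function name, the `shared_with` mapping, and the data layout are all assumptions made for illustration.

```python
def allocate_plan_a(inventory, shared_with, host, requested):
    """Resolve a provider per resource class on the placement side, then allocate.

    Returns {provider UUID: {resource class: amount}}, or None if any
    resource cannot be satisfied (in which case nothing is changed).
    """
    providers = [host] + shared_with.get(host, [])

    # First pass: pick a provider for every resource class, changing nothing.
    chosen = {}
    for rc, amt in requested.items():
        provider = next(
            (p for p in providers if inventory[p].get(rc, 0) >= amt), None
        )
        if provider is None:
            return None
        chosen[rc] = provider

    # Second pass: every resource is resolvable, so apply the allocations.
    allocations = {}
    for rc, provider in chosen.items():
        inventory[provider][rc] -= requested[rc]
        allocations.setdefault(provider, {})[rc] = requested[rc]
    return allocations
```

The point of the sketch is that the knowledge of which provider backs which resource stays entirely inside placement.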

Another proposal involved a change to how placement responds to the scheduler. Instead of just returning the UUIDs of the compute nodes that satisfy the required resources, it would include a whole bunch of additional information in a structured response. A straw man example of such a response is here: https://etherpad.openstack.org/p/placement-allocations-straw-man. This was referred to as "Plan B". The main feature of this approach is that part of that response would be the JSON dict for the allocation call, containing the specific resource provider UUID for each resource. This way, when the scheduler selects a host, it would simply pass that dict back to the /allocations call, and placement would be able to do the allocations directly against that information.
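
For comparison, here is a rough sketch of the Plan B shape. The response layout below is invented for illustration and is not the format from the straw man etherpad; the function names and the `shared_with` mapping are likewise hypothetical.

```python
def build_allocations(inventory, shared_with, host, requested):
    """Work out which provider would supply each resource for this host."""
    providers = [host] + shared_with.get(host, [])
    allocations = {}
    for rc, amt in requested.items():
        provider = next(
            (p for p in providers if inventory[p].get(rc, 0) >= amt), None
        )
        if provider is None:
            return None
        allocations.setdefault(provider, {})[rc] = amt
    return allocations


def get_candidates(inventory, shared_with, compute_nodes, requested):
    """Query: return each matching host together with a ready-made allocation dict."""
    candidates = []
    for host in compute_nodes:
        allocations = build_allocations(inventory, shared_with, host, requested)
        if allocations is not None:
            candidates.append({"host": host, "allocations": allocations})
    return candidates


def post_allocations(inventory, allocations):
    """Claim: apply the dict exactly as given, with no provider lookup."""
    for provider, resources in allocations.items():
        for rc, amt in resources.items():
            if inventory[provider].get(rc, 0) < amt:
                return False
    for provider, resources in allocations.items():
        for rc, amt in resources.items():
            inventory[provider][rc] -= amt
    return True
```

Here the scheduler picks one entry from get_candidates() and hands its "allocations" value straight back to post_allocations(), so the provider relationships travel through the scheduler as opaque data.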

There was another issue raised: simply providing the host UUIDs didn't give the scheduler enough information to run its filters and weighers. Since the scheduler already uses those UUIDs to construct HostState objects, it was never completely clarified what specific information was missing, so I'm just including this aspect of the conversation for completeness. It is orthogonal to the question of how to allocate when the resource provider is not "simple".

My current feeling is that we got ourselves into our existing mess of ugly, convoluted code when we tried to add these complex relationships into the resource tracker and the scheduler. We set out to create the placement engine to bring some sanity back to how we think about things we need to virtualize. I would really hate to see us make the same mistake again, by adding a good deal of complexity to handle a few non-simple cases. What I would like to avoid, no matter which solution is eventually chosen, is representing this complexity in multiple places. Currently the only two candidates for this logic are the placement engine, which knows about these relationships already, and the compute service itself, which has to handle the management of these complex virtualized resources.

I don't know the answer. I'm hoping that we can have a discussion that might uncover a clear approach, or, at the very least, one that is less murky than the others.

-- Ed Leafe
