[openstack-dev] [nova] Proposal: Move CPU and memory allocation ratio out of scheduler

John Garbutt john at johngarbutt.com
Wed Jun 4 08:29:36 UTC 2014


On 3 June 2014 14:29, Jay Pipes <jaypipes at gmail.com> wrote:
> tl;dr
> =====
>
> Move CPU and RAM allocation ratio definition out of the Nova scheduler and
> into the resource tracker. Remove the calculations for overcommit out of the
> core_filter and ram_filter scheduler pieces.

+1

I hope to see us send more specific stats to the scheduler, so that
each filter/weigher can interpret them.

The extensible system then means you can optimise what you send down
to the scheduler to a minimum. The next step is doing differential
updates, with less frequent full sync updates. But we are getting
there.

As you say, I love that we do the calculation once per host, not once
per request. It plays really well with the caching scheduler work, and
the new build-and-run-instance flow, both of which help work towards
the scheduler process(es) doing the bare minimum on each user request.

> Details
> =======
>
> Currently, in the Nova code base, the thing that controls whether or not the
> scheduler places an instance on a compute host that is already "full" (in
> terms of memory or vCPU usage) is a pair of configuration options* called
> cpu_allocation_ratio and ram_allocation_ratio.
>
> These configuration options are defined in, respectively,
> nova/scheduler/filters/core_filter.py and
> nova/scheduler/filters/ram_filter.py.
>
> Every time an instance is launched, the scheduler loops through a collection
> of host state structures that contain resource consumption figures for each
> compute node. For each compute host, the core_filter and ram_filter's
> host_passes() method is called. In the host_passes() method, the host's
> reported total amount of CPU or RAM is multiplied by this configuration
> option, and the reported used amount of CPU or RAM is then subtracted from
> the product. If the result is greater than or equal to the number of vCPUs
> needed by the instance being launched, True is returned and the host
> continues to be considered during scheduling decisions.
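To make the quoted check concrete, here is a minimal sketch of the overcommit calculation it describes. The function name mirrors host_passes(), but the signature, the standalone constant, and the default ratio of 16.0 are illustrative, not the exact nova code:

```python
# Illustrative sketch of the overcommit check described above, not the
# actual nova filter code. 16.0 is assumed as the CPU allocation ratio.
CPU_ALLOCATION_RATIO = 16.0

def host_passes(vcpus_total, vcpus_used, requested_vcpus,
                cpu_allocation_ratio=CPU_ALLOCATION_RATIO):
    """Return True if the host still fits the request after overcommit."""
    # Total capacity is inflated by the allocation ratio...
    vcpus_limit = vcpus_total * cpu_allocation_ratio
    # ...then the used amount is subtracted to get the free capacity.
    vcpus_free = vcpus_limit - vcpus_used
    return vcpus_free >= requested_vcpus
```

Note this whole calculation re-runs for every host on every scheduling request, which is exactly the redundancy the proposal removes.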
>
> I propose we move the definition of the allocation ratios out of the
> scheduler entirely, as well as the calculation of the total amount of
> resources each compute node contains. The resource tracker is the most
> appropriate place to define these configuration options, as the resource
> tracker is what is responsible for keeping track of total and used resource
> amounts for all compute nodes.
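A rough sketch of what the quoted proposal could look like: the resource tracker applies the ratios when it reports capacity, so filters only compare free against requested. The class and method names (and the 16.0/1.5 ratio defaults) are hypothetical, not nova's actual resource tracker API:

```python
# Hypothetical sketch of a resource tracker that owns the allocation
# ratios; names and defaults are illustrative, not nova's real code.
class ResourceTracker:
    def __init__(self, vcpus, memory_mb, cpu_ratio=16.0, ram_ratio=1.5):
        self.vcpus = vcpus
        self.memory_mb = memory_mb
        self.cpu_ratio = cpu_ratio
        self.ram_ratio = ram_ratio
        self.vcpus_used = 0
        self.memory_mb_used = 0

    def stats(self):
        # Overcommit is applied here, once per host when usage changes,
        # rather than once per request inside each scheduler filter.
        return {
            'vcpus_free':
                self.vcpus * self.cpu_ratio - self.vcpus_used,
            'memory_mb_free':
                self.memory_mb * self.ram_ratio - self.memory_mb_used,
        }
```

The scheduler side then reduces to a plain comparison of the reported free values against the request.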

+1

> Benefits:
>
>  * Allocation ratios determine the amount of resources that a compute node
> advertises. The resource tracker is what determines the amount of resources
> that each compute node has, and how much of a particular type of resource
> have been used on a compute node. It therefore makes sense to put
> calculations and definition of allocation ratios where they naturally
> belong.
>  * The scheduler currently needlessly re-calculates total resource amounts
> on every call to the scheduler. This isn't necessary. The total resource
> amounts don't change unless either a configuration option is changed on a
> compute node (or host aggregate), and this calculation can be done more
> efficiently once in the resource tracker.
>  * Move more logic out of the scheduler
>  * With the move to an extensible resource tracker, we can more easily
> evolve to defining all resource-related options in the same place (instead
> of in different filter files in the scheduler...)

+1

That's a much nicer solution than shoving info from the aggregate into
the scheduler. Great to avoid that where possible.


Now there are limits to this, I think. Some examples that come to mind:
* For per-aggregate ratios, we just report the free resources, taking
into account the ratio (as above).
* For the availability zone filter, each host should report its
availability zone to the scheduler.
* If we have filters that adjust the ratio per flavour, we will still
need that calculation in the scheduler, but that's fine.


In general, the approach I am advocating is:
* each host provides the data needed for the filter/weigher
* ideally in a way that requires minimal processing

And after some IRC discussions with Dan Smith, he pointed out that we
also need to think about:
* versioning the data in a way that supports live upgrades
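One way to picture that last point: tag the host stats payload with a version so an older scheduler can reject (or tolerate) data it does not understand during a rolling upgrade. This is only a sketch of the idea; the function names, the JSON wire format, and the major/minor convention are assumptions, not nova's actual versioned-object machinery:

```python
# Hedged sketch of version-tagged host stats for live upgrades; the
# wire format and names here are illustrative assumptions.
import json

STATS_VERSION = '1.1'  # major.minor; minor bumps stay backward compatible

def serialize_stats(stats):
    """Wrap the stats dict with a version tag for the scheduler."""
    return json.dumps({'version': STATS_VERSION, 'data': stats})

def deserialize_stats(payload, supported_major='1'):
    """Unwrap stats, refusing payloads from an incompatible major version."""
    doc = json.loads(payload)
    major = doc['version'].split('.')[0]
    if major != supported_major:
        raise ValueError('incompatible stats version %s' % doc['version'])
    return doc['data']
```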


Thanks,
John
