[openstack-dev] [nova] Update on scheduler and resource tracker progress

Jay Pipes jaypipes at gmail.com
Thu Feb 11 20:24:04 UTC 2016

Hello all,

Performance working group, please pay attention to Chapter 2 in the 
details section.


At the Nova mid-cycle, we finalized decisions on a way forward in 
redesigning the way that resources are tracked in Nova. This work is a 
major undertaking and has implications for splitting out the scheduler 
from Nova, for the ability of the placement engine to scale, and for 
removing long-standing reporting and race condition bugs that have 
plagued Nova for years.

The following blueprint specifications outline the effort, which we are 
calling the "resource providers framework":

* resource-classes (bp MERGED, code MERGED)
* pci-generate-stats (bp MERGED, code IN REVIEW)
* resource-providers (bp MERGED, code IN REVIEW)
* generic-resource-pools (bp IN REVIEW, code TODO)
* compute-node-inventory (bp IN REVIEW, code TODO)
* resource-providers-allocations (bp IN REVIEW, code TODO)
* resource-providers-scheduler (bp IN REVIEW, code TODO)

The group working on this code and doing the reviews is hopeful that 
the generic-resource-pools work can be completed in Mitaka. We will also 
aim to get the compute-node-inventory work done in Mitaka, though that 
will be more of a stretch.

The remainder of the resource providers framework blueprints will be 
targeted to Newton. The resource-providers-scheduler blueprint is the 
final blueprint required before the scheduler can be fully separated 
from Nova.


Chapter 1 - How the blueprints fit together

A request to launch an instance in Nova involves requests for two 
different things: *resources* and *capabilities*. Resources are the 
quantitative part of the request spec. Capabilities are the qualitative 
part of the request.

The *resource providers framework* is a set of 7 blueprints that 
reorganize the way that Nova handles the quantitative side of the 
equation. These 7 blueprints are described below.

Compute nodes are a type of *resource provider*, since they allow 
instances to *consume* some portion of their *inventory* of various types 
of resources. We call these types of resources *"resource classes"*.

resource-classes bp: https://review.openstack.org/256297

The resource-providers blueprint introduces a new set of tables for 
storing capacity and usage amounts of all resources in the system:

resource-providers bp: https://review.openstack.org/225546
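
To make the capacity/usage split concrete, here is a minimal sketch of 
the idea in plain Python. The class names, field names and resource 
class strings are illustrative assumptions, not the schema from the 
blueprint. The only point is that free capacity becomes "inventory minus 
allocations" for every resource class, computed the same way each time:

```python
from dataclasses import dataclass


# Hypothetical stand-ins for rows in the new tables; the field names are
# assumptions made for this sketch, not the blueprint's schema.
@dataclass(frozen=True)
class Inventory:
    provider: str
    resource_class: str
    total: int


@dataclass(frozen=True)
class Allocation:
    provider: str
    consumer: str
    resource_class: str
    used: int


def free_capacity(inventories, allocations, provider, resource_class):
    """Capacity minus usage for one resource class on one provider."""
    total = sum(i.total for i in inventories
                if i.provider == provider
                and i.resource_class == resource_class)
    used = sum(a.used for a in allocations
               if a.provider == provider
               and a.resource_class == resource_class)
    return total - used


inventories = [
    Inventory("compute-1", "VCPU", 16),
    Inventory("compute-1", "MEMORY_MB", 32768),
]
allocations = [
    Allocation("compute-1", "instance-a", "VCPU", 4),
    Allocation("compute-1", "instance-a", "MEMORY_MB", 8192),
    Allocation("compute-1", "instance-b", "VCPU", 2),
]

print(free_capacity(inventories, allocations, "compute-1", "VCPU"))
```

Note that nothing in free_capacity() is specific to vCPU or RAM; a new 
resource class would just be another pair of rows.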

While all compute nodes are resource providers [1], not all resource 
providers are compute nodes. *Generic resource pools* are resource 
providers that have an inventory of a *single resource class* and that 
provide that resource class to consumers that are placed on multiple 
compute nodes.

The canonical example of a generic resource pool is a shared storage 
system. Currently, a Nova compute node doesn't really know whether the 
storage location it uses for storing disk images is a shared 
drive/cluster (a la NFS or RBD) or a local disk drive [2]. The 
generic-resource-pools blueprint covers the addition of these generic 
resource pools, their relation to host aggregates, and the RESTful 
API [3] added to control these external resource pools.

generic-resource-pools bp: https://review.openstack.org/253187
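
As a rough illustration of why modeling shared storage as its own 
provider matters, the sketch below uses made-up numbers (three compute 
nodes on one 1000 GB share) and a made-up dictionary layout. It is not 
how Nova represents pools; it only shows the double counting described 
in [2] going away once the disk is inventoried once:

```python
# Hypothetical numbers: three compute nodes all backed by one 1000 GB share.
SHARED_POOL_GB = 1000
compute_nodes = ["compute-1", "compute-2", "compute-3"]

# Without pools: each node reports the shared storage as if it were its
# own local disk, so summing the reports overstates capacity.
per_node_report = {node: SHARED_POOL_GB for node in compute_nodes}
naive_total = sum(per_node_report.values())

# With a generic resource pool: the disk is inventoried once, on the pool,
# and the pool is merely associated with the nodes that consume from it.
pool = {
    "name": "nfs-share",            # illustrative name
    "resource_class": "DISK_GB",    # a single resource class per pool
    "total": SHARED_POOL_GB,
    "members": set(compute_nodes),
}
pooled_total = pool["total"]

print(naive_total, pooled_total)
```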

Within the Nova database schemas [4], capacity and inventory information 
is stored in a variety of tables, columns and formats. vCPU, RAM and 
DISK capacity information is stored in integer fields, PCI capacity 
information is stored in the pci_devices table, NUMA inventory is stored 
combined together with usage information in a JSON blob, etc. The 
compute-node-inventory blueprint migrates all of the disparate capacity 
information from compute_nodes into the new inventory table.

compute-node-inventory bp: https://review.openstack.org/260048

For the PCI resource classes, Nova currently has an entirely different 
resource tracker (in /nova/pci/*) that stores an aggregate view of the 
PCI resources (grouped by product, vendor, and numa node) in the 
compute_nodes.pci_stats field. This summary is entirely redundant, 
since all fine-grained PCI resource information is already stored in 
the pci_devices table, and keeping the two copies consistent presents a 
sync problem. The pci-generate-stats blueprint describes the effort to 
remove this storage of summary device pool information and instead 
generate this summary information on the fly for the scheduler. This 
work is a pre-requisite to having all resource classes managed in a 
unified manner in Nova:

pci-generate-stats bp: https://review.openstack.org/240852
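
Generating the summary on the fly amounts to a group-and-count over the 
fine-grained device records. A minimal sketch, assuming simplified 
device dictionaries (the field names and the "available" status value 
are assumptions for illustration, and generate_pci_stats is a 
hypothetical helper, not a Nova function):

```python
from collections import Counter

# Hypothetical rows standing in for records from the pci_devices table.
pci_devices = [
    {"vendor_id": "8086", "product_id": "10fb", "numa_node": 0,
     "status": "available"},
    {"vendor_id": "8086", "product_id": "10fb", "numa_node": 0,
     "status": "available"},
    {"vendor_id": "8086", "product_id": "10fb", "numa_node": 1,
     "status": "allocated"},
    {"vendor_id": "15b3", "product_id": "1004", "numa_node": 1,
     "status": "available"},
]


def generate_pci_stats(devices):
    """Count available devices per (product, vendor, NUMA node) group."""
    counts = Counter(
        (d["product_id"], d["vendor_id"], d["numa_node"])
        for d in devices
        if d["status"] == "available"
    )
    return [
        {"product_id": p, "vendor_id": v, "numa_node": n, "count": c}
        for (p, v, n), c in sorted(counts.items())
    ]


for pool in generate_pci_stats(pci_devices):
    print(pool)
```

Because the summary is derived from the device records every time, there 
is no second copy that can drift out of sync.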

In the same way that capacity fields are scattered among different 
tables, columns and formats, so too are the fields that store usage 
information. Some fields are in the instances table, some in the 
instance_extra table, some information is derived from the pci_devices 
table, other bits from a JSON blob field. In short, it's an inconsistent 
mess. This mess means that supporting additional types of resources 
typically involves adding yet more inconsistency and conditional logic 
to the scheduler and nova-compute's resource tracker. The 
resource-providers-allocations blueprint involves work to 
migrate all usage record information out of the disparate fields in the 
current schema and into the allocations table introduced in the 
resource-providers blueprint:

resource-providers-allocations bp: https://review.openstack.org/271779

Once all of the inventory (capacity) and allocation (usage) information 
has been migrated to the database schema described in the 
resource-providers blueprint, Nova will be treating all types of 
resources in a generic fashion. The next step is to modify the scheduler 
to take advantage of this new resource representation. The 
resource-providers-scheduler blueprint undertakes this important step:

resource-providers-scheduler bp: https://review.openstack.org/271823

Chapter 2 - Addressing performance and scale

One of the significant performance problems with the Nova scheduler is 
the fact that for every call to the select_destinations() RPC API method 
-- which itself is called at least once every time a launch or migration 
request is made -- the scheduler grabs all records for all compute nodes 
in the deployment. After retrieving all these compute node records, the 
scheduler runs each through a set of filters to determine which compute 
nodes have the required capacity to service the instance's requested 
resources. Having the scheduler continually retrieve every compute node 
record on each request to select_destinations() is extremely 
inefficient. The greater the number of compute nodes, the bigger the 
performance and scale problem this becomes.

On a loaded cloud deployment -- say there are 1000 compute nodes and 900 
of them are fully loaded with active virtual machines -- the scheduler 
is still going to retrieve all 1000 compute node records on every 
request to select_destinations() and process each one of those records 
through all scheduler filters. Clearly, if we could reduce the number of 
compute node records that are returned by removing those nodes that do 
not have available capacity, we could dramatically reduce the amount of 
work that each call to select_destinations() would need to perform.

The resource-providers-scheduler blueprint attempts to address the above 
problem by replacing a number of the scheduler filters that currently 
run *after* the database has returned all compute node records with 
a series of WHERE clauses and join conditions on the database query 
itself. The idea here is to winnow the number of returned compute node 
results as much as possible. The fewer records the scheduler must 
post-process, the faster each individual call to select_destinations() 
will be.
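
To give a feel for what pushing the capacity check into the query might 
look like, here is a self-contained sketch using an in-memory SQLite 
database. The table layout, column names and the 
providers_with_capacity() helper are assumptions for illustration only; 
the real query shape is whatever the blueprint ends up specifying:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical tables: one inventory row per provider and class.
conn.executescript("""
CREATE TABLE inventories (
    provider_id INTEGER, resource_class TEXT, total INTEGER);
CREATE TABLE allocations (
    provider_id INTEGER, resource_class TEXT, used INTEGER);
""")
conn.executemany("INSERT INTO inventories VALUES (?, ?, ?)", [
    (1, "VCPU", 8), (1, "MEMORY_MB", 16384),
    (2, "VCPU", 8), (2, "MEMORY_MB", 16384),
    (3, "VCPU", 8), (3, "MEMORY_MB", 16384),
])
conn.executemany("INSERT INTO allocations VALUES (?, ?, ?)", [
    (1, "VCPU", 8), (1, "MEMORY_MB", 16384),  # provider 1 is full
    (2, "VCPU", 2), (2, "MEMORY_MB", 4096),   # provider 2 is lightly used
    (3, "VCPU", 7), (3, "MEMORY_MB", 2048),   # provider 3 has 1 vCPU left
])

# Keep only providers whose free amount satisfies BOTH requested classes.
QUERY = """
SELECT inv.provider_id
FROM inventories AS inv
LEFT JOIN (
    SELECT provider_id, resource_class, SUM(used) AS used
    FROM allocations
    GROUP BY provider_id, resource_class
) AS alloc
  ON alloc.provider_id = inv.provider_id
 AND alloc.resource_class = inv.resource_class
WHERE (inv.resource_class = 'VCPU'
       AND inv.total - COALESCE(alloc.used, 0) >= :vcpus)
   OR (inv.resource_class = 'MEMORY_MB'
       AND inv.total - COALESCE(alloc.used, 0) >= :ram_mb)
GROUP BY inv.provider_id
HAVING COUNT(*) = 2
ORDER BY inv.provider_id
"""


def providers_with_capacity(conn, vcpus, ram_mb):
    """Return IDs of providers that can fit the requested vCPU and RAM."""
    rows = conn.execute(QUERY, {"vcpus": vcpus, "ram_mb": ram_mb})
    return [row[0] for row in rows]


print(providers_with_capacity(conn, vcpus=2, ram_mb=4096))
```

In this toy data set the full provider and the provider with only one 
free vCPU never leave the database, so the scheduler has nothing to 
post-process for them.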

The second major scale problem with the current Nova scheduler design 
has to do with the fact that the scheduler does *not* actually claim 
resources on a provider. Instead, the scheduler selects a destination 
host to place the instance on and the Nova conductor then sends a 
message to that target host which attempts to spawn the instance on its 
hypervisor. If the spawn succeeds, the target compute host updates the 
Nova database and decrements its count of available resources. These 
steps (from nova-scheduler to nova-conductor to nova-compute to 
database) all take some not insignificant amount of time. During this 
time window, a different scheduler process may pick the exact same 
target host for a like-sized launch request. If there is only room on 
the target host for one request of that size [5], one of those spawn 
requests will fail and trigger a retry operation. This retry operation 
will attempt to repeat the scheduler placement decisions (by calling 
select_destinations() again).

This retry operation is relatively expensive and needlessly so: if the 
scheduler claimed the resources on the target host before sending its 
pick back to the conductor, then the chances of producing a retry would 
be almost eliminated [6]. The resource-providers-scheduler blueprint 
attempts to remedy this second scaling design problem by having the 
scheduler write records to the allocations table before sending the 
selected target host back to the Nova conductor.
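
One way to picture a claim made at scheduling time is a conditional 
write: the allocation row is inserted only if enough capacity remains, 
so a second scheduler picking the same host finds out immediately 
rather than after a failed spawn. The sketch below uses an in-memory 
SQLite database; the tables, columns and the claim() helper are 
assumptions for illustration, and the blueprint may well use a different 
mechanism to make the write safe under concurrency:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, hypothetical tables; provider 1 has room for 4 vCPUs in total.
conn.executescript("""
CREATE TABLE inventories (
    provider_id INTEGER, resource_class TEXT, total INTEGER);
CREATE TABLE allocations (
    provider_id INTEGER, consumer_id TEXT,
    resource_class TEXT, used INTEGER);
INSERT INTO inventories VALUES (1, 'VCPU', 4);
""")

# Insert the allocation only when (total - already used) covers the request.
CLAIM = """
INSERT INTO allocations (provider_id, consumer_id, resource_class, used)
SELECT :provider, :consumer, :rc, :amount
WHERE (SELECT total FROM inventories
       WHERE provider_id = :provider AND resource_class = :rc)
    - (SELECT COALESCE(SUM(used), 0) FROM allocations
       WHERE provider_id = :provider AND resource_class = :rc)
    >= :amount
"""


def claim(conn, provider, consumer, resource_class, amount):
    """Try to record an allocation; return True if the claim succeeded."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(CLAIM, {
            "provider": provider, "consumer": consumer,
            "rc": resource_class, "amount": amount,
        })
    return cur.rowcount == 1


# Two like-sized requests land on the same host; only the first one fits.
print(claim(conn, 1, "instance-a", "VCPU", 4))
print(claim(conn, 1, "instance-b", "VCPU", 4))
```

The losing request can then move on to another candidate host inside the 
scheduler, instead of travelling to nova-compute and back first.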


Thanks if you've made it this far in this epic email. :) If you have 
questions about the plans, please do feel free to respond here or come 
find us on Freenode #openstack-nova IRC. Your reviews and comments are 
also very welcome on the specs and patches.


[1] One might argue that nova-compute daemons that proxy for some other 
resource manager like vCenter or Ironic are not actually resource 
providers, but just go with me on this one...

[2] This results in a number of resource reporting bugs, including Nova 
reporting that the deployment has X times as much disk capacity as it 
really does (where X is the number of compute nodes sharing the same 
storage location).

[3] The RESTful API in the generic-resource-pools blueprint actually 
will be a completely new REST endpoint and service (/placement) that 
will be the start of the new extracted scheduler.

[4] Nova has two database schemas. The first is what is known as the 
Child Cell database and contains the majority of database tables. The 
second is known as the API database and contains global and top-level 
routing tables.

[5] This situation is more common than you might originally think. Any 
cloud that runs a pack-first placement strategy with multiple scheduler 
daemon processes will suffer from this problem.

[6] Technically, it cannot be eliminated because an out-of-band 
operation could theoretically occur (for example, an administrator could 
manually -- not through Nova -- launch a virtual machine on the target 
host) and therefore introduce some unaccounted-for amount of used 
resources for a small window of time between runs of the periodic audit 
task in nova-compute.
