[openstack-dev] [nova] Update on scheduler and resource tracker progress
Jay Pipes
jaypipes at gmail.com
Thu Feb 11 20:24:04 UTC 2016
Hello all,
Performance working group, please pay attention to Chapter 2 in the
details section.
tl;dr
-----
At the Nova mid-cycle, we finalized decisions on a way forward in
redesigning the way that resources are tracked in Nova. This work is a
major undertaking and has implications for splitting out the scheduler
from Nova, for the ability of the placement engine to scale, and for
removing long-standing reporting and race condition bugs that have
plagued Nova for years.
The following blueprint specifications outline the effort, which we are
calling the "resource providers framework":
* resource-classes (bp MERGED, code MERGED)
* pci-generate-stats (bp MERGED, code IN REVIEW)
* resource-providers (bp MERGED, code IN REVIEW)
* generic-resource-pools (bp IN REVIEW, code TODO)
* compute-node-inventory (bp IN REVIEW, code TODO)
* resource-providers-allocations (bp IN REVIEW, code TODO)
* resource-providers-scheduler (bp IN REVIEW, code TODO)
The group working on this code and doing the reviews is hopeful that
the generic-resource-pools work can be completed in Mitaka. We will
also aim to get the compute-node-inventory work done in Mitaka, though
that will be more of a stretch.
The remainder of the resource providers framework blueprints will be
targeted to Newton. The resource-providers-scheduler blueprint is the
final blueprint required before the scheduler can be fully separated
from Nova.
details
-------
Chapter 1 - How the blueprints fit together
===========================================
A request to launch an instance in Nova involves requests for two
different things: *resources* and *capabilities*. Resources are the
quantitative part of the request spec. Capabilities are the qualitative
part of the request.
The *resource providers framework* is a set of 7 blueprints that
reorganize the way that Nova handles the quantitative side of the
equation. These 7 blueprints are described below.
Compute nodes are a type of *resource provider*, since they allow
instances to *consume* some portion of their *inventory* of various
types of resources. We call these types of resources *"resource classes"*.
resource-classes bp: https://review.openstack.org/256297
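To make that a bit more concrete, here is a rough sketch of what a set
of standard resource classes might look like. Treat the exact names as
assumptions on my part; the resource-classes spec is the authority here.

    # Illustrative sketch only; the canonical list of standard resource
    # classes is defined in the resource-classes spec, not here.
    import enum

    class ResourceClass(enum.Enum):
        VCPU = 'VCPU'
        MEMORY_MB = 'MEMORY_MB'
        DISK_GB = 'DISK_GB'
        PCI_DEVICE = 'PCI_DEVICE'
        NUMA_SOCKET = 'NUMA_SOCKET'
        NUMA_CORE = 'NUMA_CORE'
        NUMA_THREAD = 'NUMA_THREAD'
        NUMA_MEMORY_MB = 'NUMA_MEMORY_MB'
        IPV4_ADDRESS = 'IPV4_ADDRESS'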
The resource-providers blueprint introduces a new set of tables for
storing capacity and usage amounts of all resources in the system:
resource-providers bp: https://review.openstack.org/225546
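As a mental model (not the authoritative schema, which lives in the
spec and its migrations), you can picture the new tables roughly as
below. The column names here are assumptions based on the blueprint
discussion; I'm using SQLAlchemy since that's what Nova uses.

    # Rough sketch of the proposed tables; column names and types are
    # assumptions based on the blueprint, not the final migration.
    from sqlalchemy import Column, Float, Integer, MetaData, String, Table

    metadata = MetaData()

    resource_providers = Table(
        'resource_providers', metadata,
        Column('id', Integer, primary_key=True),
        Column('uuid', String(36), nullable=False, unique=True),
    )

    # One row per (provider, resource class): how much capacity exists.
    inventories = Table(
        'inventories', metadata,
        Column('id', Integer, primary_key=True),
        Column('resource_provider_id', Integer, nullable=False),
        Column('resource_class_id', Integer, nullable=False),
        Column('total', Integer, nullable=False),
        Column('reserved', Integer, nullable=False),
        Column('allocation_ratio', Float, nullable=False),
    )

    # One row per (consumer, provider, resource class): how much is used.
    allocations = Table(
        'allocations', metadata,
        Column('id', Integer, primary_key=True),
        Column('resource_provider_id', Integer, nullable=False),
        Column('consumer_uuid', String(36), nullable=False),
        Column('resource_class_id', Integer, nullable=False),
        Column('used', Integer, nullable=False),
    )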
While all compute nodes are resource providers [1], not all resource
providers are compute nodes. *Generic resource pools* are resource
providers that have an inventory of a *single resource class* and that
provide that resource class to consumers that are placed on multiple
compute nodes.
The canonical example of a generic resource pool is a shared storage
system. Currently, a Nova compute node doesn't really know whether the
storage location it uses for disk images is a shared drive/cluster
(a la NFS or RBD) or a local disk drive [2]. The
generic-resource-pools blueprint covers the addition
of these generic resource pools, their relation to host aggregates, and
the RESTful API [3] added to control this external resource pool
information.
generic-resource-pools bp: https://review.openstack.org/253187
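To make the shared-storage example concrete, here is a purely
illustrative sketch, reusing the assumed table shapes above, of how an
NFS share exported to an aggregate of compute nodes might be recorded:
a single DISK_GB inventory for the pool, rather than one per compute
node. The 'name' attribute and the exact aggregate association are my
own placeholders, not anything promised by the spec.

    # Purely illustrative data; shapes are assumptions, not the spec.
    import uuid

    nfs_pool = {
        'uuid': str(uuid.uuid4()),   # the generic resource pool
        'name': 'nfs-share-1',       # hypothetical attribute
    }

    # The pool exposes a single resource class: DISK_GB.
    nfs_pool_inventory = {
        'resource_provider': nfs_pool['uuid'],
        'resource_class': 'DISK_GB',
        'total': 100000,             # 100 TB of shared disk
        'reserved': 1000,
        'allocation_ratio': 1.0,
    }

    # The pool is associated with a host aggregate; instances landing on
    # any compute node in that aggregate consume DISK_GB from the pool,
    # not from the individual compute nodes.
    aggregate_association = {
        'resource_provider': nfs_pool['uuid'],
        'aggregate_uuid': str(uuid.uuid4()),
    }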
Within the Nova database schemas [4], capacity and inventory information
is stored in a variety of tables, columns and formats. vCPU, RAM and
DISK capacity information is stored in integer fields, PCI capacity
information is stored in the pci_devices table, NUMA inventory is stored
combined together with usage information in a JSON blob, etc. The
compute-node-inventory blueprint migrates all of the disparate capacity
information from compute_nodes into the new inventory table.
compute-node-inventory bp: https://review.openstack.org/260048
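As a sketch of the direction (not the actual data migration), the idea
is to turn the per-column capacity fields on a compute_nodes row into
per-resource-class inventory rows. The input field names below mirror
today's compute_nodes columns; the output shape is assumed from the
blueprint.

    # Illustrative only: convert a compute_nodes record's capacity
    # columns into generic inventory rows keyed by resource class.
    def compute_node_to_inventories(compute_node):
        return [
            {'resource_class': 'VCPU',
             'total': compute_node['vcpus'],
             'allocation_ratio': compute_node.get('cpu_allocation_ratio', 16.0)},
            {'resource_class': 'MEMORY_MB',
             'total': compute_node['memory_mb'],
             'allocation_ratio': compute_node.get('ram_allocation_ratio', 1.5)},
            {'resource_class': 'DISK_GB',
             'total': compute_node['local_gb'],
             'allocation_ratio': compute_node.get('disk_allocation_ratio', 1.0)},
        ]

    # Example usage with a made-up record:
    rows = compute_node_to_inventories(
        {'vcpus': 32, 'memory_mb': 131072, 'local_gb': 2048})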
For the PCI resource classes, Nova currently has an entirely different
resource tracker (in /nova/pci/*) that stores an aggregate view of the
PCI resources (grouped by product, vendor, and numa node) in the
compute_nodes.pci_stats field. This summary is entirely redundant,
since all fine-grained PCI resource information is already stored in
the pci_devices table, and keeping the two representations in sync is
a recurring problem. The pci-generate-stats blueprint describes the effort to
remove this storage of summary device pool information and instead
generate this summary information on the fly for the scheduler. This
work is a pre-requisite to having all resource classes managed in a
unified manner in Nova:
pci-generate-stats bp: https://review.openstack.org/240852
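The "generate on the fly" part can be pictured as a simple
group-and-count over pci_devices rows. This is a sketch only; the real
code has to deal with dev_type, parent devices, whitelists and so on.

    # Sketch: derive PCI "pools" from fine-grained pci_devices rows
    # instead of persisting a summary blob in compute_nodes.pci_stats.
    from collections import Counter

    def generate_pci_pools(pci_devices):
        """Group available PCI devices by (vendor, product, NUMA node)."""
        pools = Counter()
        for dev in pci_devices:
            if dev['status'] != 'available':
                continue
            pools[(dev['vendor_id'], dev['product_id'], dev['numa_node'])] += 1
        return [
            {'vendor_id': v, 'product_id': p, 'numa_node': n, 'count': c}
            for (v, p, n), c in pools.items()
        ]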
In the same way that capacity fields are scattered among different
tables, columns and formats, so too are the fields that store usage
information. Some fields are in the instances table, some in the
instance_extra table, some information is derived from the pci_devices
table, other bits from a JSON blob field. In short, it's an inconsistent
mess. This mess means that adding support for additional types of
resources typically involves introducing yet more inconsistency and
conditional logic into the scheduler and nova-compute's resource
tracker. The resource-providers-allocations blueprint involves work to
migrate all usage record information out of the disparate fields in the
current schema and into the allocations table introduced in the
resource-providers blueprint:
resource-providers-allocations bp: https://review.openstack.org/271779
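After that migration, the usage records for a single instance reduce
to a handful of uniform allocation rows. An illustrative example
follows; the row shape is assumed from the resource-providers spec and
the UUIDs are placeholders.

    # Illustrative allocation rows for one instance on one compute node.
    instance_uuid = 'instance-uuid-placeholder'
    compute_node_uuid = 'compute-node-uuid-placeholder'

    instance_allocations = [
        {'resource_provider': compute_node_uuid,
         'consumer': instance_uuid, 'resource_class': 'VCPU', 'used': 4},
        {'resource_provider': compute_node_uuid,
         'consumer': instance_uuid, 'resource_class': 'MEMORY_MB', 'used': 8192},
        {'resource_provider': compute_node_uuid,
         'consumer': instance_uuid, 'resource_class': 'DISK_GB', 'used': 80},
    ]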
Once all of the inventory (capacity) and allocation (usage) information
has been migrated to the database schema described in the
resource-providers blueprint, Nova will be treating all types of
resources in a generic fashion. The next step is to modify the scheduler
to take advantage of this new resource representation. The
resource-providers-scheduler blueprint undertakes this important step:
resource-providers-scheduler bp: https://review.openstack.org/271823
Chapter 2 - Addressing performance and scale
============================================
One of the significant performance problems with the Nova scheduler is
the fact that for every call to the select_destinations() RPC API method
-- which itself is called at least once every time a launch or migration
request is made -- the scheduler grabs all records for all compute nodes
in the deployment. After retrieving all these compute node records, the
scheduler runs each through a set of filters to determine which compute
nodes have the required capacity to service the instance's requested
resources. Having the scheduler continually retrieve every compute node
record on each request to select_destinations() is extremely
inefficient. The greater the number of compute nodes, the bigger the
performance and scale problem this becomes.
On a loaded cloud deployment -- say there are 1000 compute nodes and 900
of them are fully loaded with active virtual machines -- the scheduler
is still going to retrieve all 1000 compute node records on every
request to select_destinations() and process each one of those records
through all scheduler filters. Clearly, if we could reduce the number
of compute node records returned by excluding nodes that do not have
available capacity, we could dramatically cut the amount of work that
each call to select_destinations() needs to perform.
The resource-providers-scheduler blueprint attempts to address the above
problem by replacing a number of the scheduler filters that currently
run *after* the database has returned all compute node records with a
series of WHERE clauses and join conditions applied directly to the database
query. The idea here is to winnow the number of returned compute node
results as much as possible. The fewer records the scheduler must
post-process, the faster the performance of each individual call to
select_destinations().
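Here is a sketch of the kind of query the blueprint is aiming for,
reusing the inventories and allocations table objects sketched in
Chapter 1 (which were themselves assumptions): sum the allocations per
provider for a resource class, join to inventories, and keep only
providers with enough headroom for the request. This is illustrative,
not the spec's exact SQL.

    # Sketch: find resource providers that can fit a request for a given
    # resource class, pushing the capacity check into the database.
    # Reuses the inventories/allocations Table objects sketched earlier.
    from sqlalchemy import and_, func, select

    def providers_with_capacity(resource_class_id, requested):
        # Total usage per provider for the requested resource class.
        usage = select([
            allocations.c.resource_provider_id.label('rp_id'),
            func.sum(allocations.c.used).label('used'),
        ]).where(
            allocations.c.resource_class_id == resource_class_id
        ).group_by(allocations.c.resource_provider_id).alias('usage')

        # Keep only providers whose inventory still has headroom:
        # used + requested <= (total - reserved) * allocation_ratio
        return select([inventories.c.resource_provider_id]).select_from(
            inventories.outerjoin(
                usage, usage.c.rp_id == inventories.c.resource_provider_id)
        ).where(and_(
            inventories.c.resource_class_id == resource_class_id,
            func.coalesce(usage.c.used, 0) + requested <=
                (inventories.c.total - inventories.c.reserved) *
                inventories.c.allocation_ratio,
        ))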
The second major scale problem with the current Nova scheduler design
has to do with the fact that the scheduler does *not* actually claim
resources on a provider. Instead, the scheduler selects a destination
host to place the instance on and the Nova conductor then sends a
message to that target host which attempts to spawn the instance on its
hypervisor. If the spawn succeeds, the target compute host updates the
Nova database and decrements its count of available resources. These
steps (from nova-scheduler to nova-conductor to nova-compute to
database) all take some not insignificant amount of time. During this
time window, a different scheduler process may pick the exact same
target host for a like-sized launch request. If there is only room on
the target host for one request of that size [5], one of the spawn
requests will fail and trigger a retry operation. This retry operation
will attempt to repeat the scheduler placement decisions (by calling
select_destinations()).
This retry operation is relatively expensive and needlessly so: if the
scheduler claimed the resources on the target host before sending its
pick back to the conductor, the chances of triggering a retry would
be almost eliminated [6]. The resource-providers-scheduler blueprint
attempts to remedy this second scaling design problem by having the
scheduler write records to the allocations table before sending the
selected target host back to the Nova conductor.
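One way to picture such a claim, assuming the allocations and
inventories tables sketched in Chapter 1 and a SQLAlchemy connection:
write the allocation rows and re-check capacity inside a single
database transaction, and only report the host back to the conductor
if the claim sticks. The spec will pin down the actual concurrency
mechanism; this is just a minimal sketch of the idea.

    # Sketch: a scheduler-side claim as a transactional write against
    # the allocations table, rather than a message to the compute host.
    from sqlalchemy import func, select

    def claim(conn, provider_id, consumer_uuid, requested):
        """Attempt a claim; requested maps resource_class_id -> amount."""
        with conn.begin():
            for rc_id, amount in requested.items():
                conn.execute(allocations.insert().values(
                    resource_provider_id=provider_id,
                    consumer_uuid=consumer_uuid,
                    resource_class_id=rc_id,
                    used=amount))
            # Re-check capacity inside the same transaction; if another
            # scheduler raced us past capacity, raising here rolls the
            # inserts back and the caller picks a different host.
            for rc_id in requested:
                used = conn.execute(
                    select([func.sum(allocations.c.used)]).where(
                        (allocations.c.resource_provider_id == provider_id) &
                        (allocations.c.resource_class_id == rc_id))
                ).scalar() or 0
                capacity = conn.execute(
                    select([(inventories.c.total - inventories.c.reserved) *
                            inventories.c.allocation_ratio]).where(
                        (inventories.c.resource_provider_id == provider_id) &
                        (inventories.c.resource_class_id == rc_id))
                ).scalar()
                if capacity is None or used > capacity:
                    raise RuntimeError('claim failed; retry on another host')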
Conclusions
===========
Thanks if you've made it this far in this epic email. :) If you have
questions about the plans, please do feel free to respond here or come
find us on Freenode #openstack-nova IRC. Your reviews and comments are
also very welcome on the specs and patches.
Best,
-jay
[1] One might argue that nova-compute daemons that proxy for some other
resource manager like vCenter or Ironic are not actually resource
providers, but just go with me on this one...
[2] This results in a number of resource reporting bugs, including Nova
reporting that the deployment has X times as much disk capacity as it
really does (where X is the number of compute nodes sharing the same
storage location).
[3] The RESTful API in the generic-resource-pools blueprint will
actually be a completely new REST endpoint and service (/placement)
that will be the start of the new extracted scheduler.
[4] Nova has two database schemas. The first is what is known as the
Child Cell database and contains the majority of database tables. The
second is known as the API database and contains global and top-level
routing tables.
[5] This situation is more common than you might originally think. Any
cloud that runs a pack-first placement strategy with multiple scheduler
daemon processes will suffer from this problem.
[6] Technically, it cannot be eliminated because an out-of-band
operation could theoretically occur (for example, an administrator could
manually -- not through Nova -- launch a virtual machine on the target
host) and therefore introduce some unaccounted-for amount of used
resources in the window between runs of nova-compute's periodic audit
task.