[openstack-dev] [nova] Update on scheduler and resource tracker progress

Ryan Rossiter rlrossit at linux.vnet.ibm.com
Fri Feb 12 16:07:01 UTC 2016


> On Feb 11, 2016, at 2:24 PM, Jay Pipes <jaypipes at gmail.com> wrote:
> 
> Hello all,
> 
> Performance working group, please pay attention to Chapter 2 in the details section.
> 
> tl;dr
> -----
> 
> At the Nova mid-cycle, we finalized decisions on a way forward in redesigning the way that resources are tracked in Nova. This work is a major undertaking and has implications for splitting out the scheduler from Nova, for the ability of the placement engine to scale, and for removing long-standing reporting and race condition bugs that have plagued Nova for years.
> 
> The following blueprint specifications outline the effort, which we are calling the "resource providers framework":
> 
> * resource-classes (bp MERGED, code MERGED)
> * pci-generate-stats (bp MERGED, code IN REVIEW)
> * resource-providers (bp MERGED, code IN REVIEW)
> * generic-resource-pools (bp IN REVIEW, code TODO)
> * compute-node-inventory (bp IN REVIEW, code TODO)
> * resource-providers-allocations (bp IN REVIEW, code TODO)
> * resource-providers-scheduler (bp IN REVIEW, code TODO)
> 
> The group working on this code and doing the reviews is hopeful that the generic-resource-pools work can be completed in Mitaka, and we are also aiming to get the compute-node-inventory work done in Mitaka, though that will be more of a stretch.
> 
> The remainder of the resource providers framework blueprints will be targeted to Newton. The resource-providers-scheduler blueprint is the final blueprint required before the scheduler can be fully separated from Nova.
> 
> details
> -------
> 
> Chapter 1 - How the blueprints fit together
> ===========================================
> 
> A request to launch an instance in Nova involves requests for two different things: *resources* and *capabilities*. Resources are the quantitative part of the request spec. Capabilities are the qualitative part of the request.
> 
> The *resource providers framework* is a set of 7 blueprints that reorganize the way that Nova handles the quantitative side of the equation. These 7 blueprints are described below.
> 
> Compute nodes are a type of *resource provider*, since they allow instances to *consume* some portion of their *inventory* of various types of resources. We call these types of resources *"resource classes"*.
> 
> resource-classes bp: https://review.openstack.org/256297
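> 
> To make the quantitative vocabulary concrete, here is a minimal Python sketch of what such a set of standard resource classes could look like (the names below are illustrative, not the exact list defined in the blueprint):
> 
>     class ResourceClass(object):
>         """Hypothetical enumeration of standard resource classes.
> 
>         Each class names one kind of countable resource that a
>         resource provider can expose in its inventory.
>         """
>         VCPU = 'VCPU'
>         MEMORY_MB = 'MEMORY_MB'
>         DISK_GB = 'DISK_GB'
>         PCI_DEVICE = 'PCI_DEVICE'
> 
>         STANDARD = (VCPU, MEMORY_MB, DISK_GB, PCI_DEVICE)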
> 
> The resource-providers blueprint introduces a new set of tables for storing capacity and usage amounts of all resources in the system:
> 
> resource-providers bp: https://review.openstack.org/225546
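> 
> As a rough sketch of the shape of those tables (column names here are illustrative, not the final schema), think of something like the following SQLAlchemy definitions:
> 
>     from sqlalchemy import Column, Integer, MetaData, String, Table
> 
>     metadata = MetaData()
> 
>     # One row per thing that provides resources: a compute node, a
>     # shared storage pool, etc.
>     resource_providers = Table('resource_providers', metadata,
>         Column('id', Integer, primary_key=True),
>         Column('uuid', String(36), nullable=False))
> 
>     # Capacity: how much of each resource class a provider exposes.
>     inventories = Table('inventories', metadata,
>         Column('id', Integer, primary_key=True),
>         Column('resource_provider_id', Integer, nullable=False),
>         Column('resource_class_id', Integer, nullable=False),
>         Column('total', Integer, nullable=False))
> 
>     # Usage: how much of a provider's inventory a given consumer
>     # (e.g. an instance) has been allocated.
>     allocations = Table('allocations', metadata,
>         Column('id', Integer, primary_key=True),
>         Column('resource_provider_id', Integer, nullable=False),
>         Column('consumer_uuid', String(36), nullable=False),
>         Column('resource_class_id', Integer, nullable=False),
>         Column('used', Integer, nullable=False))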
> 
> While all compute nodes are resource providers [1], not all resource providers are compute nodes. *Generic resource pools* are resource providers that have an inventory of a *single resource class* and that provide that resource class to consumers that are placed on multiple compute nodes.
> 
> The canonical example of a generic resource pool is a shared storage system. Currently, a Nova compute node doesn't really know whether the storage location it uses for storing disk images is a shared drive/cluster (à la NFS or RBD) or a local disk drive [2]. The generic-resource-pools blueprint covers the addition of these generic resource pools, their relation to host aggregates, and the RESTful API [3] added to control this external resource pool information.
> 
> generic-resource-pools bp: https://review.openstack.org/253187
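> 
> Purely for illustration (the endpoint, resource names and payloads below are placeholders, not the API the blueprint defines), registering a shared NFS pool and its disk capacity through such an API might look something like:
> 
>     import requests
> 
>     PLACEMENT = 'http://placement.example.com/placement'  # hypothetical
> 
>     # Create a resource pool representing a shared NFS export.
>     pool = requests.post(PLACEMENT + '/resource_pools',
>                          json={'name': 'nfs-share-1'}).json()
> 
>     # Record the total DISK_GB this pool can provide.
>     requests.put(PLACEMENT + '/resource_pools/%s/inventories/DISK_GB'
>                  % pool['uuid'],
>                  json={'total': 100000})
> 
>     # Associate the pool with a host aggregate so it is known which
>     # compute nodes consume from it (again, names are made up).
>     requests.put(PLACEMENT + '/resource_pools/%s/aggregates'
>                  % pool['uuid'],
>                  json=['agg-uuid-1'])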
> 
> Within the Nova database schemas [4], capacity and inventory information is stored in a variety of tables, columns and formats. vCPU, RAM and disk capacity information is stored in integer fields, PCI capacity information is stored in the pci_devices table, NUMA inventory is stored combined with usage information in a JSON blob, etc. The compute-node-inventory blueprint migrates all of this disparate capacity information from compute_nodes into the new inventory table.
> 
> compute-node-inventory bp: https://review.openstack.org/260048
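> 
> The migration itself amounts to fanning each compute node's scalar capacity columns out into one inventory row per resource class. A small sketch of that transformation (field names are illustrative):
> 
>     def inventory_rows_for(compute_node):
>         """Turn a compute node's scattered capacity fields into
>         per-resource-class inventory rows.
>         """
>         return [
>             {'resource_class': 'VCPU',
>              'total': compute_node['vcpus']},
>             {'resource_class': 'MEMORY_MB',
>              'total': compute_node['memory_mb']},
>             {'resource_class': 'DISK_GB',
>              'total': compute_node['local_gb']},
>         ]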
> 
> For the PCI resource classes, Nova currently has an entirely different resource tracker (in /nova/pci/*) that stores an aggregate view of the PCI resources (grouped by product, vendor, and NUMA node) in the compute_nodes.pci_stats field. This information is entirely redundant, since all fine-grained PCI resource information is stored in the pci_devices table, and storing the summary separately presents a sync problem. The pci-generate-stats blueprint describes the effort to remove this stored summary of device pools and instead generate the summary information on the fly for the scheduler. This work is a prerequisite to having all resource classes managed in a unified manner in Nova:
> 
> pci-generate-stats bp: https://review.openstack.org/240852
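> 
> The on-the-fly generation is essentially a GROUP BY over the pci_devices rows. A sketch of the idea (not the actual implementation):
> 
>     from collections import Counter
> 
>     def generate_pci_pools(pci_devices):
>         """Summarize fine-grained PCI device rows into pools keyed by
>         (vendor, product, NUMA node), instead of persisting that
>         summary in compute_nodes.pci_stats.
>         """
>         pools = Counter()
>         for dev in pci_devices:
>             key = (dev['vendor_id'], dev['product_id'], dev['numa_node'])
>             pools[key] += 1
>         return [{'vendor_id': v, 'product_id': p,
>                  'numa_node': n, 'count': c}
>                 for (v, p, n), c in pools.items()]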
> 
> In the same way that capacity fields are scattered among different tables, columns and formats, so too are the fields that store usage information. Some fields are in the instances table, some in the instance_extra table, some information is derived from the pci_devices table, other bits from a JSON blob field. In short, it's an inconsistent mess. This mess means that adding support for additional types of resources typically involves adding yet more inconsistency and conditional logic to the scheduler and nova-compute's resource tracker. The resource-providers-allocations blueprint involves work to migrate all usage record information out of the disparate fields in the current schema and into the allocations table introduced in the resource-providers blueprint:
> 
> resource-providers-allocations bp: https://review.openstack.org/271779
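> 
> To illustrate the end state (values and names below are made up), the usage records for a single instance would become a handful of rows in the allocations table rather than fields scattered across instances, instance_extra and pci_devices:
> 
>     allocations_for_instance = [
>         # 4 VCPUs and 8 GB of RAM consumed from a compute node (provider 1)
>         {'resource_provider_id': 1, 'consumer_uuid': 'instance-uuid',
>          'resource_class': 'VCPU', 'used': 4},
>         {'resource_provider_id': 1, 'consumer_uuid': 'instance-uuid',
>          'resource_class': 'MEMORY_MB', 'used': 8192},
>         # 80 GB of disk consumed from a shared storage pool (provider 2)
>         {'resource_provider_id': 2, 'consumer_uuid': 'instance-uuid',
>          'resource_class': 'DISK_GB', 'used': 80},
>     ]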
> 
> Once all of the inventory (capacity) and allocation (usage) information has been migrated to the database schema described in the resource-providers blueprint, Nova will be treating all types of resources in a generic fashion. The next step is to modify the scheduler to take advantage of this new resource representation. The resource-providers-scheduler blueprint undertakes this important step:
> 
> resource-providers-scheduler bp: https://review.openstack.org/271823
> 
> Chapter 2 - Addressing performance and scale
> ============================================
> 
> One of the significant performance problems with the Nova scheduler is the fact that for every call to the select_destinations() RPC API method -- which itself is called at least once every time a launch or migration request is made -- the scheduler grabs all records for all compute nodes in the deployment. After retrieving all these compute node records, the scheduler runs each through a set of filters to determine which compute nodes have the required capacity to service the instance's requested resources. Having the scheduler continually retrieve every compute node record on each request to select_destinations() is extremely inefficient. The greater the number of compute nodes, the bigger the performance and scale problem this becomes.
> 
> On a loaded cloud deployment -- say there are 1000 compute nodes and 900 of them are fully loaded with active virtual machines -- the scheduler is still going to retrieve all 1000 compute node records on every request to select_destinations() and process each one of those records through all scheduler filters. Clearly, if we could winnow the set of compute node records that are returned by excluding nodes that do not have available capacity, we could dramatically reduce the amount of work that each call to select_destinations() needs to perform.
> 
> The resource-providers-scheduler blueprint attempts to address the above problem by replacing a number of the scheduler filters that currently run *after* the database has returned all compute node records with a series of WHERE clauses and join conditions on the database query itself. The idea here is to winnow the number of returned compute node results as much as possible. The fewer records the scheduler must post-process, the faster the performance of each individual call to select_destinations().
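> 
> As a sketch of that idea (the schema and parameter names are illustrative), a single capacity check pushed down into SQL might look like:
> 
>     # Let the database discard full hosts instead of post-filtering
>     # them in Python: only providers with enough free VCPU remain.
>     CAPACITY_QUERY = """
>         SELECT rp.id
>           FROM resource_providers rp
>           JOIN inventories inv
>             ON inv.resource_provider_id = rp.id
>      LEFT JOIN (SELECT resource_provider_id, resource_class_id,
>                        SUM(used) AS used
>                   FROM allocations
>               GROUP BY resource_provider_id, resource_class_id) alloc
>             ON alloc.resource_provider_id = inv.resource_provider_id
>            AND alloc.resource_class_id = inv.resource_class_id
>          WHERE inv.resource_class_id = :vcpu_class_id
>            AND inv.total - COALESCE(alloc.used, 0) >= :requested_vcpus
>     """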
> 
> The second major scale problem with the current Nova scheduler design has to do with the fact that the scheduler does *not* actually claim resources on a provider. Instead, the scheduler selects a destination host to place the instance on, and the Nova conductor then sends a message to that target host, which attempts to spawn the instance on its hypervisor. If the spawn succeeds, the target compute host updates the Nova database and decrements its count of available resources. These steps (from nova-scheduler to nova-conductor to nova-compute to database) all take a not insignificant amount of time. During this time window, a different scheduler process may pick the exact same target host for a like-sized launch request. If there is only room on the target host for one request of that size [5], one of those spawn requests will fail and trigger a retry operation. This retry operation will attempt to repeat the scheduler placement decision (by calling select_destinations() again).
> 
> This retry operation is relatively expensive and needlessly so: if the scheduler claimed the resources on the target host before sending its pick back to the conductor, then the chance of producing a retry would be almost eliminated [6]. The resource-providers-scheduler blueprint attempts to remedy this second scaling design problem by having the scheduler write records to the allocations table before sending the selected target host back to the Nova conductor.
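> 
> In pseudocode-ish Python (all of the helper names here are hypothetical, just to show the shape of the flow), the scheduler-side claim would look something like:
> 
>     def select_and_claim(context, request_spec):
>         # Candidates are already winnowed by the capacity query above.
>         for host in pick_best_hosts(context, request_spec):
>             try:
>                 # Atomically write allocation rows against the host's
>                 # inventory; this fails if another scheduler process
>                 # claimed the remaining capacity first.
>                 claim_allocations(context, host, request_spec)
>             except CapacityExceeded:
>                 # Lost the race: try the next candidate here, instead
>                 # of bouncing a retry through conductor and compute.
>                 continue
>             return host
>         raise NoValidHost()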
> 
> Conclusions
> ===========
> 
> Thanks if you've made it this far in this epic email. :) If you have questions about the plans, please do feel free to respond here or come find us on Freenode #openstack-nova IRC. Your reviews and comments are also very welcome on the specs and patches.
> 
> Best,
> -jay
> 
> [1] One might argue that nova-compute daemons that proxy for some other resource manager like vCenter or Ironic are not actually resource providers, but just go with me on this one...
> 
> [2] This results in a number of resource reporting bugs, including Nova reporting that the deployment has X times as much disk capacity as it really does (where X is the number of compute nodes sharing the same storage location).
> 
> [3] The RESTful API in the generic-resource-pools blueprint actually will be a completely new REST endpoint and service (/placement) that will be the start of the new extracted scheduler.
> 
> [4] Nova has two database schemas. The first is what is known as the Child Cell database and contains the majority of database tables. The second is known as the API database and contains global and top-level routing tables.
> 
> [5] This situation is more common than you might originally think. Any cloud that runs a pack-first placement strategy with multiple scheduler daemon processes will suffer from this problem.
> 
> [6] Technically, it cannot be eliminated entirely because an out-of-band operation could theoretically occur (for example, an administrator could manually -- not through Nova -- launch a virtual machine on the target host) and therefore introduce some unaccounted-for amount of used resources during the small window between runs of the nova-compute periodic audit task.
> 

Seeing the objects changes come in for this has made me feel like I should be helping out with the review load. But without being at the midcycle, it felt like I didn’t know what was going on with these because “you had to be there”. This summary helps me follow the “why” behind these changes, and the well-structured explanation helped me figure out the ordering/purpose of the blob of specs that went in. Though I’m guessing I’ll still have a bunch of questions on this stuff when I’m reviewing it, I at least know more than I did before. Thanks a bunch for this, Jay!

-----
Thanks,

Ryan Rossiter (rlrossit)



