[ops][nova][placement] NUMA topology vs non-NUMA workloads
This message is primarily addressed at operators, and of those, operators who are interested in effectively managing and mixing workloads that care about NUMA with workloads that do not. There are some questions within, after some background to explain the issue.

At the PTG, Nova and Placement developers made a commitment to manage NUMA topologies more effectively within Nova and Placement. On the placement side this resulted in a spec proposing several features that would enable more expressive queries when requesting allocation candidates (places for workloads to go), resulting in fewer late scheduling failures.

At first there was one spec that discussed all the features. This morning it was split in two because one of the features is proving hard to resolve. The two specs can be found at:

* https://review.opendev.org/658510 (has all the original discussion)
* https://review.opendev.org/662191 (the less contentious features split out)

After much discussion, we would prefer not to do the feature discussed in 658510. Called 'can_split', it would allow specified classes of resource (notably VCPU and memory) to be split across multiple NUMA nodes when each node can only contribute a portion of the required resources and where those resources are modelled as inventory on the NUMA nodes, not the host at large.

While this is a good idea in principle, it turns out (see the spec) to cause many issues that require changes throughout the ecosystem, for example enforcing pinned CPUs for workloads that would normally float. It is possible to make the changes, but it would require additional contributors to join the effort, both in terms of writing the code and understanding the many issues.

So the questions:

* How important, in your cloud, is it to co-locate guests needing a NUMA topology with guests that do not? A review of documentation (upstream and vendor) shows differing levels of recommendation on this, but in many cases the recommendation is not to do it.

* If your answer to the above is "we must be able to do that": How important is it that your cloud be able to pack workloads as tightly as possible? That is: if there are two NUMA nodes and each has 2 VCPUs free, should a non-NUMA workload demanding 4 VCPUs be able to land there? Or would you prefer that not happen? (See the sketch following this message.)

* If the answer to the first question is "we can get by without that": is it satisfactory to be able to configure some hosts as NUMA aware and others as not, as described in the "NUMA topology with RPs" spec [1]? In this setup some non-NUMA workloads could end up on a NUMA host (unless otherwise excluded by traits or aggregates), but only when there was contiguous resource available.

This latter question articulates the current plan, unless responses to this message indicate it simply can't work or legions of assistance show up. Note that even if we don't do can_split, we'll still be enabling significant progress with the other features described in the second spec [2].

Thanks for your help in moving us in the right direction.

[1] https://review.opendev.org/552924
[2] https://review.opendev.org/662191

--
Chris Dent
٩◔̯◔۶ https://anticdent.org/
freenode: cdent
Chris,

From the CERN setup, I think there are dedicated cells for NUMA-optimised configurations (but maybe one of the engineers on the team could confirm to be sure).

Q: How important, in your cloud, is it to co-locate guests needing a NUMA topology with guests that do not? A review of documentation (upstream and vendor) shows differing levels of recommendation on this, but in many cases the recommendation is not to do it.
A: No co-location currently.

Q: If your answer to the above is "we must be able to do that": How important is it that your cloud be able to pack workloads as tightly as possible? That is: if there are two NUMA nodes and each has 2 VCPUs free, should a non-NUMA workload demanding 4 VCPUs be able to land there? Or would you prefer that not happen?
A: Not applicable.

Q: If the answer to the first question is "we can get by without that": is it satisfactory to be able to configure some hosts as NUMA aware and others as not, as described in the "NUMA topology with RPs" spec [1]? In this setup some non-NUMA workloads could end up on a NUMA host (unless otherwise excluded by traits or aggregates), but only when there was contiguous resource available.
A: I think this would be OK.

Tim
On 31.05.19 09:51, Tim Bell wrote:
> Chris,
>
> From the CERN setup, I think there are dedicated cells for NUMA-optimised configurations (but maybe one of the engineers on the team could confirm to be sure).
This is correct: we have dedicated cells for NUMA-aware guests (and hence do not mix NUMA-aware and NUMA-unaware guests on the same set of hosts).

Arne