[placement][nova][ptg] resource provider affinity

Nadathur, Sundar sundar.nadathur at intel.com
Sat Apr 27 18:26:59 UTC 2019

Hi Jay and Alex,
    Thanks for the response. Please see below.


> -----Original Message-----
> From: Jay Pipes <jaypipes at gmail.com>
> Sent: Saturday, April 27, 2019 8:52 AM
> To: openstack-discuss at lists.openstack.org
> Subject: Re: [placement][nova][ptg] resource provider affinity
> On 04/26/2019 08:49 PM, Alex Xu wrote:
> > Nadathur, Sundar <sundar.nadathur at intel.com
> >     Anyways, for Cyborg, it seems to me that there is a fairly
> >     straightforward scheme to address NUMA affinity: annotate the
> >     device’s nested RP with a trait indicating which NUMA node it
> >     belongs to (e.g. CUSTOM_NUMA_NODE_0), and use that to guide
> >     scheduling. This should be a valid use of traits because it
> >     expresses a property of the resource provider and is used for
> >     scheduling (only).
> >
> >
> > I don't like the way of using trait to mark out the NUMA node.
> Me neither. Traits are capabilities, not indicators of the relationship between
> one provider and another.
> The structure of hierarchical resource providers is what provides topology
> information -- i.e. about how providers are related to each other within a tree
> organization, and this is what is appropriate for encoding NUMA topology
> information into placement.
> The request should never ask for "NUMA Node 0". The reason is because the
> request shouldn't require that the user understand where the resources are.

I agree with this for most use cases. However, there are specific cases where a reference architecture is laid out for a specific workload, requiring a specific number of VMs to be placed in each NUMA node, with a specific number of devices (NICs or accelerators) assigned to each. The network bandwidth, computation load, etc. are all pre-calculated to fit the VM's size and the device characteristics. Any departure from that may affect workload performance -- throughput, latency or jitter. If the request instead says, 'Give me a VM on _a_ NUMA node, I don't care which one', one may wind up with, say, 3 VMs on one NUMA node and 1 VM on the other, which is not the intended outcome.

One could argue that we should model all resources, such as PCIe lanes/bandwidth from a socket (not the same as NUMA node), to the point where we can influence the exact placement among NUMA nodes. This has several issues, IMHO:
* This is more tied to the hardware details.
* Many of these resources are not dedicated to, or partitionable among, VMs, e.g. PCIe lanes. I don’t see how we can track and count them in Placement on a per-VM basis.
* It is more complex for both developers and operators.

In this situation, the operator is willing (in my understanding) to phrase the request precisely to get the exact desired layout.
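To make the trait-based approach above concrete, here is a minimal sketch of how such a precise request could be phrased. The `CUSTOM_NUMA_NODE_<n>` trait and the helper function are illustrative assumptions, not an existing Nova or Cyborg API; the operator would first have to create the trait and attach it to the device's nested resource provider.

```python
def numa_pinned_query(resources, numa_node):
    # Hypothetical sketch: pin a request to a specific NUMA node by
    # requiring a custom trait (CUSTOM_NUMA_NODE_<n>) that the operator
    # has placed on the device's nested resource provider beforehand.
    # Resource classes are rendered in the Placement "RC:amount" form.
    res = ",".join(f"{rc}:{amt}" for rc, amt in sorted(resources.items()))
    return (f"/allocation_candidates?resources={res}"
            f"&required=CUSTOM_NUMA_NODE_{numa_node}")

print(numa_pinned_query({"PCPU": 2, "MEMORY_MB": 4096}, 0))
# -> /allocation_candidates?resources=MEMORY_MB:4096,PCPU:2&required=CUSTOM_NUMA_NODE_0
```

The point is only that the operator, not the scheduler, decides which node each VM lands on, at the cost of baking hardware knowledge into the request.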

> It shouldn't matter *which* NUMA node a particular device that is providing
> some resources is affined to. The only thing that matters to a
> *request* is that the user is able to describe the nature of the affinity.
> I propose using a "group_policy=same_tree:$GROUP_A:$GROUP_B" query
> parameter for enabling users to describe the affinity constraints for various
> resources involved in different RequestGroups in the request spec.
> group_policy=same_tree:$A:$B would mean "ensure that the providers that
> match the constraints of request group $B are in the same inclusive tree that
> matched for request group $A"

Request groups from Neutron and Cyborg do not have any inherent group numbers; Nova assigns those numbers before submitting the request to Placement. So the GET /a_c call could technically carry such numbers, but how would Neutron or Cyborg express that affinity?
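To illustrate the renumbering problem: Nova merges request groups from several sources (the flavor, Neutron ports, Cyborg device profiles) and assigns the numeric suffixes itself, so an affinity constraint authored by Cyborg cannot name the final group numbers ahead of time. The sketch below assumes one possible resolution, in which each source uses local labels that Nova remaps; the function and field names are hypothetical, not existing Nova code.

```python
def merge_groups(sources):
    # sources: list of (origin, [group dicts with a local 'label' key]).
    # Nova-style merging: assign numeric suffixes in arrival order and
    # remember which (origin, label) pair got which number.
    merged, mapping = {}, {}
    n = 1
    for origin, groups in sources:
        for g in groups:
            merged[n] = g
            mapping[(origin, g["label"])] = n
            n += 1
    return merged, mapping

flavor = [{"label": "cpu_ram", "resources": "PCPU:2,MEMORY_MB:4096"}]
cyborg = [{"label": "accel", "resources": "CUSTOM_ACCELERATOR_CONTEXT:1"}]
merged, mapping = merge_groups([("flavor", flavor), ("cyborg", cyborg)])

# A constraint written against local labels can now be rewritten
# against the final numeric suffixes:
policy = (f"same_tree:{mapping[('flavor', 'cpu_ram')]}"
          f":{mapping[('cyborg', 'accel')]}")
print(policy)  # -> same_tree:1:2
```

Without some label-to-number remapping step like this, an external service has no stable way to refer to another service's request group.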
> So, let's say you have a flavor that will consume:
>   2 dedicated host CPU processors
>   4GB RAM
>   1 context/handle for an accelerator running a crypto algorithm
> Further, you want to ensure that the provider tree that is providing those
> dedicated CPUs and RAM will also provide the accelerator context
> -- in other words, you are requesting a low level of latency between the
> memory and the accelerator device itself.
> The above request to GET /a_c would look like this:
>   GET /a_c?
>     resources1=PCPU:2&
>     resources1=MEMORY_MB=4096&
>     resources2=ACCELERATOR_CONTEXT&
>     group_policy=same_tree:1:2
> which would mean, in English, "get me an accelerator context from an FPGA
> that has been flashed with the 4AC1 crypto bitstream and is affined to the
> NUMA node that is providing 4G of main memory and 2 dedicated host
> processors".
> Best,
> -jay
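Jay's proposed request can be sketched as a small query builder. Note that `group_policy=same_tree:$A:$B` is proposed syntax from this thread, not part of the current Placement API, and the helper below is only an illustration of how the pieces would compose.

```python
def same_tree_request(groups, a, b):
    # Sketch of the proposed syntax: each numbered request group keeps its
    # own resources, and group_policy ties the providers matching group b
    # to the same provider tree that satisfied group a.
    parts = [f"resources{suffix}={res}"
             for suffix, res in sorted(groups.items())]
    parts.append(f"group_policy=same_tree:{a}:{b}")
    return "GET /a_c?" + "&".join(parts)

req = same_tree_request(
    {1: "PCPU:2,MEMORY_MB:4096", 2: "ACCELERATOR_CONTEXT:1"}, 1, 2)
print(req)
# -> GET /a_c?resources1=PCPU:2,MEMORY_MB:4096&resources2=ACCELERATOR_CONTEXT:1&group_policy=same_tree:1:2
```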
