[placement][nova][ptg] resource provider affinity

Alex Xu soulxu at gmail.com
Mon Apr 29 03:52:25 UTC 2019


Nadathur, Sundar <sundar.nadathur at intel.com> wrote on Sat, Apr 27, 2019 at 12:29 PM:

> Hi Jay and Alex,
>     Thanks for the response. Please see below.
>
> Regards,
> Sundar
>
> > -----Original Message-----
> > From: Jay Pipes <jaypipes at gmail.com>
> > Sent: Saturday, April 27, 2019 8:52 AM
> > To: openstack-discuss at lists.openstack.org
> > Subject: Re: [placement][nova][ptg] resource provider affinity
> >
> > On 04/26/2019 08:49 PM, Alex Xu wrote:
> > > Nadathur, Sundar <sundar.nadathur at intel.com> wrote:
> > >     Anyways, for Cyborg, it seems to me that there is a fairly
> > >     straightforward scheme to address NUMA affinity: annotate the
> > >     device’s nested RP with a trait indicating which NUMA node it
> > >     belongs to (e.g. CUSTOM_NUMA_NODE_0), and use that to guide
> > >     scheduling. This should be a valid use of traits because it
> > >     expresses a property of the resource provider and is used for
> > >     scheduling (only).
> > >
> > >
> > > I don't like the way of using trait to mark out the NUMA node.
> >
> > Me neither. Traits are capabilities, not indicators of the relationship
> > between one provider and another.
> >
> > The structure of hierarchical resource providers is what provides
> > topology information -- i.e. about how providers are related to each
> > other within a tree organization, and this is what is appropriate for
> > encoding NUMA topology information into placement.
> >
> > The request should never ask for "NUMA Node 0". The reason is because
> > the request shouldn't require that the user understand where the
> > resources are.
>
> I agree with this for most use cases. However, there are specific cases
> where a reference architecture is laid out for a specific workload, which
> requires a specific number of VMs to be placed in each NUMA node, with
> specific number of devices (NICs or accelerators) assigned to them. The
> network bandwidth, computation load, etc. are all pre-calculated to fit the
> VM's size and device characteristics. Any departure from that may affect
> workload performance -- throughput, latency or jitter. However, if the
> request says, 'Give me a VM on _a_ NUMA node, I don't care which one', one
> may wind up with say 3 VMs on one NUMA node and 1 VM on the other, which is
> not the intended outcome.
>

We can control the number of shared/dedicated vCPUs available on each NUMA
node, so operators already know how many VMs can land on each NUMA node.
Packing multiple VMs onto the same NUMA node is a different problem: that is
affinity between VMs.
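
For illustration only (not from the original thread): the per-NUMA-node
accounting described above is visible from the flavor side through the
existing hw:numa_* and hw:cpu_policy extra specs. The flavor layout and
values below are invented as a minimal sketch.

    # Illustrative only: existing Nova flavor extra specs that fix how a
    # guest's pinned vCPUs and memory are split across guest NUMA nodes.
    # The values are made up for this sketch.
    numa_flavor_extra_specs = {
        "hw:cpu_policy": "dedicated",  # pin each guest vCPU to a host pCPU
        "hw:numa_nodes": "2",          # expose two guest NUMA nodes
        "hw:numa_cpus.0": "0,1",       # guest vCPUs 0-1 on guest NUMA node 0
        "hw:numa_cpus.1": "2,3",       # guest vCPUs 2-3 on guest NUMA node 1
        "hw:numa_mem.0": "2048",       # MiB of RAM on guest NUMA node 0
        "hw:numa_mem.1": "2048",       # MiB of RAM on guest NUMA node 1
    }

    # Normally applied with: openstack flavor set --property KEY=VALUE FLAVOR
    for key, value in numa_flavor_extra_specs.items():
        print(f"--property {key}={value}")

With the flavor's per-node footprint fixed like this, the number of such VMs
that fit on each host NUMA node follows from the dedicated pCPUs and memory
available on that node.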


>
> One could argue that we should model all resources, such as PCIe
> lanes/bandwidth from a socket (not the same as NUMA node), to the point
> where we can influence the exact placement among NUMA nodes. This has
> several issues, IMHO:
> * This is more tied to the hardware details.
> * Many of these resources are not dedicated or partitionable among VMs,
> e.g. PCIe lanes. I don’t see how we can track and count them in Placement
> on a per-VM basis.
> * It is more complex for both developers and operators.
>

I don't think that is the purpose of modeling the socket in placement.
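
To make the topology point concrete, here is an invented sketch of the kind
of nested provider tree the thread is discussing; the provider names,
inventories, and the FPGA child are illustrative, and ACCELERATOR_CONTEXT is
the example resource class used later in the quoted message.

    # Invented example of a nested resource provider tree: the NUMA topology
    # is expressed by parent/child relationships, not by traits such as
    # CUSTOM_NUMA_NODE_0, and nothing here counts PCIe lanes.
    provider_tree = {
        "name": "compute-node-1",
        "children": [
            {
                "name": "compute-node-1_NUMA0",
                "inventory": {"PCPU": 16, "MEMORY_MB": 65536},
                "children": [
                    {
                        "name": "fpga-0",
                        "inventory": {"ACCELERATOR_CONTEXT": 4},
                        "traits": ["CUSTOM_BITSTREAM_CRYPTO_4AC1"],
                    },
                ],
            },
            {
                "name": "compute-node-1_NUMA1",
                "inventory": {"PCPU": 16, "MEMORY_MB": 65536},
                "children": [],
            },
        ],
    }

The NUMA relationship is carried entirely by the tree structure; the purpose
of modeling NUMA nodes (or sockets) is to encode that structure, not to track
every piece of hardware.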


> In this situation, the operator is willing (in my understanding) to phrase
> the request precisely to get the exact desired layout.
>
> > It shouldn't matter *which* NUMA node a particular device that is
> > providing some resources is affined to. The only thing that matters to a
> > *request* is that the user is able to describe the nature of the
> > affinity.
> >
> > I propose using a "group_policy=same_tree:$GROUP_A:$GROUP_B" query
> > parameter for enabling users to describe the affinity constraints for
> > various resources involved in different RequestGroups in the request
> > spec.
> >
> > group_policy=same_tree:$A:$B would mean "ensure that the providers that
> > match the constraints of request group $B are in the same inclusive
> > tree that matched for request group $A"
>
> Request groups from Neutron and Cyborg do not have any inherent group
> numbers; Nova assigns those group numbers before submission to Placement.
> So, the GET /a-c call could technically have such numbers, but how would
> Neutron or Cyborg express that affinity?
>

Yes, that is what I'm saying: it should be expressed in the nova flavor.
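
A minimal sketch of what that could look like, assuming the granular
resources<N>:/trait<N>: extra-spec syntax Nova already supports; the
same_tree value is only the proposal from this thread, not an existing
group_policy option.

    # Granular request groups can already be written as flavor extra specs.
    # "same_tree:1:2" is only the syntax proposed in this thread -- it is
    # not accepted by current Nova or placement.
    proposed_flavor_extra_specs = {
        "resources1:PCPU": "2",
        "resources1:MEMORY_MB": "4096",
        "resources2:ACCELERATOR_CONTEXT": "1",
        "trait2:CUSTOM_BITSTREAM_CRYPTO_4AC1": "required",
        "group_policy": "same_tree:1:2",  # proposed value, hypothetical
    }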


>
> > So, let's say you have a flavor that will consume:
> >
> >   2 dedicated host CPU processors
> >   4GB RAM
> >   1 context/handle for an accelerator running a crypto algorithm
> >
> > Further, you want to ensure that the provider tree that is providing
> > those dedicated CPUs and RAM will also provide the accelerator context
> > -- in other words, you are requesting a low level of latency between the
> > memory and the accelerator device itself.
> >
> > The above request to GET /a_c would look like this:
> >
> >   GET /a_c?
> >     resources1=PCPU:2,MEMORY_MB:4096&
> >     resources2=ACCELERATOR_CONTEXT:1&
> >     required2=CUSTOM_BITSTREAM_CRYPTO_4AC1&
> >     group_policy=same_tree:1:2
> >
> > which would mean, in English, "get me an accelerator context from an FPGA
> > that has been flashed with the 4AC1 crypto bitstream and is affined to
> > the NUMA node that is providing 4G of main memory and 2 dedicated host
> > processors".
> >
> > Best,
> > -jay
>
>
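
For reference, a minimal sketch of how the quoted GET /a_c request could be
encoded; it only builds the query string locally, and group_policy=same_tree
is still just the proposal under discussion, not part of the current
placement API.

    from urllib.parse import urlencode

    # Build the query string for the GET /allocation_candidates example
    # quoted above. The same_tree group_policy value is hypothetical.
    params = [
        ("resources1", "PCPU:2,MEMORY_MB:4096"),
        ("resources2", "ACCELERATOR_CONTEXT:1"),
        ("required2", "CUSTOM_BITSTREAM_CRYPTO_4AC1"),
        ("group_policy", "same_tree:1:2"),
    ]
    query = urlencode(params, safe=":,")
    print(f"GET /allocation_candidates?{query}")
    # In a real deployment this would be sent through an authenticated
    # session with an OpenStack-API-Version: placement <microversion> header.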