[openstack-dev] [nova] RFC for Intel RDT/CAT Support in Nova for Virtual Machine QoS

Jay Pipes jaypipes at gmail.com
Wed Feb 22 16:17:37 UTC 2017


Hi Eli,

Sorry for top-posting. Just a quick note to say I had a good 
conversation on Monday about this with Sean Mooney. I think we have some 
ideas on how to model all of these resources in the new 
placement/resource providers schema.

Are you at the PTG? If so, would be great to meet up to discuss...

Best,
-jay

On 02/21/2017 05:38 AM, Qiao, Liyong wrote:
> Hi folks:
>
>
>
> Seeking community input on an initial design for Intel Resource Director
> Technology (RDT), in particular for Cache Allocation Technology in
> OpenStack Nova to protect workloads from co-resident noisy neighbors, to
> ensure quality of service (QoS).
>
>
>
> 1. What is Cache Allocation Technology (CAT)?
>
> Intel's RDT (Resource Director Technology) [1] is an umbrella of
> *hardware* support to facilitate the monitoring and reservation of
> shared resources such as cache, memory and network bandwidth, towards
> obtaining Quality of Service. RDT enables fine-grained control of
> resources, which is particularly valuable in cloud environments to meet
> Service Level Agreements while increasing resource utilization through
> sharing. CAT is the part of RDT concerned with reserving a portion of
> the last level cache for a process or group of processes, with further
> fine-grained control over how much is used for code versus data.
> Consider a single processor composed of 4 cores and its cache
> hierarchy: the L1 cache is split into Instruction and Data, the L2
> cache is next in speed to L1, and the L1 and L2 caches are per core.
> The Last Level Cache (LLC) is shared among all cores. With CAT on the
> currently available hardware, the LLC can be partitioned on a per
> process (virtual machine, container, or normal application) or process
> group basis.
>
>
>
> Libvirt and OpenStack [2] already support monitoring cache usage (CMT),
> memory bandwidth usage local to a processor socket (MBM_local), and
> total memory bandwidth usage across all processor sockets (MBM_total)
> for a process or process group.
>
>
>
>
> 2. How CAT works
>
> To learn more about CAT please refer to the Intel Software Developer's
> Manual
> <http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html>,
>  volume 3B, sections 17.16 and 17.17 [3]. Linux kernel support for the
> same is expected in release 4.10 and is documented at [4].
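>
> As a quick, illustrative check (not part of the proposal), CAT/CDP
> availability on a Linux 4.10+ host can be read from the CPU feature
> flags the kernel exposes in /proc/cpuinfo:
>
> def host_cat_support():
>     """Report L3 CAT/CDP support from /proc/cpuinfo feature flags."""
>     with open('/proc/cpuinfo') as f:
>         flags = set()
>         for line in f:
>             if line.startswith('flags'):
>                 flags.update(line.split(':', 1)[1].split())
>                 break
>     return {'l3_cat': 'cat_l3' in flags, 'l3_cdp': 'cdp_l3' in flags}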
>
>
> 3. Libvirt Interface
>
>
> Libvirt support for CAT is underway, with the patch series currently at
> revision 7.
>
>
>
> Interface changes of libvirt:
>
>
>
> 3.1 The capabilities XML has been extended to reveal cache information
>
>
>
> <cache>
>      <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>
>        <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
>      </bank>
>      <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>
>        <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
>      </bank>
> </cache>
>
>
>
> The new `cache` XML element shows that the host has two *banks* of
> *type* l3, i.e. Last Level Cache (LLC), one per processor socket. Each
> bank's *size* is 56320 KiB, and the *cpus* attribute indicates the
> physical CPUs associated with it, here '0-21,44-65' for bank 0 and
> '22-43,66-87' for bank 1.
>
>
>
> The *control* tag shows that the bank can be allocated at scope L3,
> with a minimum possible allocation of 2816 KiB and 2816 KiB still
> remaining to be reserved.
>
>
>
> If the host has CDP (Code and Data Prioritization) enabled, the L3
> cache will be divided into code (L3CODE) and data (L3DATA).
>
>
>
> The control tag will then be extended to:
>
> ...
>
>  <control min='2816' reserved='2816' unit='KiB' scope='L3CODE'/>
>  <control min='2816' reserved='2816' unit='KiB' scope='L3DATA'/>
>
>
> The L3CODE and L3DATA scopes show that we can allocate cache for code
> and data usage respectively; they share the same amount of L3 cache.
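>
> For illustration, this is roughly how a client such as the Nova libvirt
> driver could pull the cache banks out of the extended capabilities XML
> shown above (a minimal Python sketch; the function name and the
> returned structure are not part of the libvirt patch):
>
> import xml.etree.ElementTree as ET
>
> def parse_cache_banks_xml(capabilities_xml):
>     """Extract the cache banks from the extended capabilities XML."""
>     banks = []
>     root = ET.fromstring(capabilities_xml)
>     for bank in root.iter('bank'):
>         banks.append({
>             'id': int(bank.get('id')),
>             'type': bank.get('type'),
>             'size_kib': int(bank.get('size')),
>             'cpus': bank.get('cpus'),
>             # one entry per <control>: scope L3, or L3CODE/L3DATA
>             'controls': [dict(c.attrib) for c in bank.findall('control')],
>         })
>     return banks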
>
>
>
> 3.2 The domain XML is extended to include a new cachetune element
>
>
>
> <cputune>
>    <vcpupin vcpu='0' cpuset='0'/>
>    <vcpupin vcpu='1' cpuset='1'/>
>    <vcpupin vcpu='2' cpuset='22'/>
>    <vcpupin vcpu='3' cpuset='23'/>
>    <cachetune id='0' host_id='0' type='l3' size='2816' unit='KiB'
>               vcpus='0,1'/>
>    <cachetune id='1' host_id='1' type='l3' size='2816' unit='KiB'
>               vcpus='2,3'/>
>    ...
> </cputune>
>
>
>
> This means the guest will have vcpus 0, 1 running on the host's socket
> 0 with 2816 KiB of cache exclusively allocated to them, and vcpus 2, 3
> running on the host's socket 1 with 2816 KiB of cache exclusively
> allocated to them.
>
>
>
> Here we need to make sure vcpus 0, 1 are pinned to the pcpus of socket
> 0; refer to the capabilities:
>
>  <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>
>
>
>
> Here we need to make sure vcpus 2, 3 are pinned to the pcpus of socket
> 1; refer to the capabilities:
>
>  <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>
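>
> As an illustration of that constraint, a small sketch that checks the
> vcpupin placement against a bank's cpus range (function names and
> structures are illustrative only, not part of the libvirt patch):
>
> def expand_cpus(spec):
>     """Expand a cpus string such as '0-21,44-65' into a set of ints."""
>     cpus = set()
>     for part in spec.split(','):
>         if '-' in part:
>             lo, hi = part.split('-')
>             cpus.update(range(int(lo), int(hi) + 1))
>         else:
>             cpus.add(int(part))
>     return cpus
>
> def pinning_matches_bank(vcpu_pins, cachetune_vcpus, bank_cpus):
>     """vcpu_pins maps vcpu -> pcpu; the other two are strings as above."""
>     bank = expand_cpus(bank_cpus)
>     return all(vcpu_pins[v] in bank for v in expand_cpus(cachetune_vcpus))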
>
>
>
> 3.3 Libvirt workflow for CAT
>
>
>
>  1. Create the qemu process and get its PIDs.
>  2. Define a new resource control domain, also known as a Class of
>     Service (CLOS), under /sys/fs/resctrl, and set the desired Cache
>     Bit Mask (CBM) from the cachetune settings in the libvirt domain
>     XML, in addition to updating the default schemata of the host (a
>     rough sketch of the resctrl interaction follows).
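>
> A minimal sketch of the resctrl interaction in step 2, following the
> interface documented in [4] (the helper name and its arguments are
> illustrative, not libvirt code):
>
> import os
>
> RESCTRL = "/sys/fs/resctrl"
>
> def create_clos(name, l3_masks, pids):
>     """Create a resource control group (CLOS) and move tasks into it.
>
>     l3_masks maps a cache (bank) id to a Cache Bit Mask, e.g. {0: 0xf}.
>     """
>     group = os.path.join(RESCTRL, name)
>     os.makedirs(group, exist_ok=True)
>
>     # One line per resource in the schemata file, e.g. "L3:0=f;1=ff".
>     schemata = "L3:" + ";".join("%d=%x" % (i, m)
>                                 for i, m in sorted(l3_masks.items()))
>     with open(os.path.join(group, "schemata"), "w") as f:
>         f.write(schemata + "\n")
>
>     # Writing a PID (e.g. a qemu vCPU thread) to "tasks" binds it to
>     # this CLOS and removes it from its previous group.
>     for pid in pids:
>         with open(os.path.join(group, "tasks"), "w") as f:
>             f.write("%d\n" % pid)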
>
>
>
> 4. Proposed Nova Changes
>
>
>
>  1. Get host capabilities from libvirt and extend the compute node
>     fields.
>  2. Add a new scheduler filter and weigher to help select a host for
>     the requested guest (a sketch of the filtering check appears at the
>     end of this section).
>  3. Extend the flavor's (and image metadata) extra spec fields:
>
>
>
> We need to specify NUMA settings for NUMA hosts if we want to enable
> CAT; see [5] to learn more about NUMA.
>
> In flavor, we can have:
>
>
>
> vcpus=8
> mem=4
> hw:numa_nodes=2          # number of NUMA nodes to expose to the guest
> hw:numa_cpus.0=0,1,2,3,4,5
> hw:numa_cpus.1=6,7
> hw:numa_mem.0=3072
> hw:numa_mem.1=1024
>
> # new additions in this proposal
> hw:cache_banks=2         # cache banks to be allocated to the guest (can
>                          # be less than the number of NUMA nodes)
> hw:cache_type.0=l3       # cache bank type: l3, or l3 code + data (CDP)
> hw:cache_type.1=l3_c+d
> hw:cache_vcpus.0=0,1     # vcpu list for the cache bank, may be omitted
> hw:cache_vcpus.1=6,7
> hw:cache_l3.0=2816       # cache size in KiB
> hw:cache_l3_code.1=2816
> hw:cache_l3_data.1=2816
>
>
>
> Here, the user can be explicit about which vcpus benefit from cache
> allocation. A cache bank works together with a NUMA cell: the cache is
> allocated on a physical CPU socket, but the cache bank itself is a
> logical concept. A cache bank allocates cache for a list of vcpus, and
> all vcpus in that list should be grouped on the same NUMA cell.
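>
> For illustration only, the per-bank extra specs above could be grouped
> like this on the Nova side (the helper and the returned structure are
> hypothetical; only the hw:cache_* keys come from this proposal):
>
> def parse_cache_banks(extra_specs):
>     """Group the hw:cache_* extra specs into per-bank requests."""
>     banks = {}
>     for i in range(int(extra_specs.get('hw:cache_banks', 0))):
>         banks[i] = {
>             'type': extra_specs.get('hw:cache_type.%d' % i),
>             'vcpus': extra_specs.get('hw:cache_vcpus.%d' % i),
>             'l3': extra_specs.get('hw:cache_l3.%d' % i),
>             'l3_code': extra_specs.get('hw:cache_l3_code.%d' % i),
>             'l3_data': extra_specs.get('hw:cache_l3_data.%d' % i),
>         }
>     return banks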
>
>
>
> In addition, the <cachetune> elements are generated in the libvirt
> domain XML; see 3.2 for details.
>
>
>
> This will allocate 2 cache banks from the host's cache banks and
> associate vcpus with them.
>
> In the example, the guest will have vcpus 0, 1 running on socket 0 of
> the host with 2816 KiB of cache for exclusive use, and vcpus 6, 7
> running on socket 1 of the host with 2816 KiB of L3 code cache and
> 2816 KiB of L3 data cache allocated.
>
>
>
> If a NUMA Cell were to contain multiple CPU sockets (this is rare), then
> we will adjust NUMA vCPU placement policy, to ensure that vCPUs and the
> cache allocated to them are all co-located on the same socket.
>
>
>
>   * We can define fewer cache banks than NUMA cells on a node with
>     multiple NUMA cells.
>   * No cache_vcpus parameter needs to be specified if no reservation is
>     desired.
>
>
>
> NOTE: the cache allocation for a guest is in isolated/exclusive mode.
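>
> A very rough sketch of the check the new scheduler filter could apply
> (the per-bank free-cache figure for the host is an assumption of this
> sketch, not an existing Nova attribute):
>
> def host_has_enough_cache(host_free_cache_kib, requested_banks):
>     """Return True if the host can satisfy every requested cache bank.
>
>     host_free_cache_kib: free cache per host bank, e.g. {0: 56320, 1: 56320}.
>     requested_banks: per-bank requests, e.g. from parse_cache_banks() above.
>     """
>     remaining = dict(host_free_cache_kib)
>     for bank in requested_banks.values():
>         need = sum(int(bank.get(k) or 0)
>                    for k in ('l3', 'l3_code', 'l3_data'))
>         # Greedily place the request on any host bank with enough room.
>         fit = next((b for b, free in remaining.items() if free >= need),
>                    None)
>         if fit is None:
>             return False
>         remaining[fit] -= need
>     return True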
>
>
>
> References
>
>
>
> [1]
> http://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html
>
> [2] https://blueprints.launchpad.net/nova/+spec/support-perf-event
>
> [3]
> http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
>
> [4]
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/tree/Documentation/x86/intel_rdt_ui.txt?h=x86/cache
>
>
> [5]
> https://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-numa-placement.html
>
>
>
>
>
>
> Best Regards
>
>
>
> Eli Qiao (乔立勇), OpenStack Core team, OTC Intel.
>
> --
>
>
>
>
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
