[openstack-dev] [nova] RFC for Intel RDT/CAT Support in Nova for Virtual Machine QoS

Qiao, Liyong liyong.qiao at intel.com
Tue Feb 21 10:38:32 UTC 2017


Hi folks:

Seeking community input on an initial design for supporting Intel Resource Director Technology (RDT), in particular Cache Allocation Technology (CAT), in OpenStack Nova, in order to protect workloads from co-resident noisy neighbors and ensure quality of service (QoS).

1. What is Cache Allocation Technology (CAT)?
Intel's Resource Director Technology (RDT) [1] is an umbrella of hardware support to facilitate the monitoring and reservation of shared resources such as cache, memory and network bandwidth, towards obtaining quality of service. RDT enables fine-grained control of resources, which is particularly valuable in cloud environments for meeting Service Level Agreements while increasing resource utilization through sharing. CAT is the part of RDT concerned with reserving a portion of the last level cache for a process or group of processes, with further fine-grained control over how much is used for code versus data.

As an illustration, consider a single processor composed of 4 cores and its cache hierarchy. The L1 cache is split into instruction and data caches, and the L2 cache is next in speed to L1; the L1 and L2 caches are per core. The Last Level Cache (LLC) is shared among all cores. With CAT on currently available hardware, the LLC can be partitioned on a per-process (virtual machine, container, or normal application) or per-process-group basis.


Libvirt and OpenStack [2] already support monitoring of cache occupancy (CMT), memory bandwidth usage local to a processor socket (MBM_local), and total memory bandwidth usage across all processor sockets (MBM_total) for a process or process group.
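For reference, that existing monitoring support can be exercised roughly as follows on capable hosts; the domain name below is hypothetical and the event names follow the libvirt perf API (adjust for your libvirt version):

  virsh perf myguest --enable cmt,mbml,mbmt
  virsh domstats myguest --perf     # reports perf.cmt, perf.mbml and perf.mbmt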


2. How CAT works
To learn more about CAT please refer to the Intel Software Developer's Manual [3], volume 3B, chapters 17.16 and 17.17. Linux kernel support is expected in release 4.10 and is documented at [4].
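A minimal sketch, as an example only, of checking for CAT/CDP support and mounting the kernel resctrl interface described in [4]:

  grep -Eo 'cat_l3|cdp_l3' /proc/cpuinfo | sort -u    # CPU feature flags for L3 CAT / CDP
  mount -t resctrl resctrl /sys/fs/resctrl            # expose the resctrl filesystem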


3. Libvirt Interface

Libvirt support for CAT is underway; the patch series is currently at revision 7.

The libvirt interface changes are:

3.1 The capabilities xml has been extended to reveal cache information

<cache>
     <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>
       <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
     </bank>
     <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>
       <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
     </bank>
</cache>
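Assuming the patch series is merged in this form, the cache information could be inspected with, for example:

  virsh capabilities | grep -A 8 '<cache>'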

The new `cache` xml element shows that the host has two banks of type l3, i.e. Last Level Cache (LLC), one per processor socket. Each bank is 56320 KiB in size, and the cpus attribute indicates the physical CPUs associated with it: '0-21,44-65' for bank 0 and '22-43,66-87' for bank 1.

The control tag shows that the bank is controlled at scope L3, with a minimum possible allocation of 2816 KiB, and that 2816 KiB of it is currently reserved.

If the host has CDP (Code and Data Prioritization) enabled, the l3 cache control is split into code (L3CODE) and data (L3DATA) scopes.

The control tag will then be extended to:
...
 <control min='2816' reserved='2816' unit='KiB' scope='L3CODE'/>
 <control min='2816' reserved='2816' unit='KiB' scope='L3DATA'/>
...

The L3CODE and L3DATA scopes show that cache can be allocated for code and data usage respectively; the two scopes share the same underlying amount of l3 cache.
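On the kernel side, enabling CDP changes the resctrl schemata to carry separate code and data lines (per [4]); the following is only an illustration:

  mount -t resctrl -o cdp resctrl /sys/fs/resctrl
  cat /sys/fs/resctrl/schemata
  #   L3CODE:0=fffff;1=fffff
  #   L3DATA:0=fffff;1=fffff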

3.2 The domain xml has been extended to include a new cachetune element

<cputune>
   <vcpupin vcpu='0' cpuset='0'/>
   <vcpupin vcpu='1' cpuset='1'/>
   <vcpupin vcpu='2' cpuset='22'/>
   <vcpupin vcpu='3' cpuset='23'/>
   <cachetune id='0' host_id='0' type='l3' size='2816' unit='KiB' vcpus='0,1'/>
   <cachetune id='1' host_id='1' type='l3' size='2816' unit='KiB' vcpus='2,3'/>
...
</cputune>

This means the guest will have vcpus 0 and 1 running on the host's socket 0 with 2816 KiB of cache exclusively allocated to them, and vcpus 2 and 3 running on the host's socket 1 with another 2816 KiB of cache exclusively allocated to them.

Here we need to make sure vcpus 0 and 1 are pinned to pcpus of socket 0; refer to the capabilities entry
 <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>.

Similarly, vcpus 2 and 3 must be pinned to pcpus of socket 1; refer to the capabilities entry
 <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>.

3.3 Libvirt work flow for CAT


  1.  Create the qemu process and get its PID(s).
  2.  Define a new resource control domain, also known as a Class of Service (CLOS), under /sys/fs/resctrl; set the desired Cache Bit Mask (CBM) derived from the libvirt domain xml, place the qemu PID(s) in the new domain, and update the default schemata of the host accordingly (a rough sketch follows).
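A rough, hypothetical sketch of these steps in terms of the resctrl interface [4]; the group name, PID and bit masks below are made-up example values (2816 KiB corresponds to one bit of a 20-bit L3 mask on the host shown above):

  mkdir /sys/fs/resctrl/qemu-vm1                              # new CLOS for the guest
  echo "L3:0=1;1=fffff" > /sys/fs/resctrl/qemu-vm1/schemata   # CBM derived from <cachetune>
  echo 12345 > /sys/fs/resctrl/qemu-vm1/tasks                 # assign the qemu PID to the CLOS
  echo "L3:0=ffffe;1=fffff" > /sys/fs/resctrl/schemata        # update the host default schemata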

4. Proposed Nova Changes


  1.  Get host cache capabilities from libvirt and extend the compute node fields to record them.
  2.  Add a new scheduler filter and weigher to help select a host for the requested guest.
  3.  Extend the flavor (and image metadata) extra spec fields:

We need to specify NUMA settings for NUMA hosts if we want to enable CAT; see [5] to learn more about NUMA placement in Nova.
In the flavor, we can have:

vcpus=8
mem=4
hw:numa_nodes=2           // number of NUMA nodes to expose to the guest
hw:numa_cpus.0=0,1,2,3,4,5
hw:numa_cpus.1=6,7
hw:numa_mem.0=3072
hw:numa_mem.1=1024
// newly added in this proposal
hw:cache_banks=2          // cache banks to be allocated to the guest (can be fewer than the number of NUMA nodes)
hw:cache_type.0=l3        // cache bank type, either l3 or l3code + l3data
hw:cache_type.1=l3_c+d    // cache bank type, either l3 or l3code + l3data
hw:cache_vcpus.0=0,1      // vcpu list for this cache bank, can be empty
hw:cache_vcpus.1=6,7
hw:cache_l3.0=2816        // cache size in KiB
hw:cache_l3_code.1=2816
hw:cache_l3_data.1=2816
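For illustration only, assuming the proposed hw:cache_* keys above are accepted with these names (they do not exist in Nova today), they could be set on a flavor with the standard client; the flavor name and values are hypothetical:

  openstack flavor create cat.demo --vcpus 8 --ram 4096 --disk 20
  openstack flavor set cat.demo \
    --property hw:numa_nodes=2 \
    --property hw:numa_cpus.0=0,1,2,3,4,5 \
    --property hw:numa_cpus.1=6,7 \
    --property hw:numa_mem.0=3072 \
    --property hw:numa_mem.1=1024 \
    --property hw:cache_banks=2 \
    --property hw:cache_type.0=l3 \
    --property hw:cache_vcpus.0=0,1 \
    --property hw:cache_l3.0=2816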

Here the user can see clearly which vcpus will benefit from cache allocation. A cache bank works together with a NUMA cell: the cache itself is allocated on a physical CPU socket, but the cache bank is a logical concept. Each cache bank allocates cache for a list of vcpus, and all vcpus in that list should be grouped on the same NUMA cell.

In addition, Nova will set the <cachetune> elements in the libvirt domain xml; see 3.2 for details.

This will allocate 2 cache banks from the host's cache banks and associate vcpus with them.
In the example, the guest will have vcpus 0 and 1 running on socket 0 of the host with 2816 KiB of l3 cache for exclusive use, and vcpus 6 and 7 running on socket 1 of the host with 2816 KiB of l3 code cache and 2816 KiB of l3 data cache allocated.

If a NUMA cell contains multiple CPU sockets (which is rare), we will adjust the NUMA vCPU placement policy to ensure that vCPUs and the cache allocated to them are co-located on the same socket.


  *   We can define fewer cache banks than NUMA cells on a multi-NUMA-cell node.
  *   No cache_vcpus parameter needs to be specified if no reservation is desired.

NOTE: the cache allocation for a guest is in isolated/exclusive mode.
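As an illustration of what exclusive means at the resctrl level (example values matching the sketch in 3.3): the CBM bits given to the guest's CLOS are removed from the default group, so no other task on the host shares that portion of the LLC.

  cat /sys/fs/resctrl/schemata            # host default, e.g. L3:0=ffffe;1=fffff
  cat /sys/fs/resctrl/qemu-vm1/schemata   # guest CLOS,   e.g. L3:0=1;1=fffff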

References

[1] http://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html
[2] https://blueprints.launchpad.net/nova/+spec/support-perf-event
[3] http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
[4] https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/tree/Documentation/x86/intel_rdt_ui.txt?h=x86/cache
[5] https://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-numa-placement.html


Best Regards

Eli Qiao (乔立勇), OpenStack Core team, OTC Intel.
--
