[openstack-dev] [nova] RFC for Intel RDT/CAT Support in Nova for Virtual Machine QoS

Alex Xu soulxu at gmail.com
Wed Feb 22 16:29:30 UTC 2017


@Jay, actually I'm here for CAT. I also have another idea about the
proposal, so I'll catch you about it and we can sync all the ideas. :)

Thanks
Alex

2017-02-22 11:17 GMT-05:00 Jay Pipes <jaypipes at gmail.com>:

> Hi Eli,
>
> Sorry for top-posting. Just a quick note to say I had a good conversation
> on Monday about this with Sean Mooney. I think we have some ideas on how to
> model all of these resources in the new placement/resource providers schema.
>
> Are you at the PTG? If so, would be great to meet up to discuss...
>
> Best,
> -jay
>
> On 02/21/2017 05:38 AM, Qiao, Liyong wrote:
>
>> Hi folks:
>>
>>
>>
>> Seeking community input on an initial design for Intel Resource Director
>> Technology (RDT), in particular Cache Allocation Technology (CAT), in
>> OpenStack Nova, to protect workloads from co-resident noisy neighbors and
>> ensure quality of service (QoS).
>>
>>
>>
>> 1. What is Cache Allocation Technology (CAT)?
>>
>> Intel's RDT (Resource Director Technology) [1] is an umbrella of
>> *hardware* support to facilitate the monitoring and reservation of
>> shared resources such as cache, memory and network bandwidth towards
>> obtaining Quality of Service. RDT enables fine-grained control of
>> resources, which is particularly valuable in cloud environments for
>> meeting Service Level Agreements while increasing resource utilization
>> through sharing. CAT is the part of RDT concerned with reserving, for a
>> process or group of processes, a portion of the last level cache, with
>> further fine-grained control over how much is used for code versus
>> data. Consider a single processor composed of 4 cores and its cache
>> hierarchy: the L1 cache is split into Instruction and Data, and the L2
>> cache is next in speed to L1; the L1 and L2 caches are per core. The
>> Last Level Cache (LLC) is shared among all cores. With CAT on the
>> currently available hardware, the LLC can be partitioned on a
>> per-process (virtual machine, container, or normal application) or
>> process-group basis.
>>
>>
>>
>> Libvirt and OpenStack [2] already support monitoring of cache occupancy
>> (CMT), memory bandwidth usage local to a processor socket (MBM_local),
>> and total memory bandwidth usage across all processor sockets
>> (MBM_total) for a process or process group.
>>
>>
>>
>>
>> 2. How CAT works
>>
>> To learn more about CAT please refer to the Intel Software Developer's
>> Manual
>> <http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html>,
>> volume 3b, chapters 17.16 and 17.17 [3]. Linux kernel support for the
>> same is expected in release 4.10 and is documented at [4].
>>
>>
>> 3. Libvirt Interface
>>
>>
>> Libvirt support for CAT is underway; the patch is currently at revision 7.
>>
>>
>>
>> Interface changes in libvirt:
>>
>>
>>
>> 3.1 The capabilities XML has been extended to reveal cache information
>>
>>
>>
>> <cache>
>>   <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>
>>     <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
>>   </bank>
>>   <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>
>>     <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
>>   </bank>
>> </cache>
>>
>>
>>
>> The new `cache` XML element shows that the host has two *banks* of
>> *type* l3, i.e. Last Level Cache (LLC), one per processor socket. Each
>> bank's *size* is 56320 KiB, and the *cpus* attribute indicates the
>> physical CPUs associated with it, here '0-21,44-65' and '22-43,66-87'
>> respectively.
>>
>> The *control* tag shows that the bank belongs to scope L3, with a
>> minimum possible allocation of 2816 KiB, and that 2816 KiB of it must
>> remain reserved.
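>>
>> (Assuming the minimum allocation corresponds to a single bit of the
>> hardware Cache Bit Mask (CBM, see section 3.3), these numbers imply a
>> 20-way LLC: 56320 KiB / 2816 KiB = 20 CBM bits, so any allocation would
>> be a multiple of 2816 KiB.)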
>>
>>
>>
>> If the host has CDP (Code and Data Prioritization) enabled, the l3
>> cache will be divided into code (L3CODE) and data (L3DATA).
>>
>>
>>
>> The control tag will then be extended to:
>>
>> ...
>>  <control min='2816' reserved='2816' unit='KiB' scope='L3CODE'/>
>>  <control min='2816' reserved='2816' unit='KiB' scope='L3DATA'/>
>>
>>
>> The L3CODE and L3DATA scopes show that we can allocate cache for code
>> and data usage respectively; they share the same amount of l3 cache.
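>>
>> As a quick illustration of how a management layer could consume this
>> proposed element once the libvirt patch lands, here is a minimal Python
>> sketch. It assumes the <cache> element appears under <host> in the
>> capabilities XML and that the attribute names match the example above;
>> since the patch is still under review, treat it as a sketch rather than
>> a final API.
>>
>> import xml.etree.ElementTree as ET
>> import libvirt
>>
>> conn = libvirt.open("qemu:///system")
>> caps = ET.fromstring(conn.getCapabilities())
>>
>> # Assumed location of the proposed element: /capabilities/host/cache.
>> for bank in caps.findall("./host/cache/bank"):
>>     print("bank %s: type=%s size=%s%s cpus=%s" % (
>>         bank.get("id"), bank.get("type"), bank.get("size"),
>>         bank.get("unit"), bank.get("cpus")))
>>     for ctrl in bank.findall("control"):
>>         print("  control: min=%s%s scope=%s" % (
>>             ctrl.get("min"), ctrl.get("unit"), ctrl.get("scope")))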
>>
>>
>>
>> 3.2 Domain XML extended to include a new cachetune element
>>
>>
>>
>> <cputune>
>>    <vcpupin vcpu='0' cpuset='0'/>
>>    <vcpupin vcpu='1' cpuset='1'/>
>>    <vcpupin vcpu='2' cpuset='22'/>
>>    <vcpupin vcpu='3' cpuset='23'/>
>>    <cachetune id='0' host_id='0' type='l3' size='2816' unit='KiB' vcpus='0,1'/>
>>    <cachetune id='1' host_id='1' type='l3' size='2816' unit='KiB' vcpus='2,3'/>
>>    ...
>> </cputune>
>>
>>
>>
>> This means the guest will have vcpus 0 and 1 running on the host's
>> socket 0 with 2816 KiB of cache exclusively allocated to them, and
>> vcpus 2 and 3 running on the host's socket 1 with 2816 KiB of cache
>> exclusively allocated to them.
>>
>>
>>
>> Here we need to make sure vcpus 0 and 1 are pinned to the pCPUs of
>> socket 0; refer to the capabilities entry
>>
>>  <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>
>>
>> Likewise, we need to make sure vcpus 2 and 3 are pinned to the pCPUs of
>> socket 1; refer to the capabilities entry
>>
>>  <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>
>>
>>
>>
>> 3.3 Libvirt workflow for CAT
>>
>>
>>
>>  1. Create the qemu process and get its PIDs.
>>  2. Define a new resource control group, also known as a
>>     Class of Service (CLOS), under /sys/fs/resctrl, set the desired
>>     Cache Bit Mask (CBM) derived from the libvirt domain XML, and
>>     update the default schemata of the host (see the sketch below).
>>
>>
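>> To make step 2 above concrete, here is a minimal Python sketch of the
>> resctrl operations described in [4]. The group name, CBM values and
>> PIDs are illustrative only, and updating the host's default schemata to
>> exclude the reserved bits is left out, so this is a simplification of
>> what libvirt itself will do.
>>
>> import os
>>
>> RESCTRL = "/sys/fs/resctrl"  # kernel 4.10+ resctrl filesystem, see [4]
>>
>> def create_clos(name, cbm_by_cache_id, pids):
>>     """Create a CLOS, set its L3 CBM per cache id, and attach PIDs."""
>>     group = os.path.join(RESCTRL, name)
>>     os.makedirs(group, exist_ok=True)
>>
>>     # e.g. {0: "f", 1: "fffff"} becomes the schemata line "L3:0=f;1=fffff"
>>     line = "L3:" + ";".join(
>>         "%d=%s" % (i, cbm) for i, cbm in sorted(cbm_by_cache_id.items()))
>>     with open(os.path.join(group, "schemata"), "w") as f:
>>         f.write(line + "\n")
>>
>>     # Move the qemu process into the new resource group, one task id
>>     # per write as the resctrl interface expects.
>>     with open(os.path.join(group, "tasks"), "w") as f:
>>         for pid in pids:
>>             f.write("%d\n" % pid)
>>             f.flush()
>>
>> # Hypothetical usage: give the guest 4 of the 20 CBM bits on cache id 0.
>> # create_clos("instance-0001", {0: "f", 1: "fffff"}, [qemu_pid])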
>>
>> 4. Proposed Nova Changes
>>
>>
>>
>>  1. Get the host's cache capabilities from libvirt and extend the
>>     compute node fields accordingly.
>>  2. Add a new scheduler filter and weigher to help pick a host for the
>>     requested guest (a sketch of such a filter appears at the end of
>>     this section).
>>  3. Extend the flavor's (and image metadata's) extra spec fields:
>>
>>
>>
>> We need to specify NUMA settings for NUMA hosts if we want to enable
>> CAT; see [5] to learn more about NUMA in Nova.
>>
>> In the flavor, we can have:
>>
>>
>>
>> vcpus=8
>> mem=4
>> hw:numa_nodes=2          // number of NUMA nodes to expose to the guest
>> hw:numa_cpus.0=0,1,2,3,4,5
>> hw:numa_cpus.1=6,7
>> hw:numa_mem.0=3072
>> hw:numa_mem.1=1024
>>
>> // newly added in this proposal
>> hw:cache_banks=2         // cache banks to be allocated to the guest
>>                          // (can be fewer than the number of NUMA nodes)
>> hw:cache_type.0=l3       // cache bank type: l3, or l3data + l3code
>> hw:cache_type.1=l3_c+d   // cache bank type: l3, or l3data + l3code
>> hw:cache_vcpus.0=0,1     // vcpu list on the cache bank (can be empty)
>> hw:cache_vcpus.1=6,7
>> hw:cache_l3.0=2816       // cache size in KiB
>> hw:cache_l3_code.1=2816
>> hw:cache_l3_data.1=2816
>>
>>
>>
>> Here the user can be clear about which vcpus will benefit from cache
>> allocation. A cache bank has to work together with a NUMA cell: cache
>> is allocated on a physical CPU socket, but the cache bank here is a
>> logical concept. A cache bank allocates cache for a vcpu list, and each
>> vcpu list should be grouped on a single NUMA cell.
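>>
>> To illustrate how these proposed keys group together per bank, here is
>> a small Python sketch; only the hw:cache_* key names come from the
>> proposal, while the helper itself and its place in Nova are
>> hypothetical.
>>
>> # Hypothetical helper: group the proposed hw:cache_* extra specs by
>> # bank index.
>> def parse_cache_request(extra_specs):
>>     banks = int(extra_specs.get("hw:cache_banks", 0))
>>     request = []
>>     for i in range(banks):
>>         request.append({
>>             "type": extra_specs.get("hw:cache_type.%d" % i, "l3"),
>>             "vcpus": extra_specs.get("hw:cache_vcpus.%d" % i, ""),
>>             "l3": int(extra_specs.get("hw:cache_l3.%d" % i, 0)),
>>             "l3_code": int(extra_specs.get("hw:cache_l3_code.%d" % i, 0)),
>>             "l3_data": int(extra_specs.get("hw:cache_l3_data.%d" % i, 0)),
>>         })
>>     return request
>>
>> # With the flavor above this yields two banks:
>> #   [{'type': 'l3',     'vcpus': '0,1', 'l3': 2816,
>> #     'l3_code': 0,    'l3_data': 0},
>> #    {'type': 'l3_c+d', 'vcpus': '6,7', 'l3': 0,
>> #     'l3_code': 2816, 'l3_data': 2816}]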
>>
>>
>>
>> In addition, Nova will generate the <cachetune> elements in the libvirt
>> domain XML; see 3.2 for details.
>>
>>
>>
>> This will allocate 2 cache banks from the host's cache banks and
>> associate vcpus with them.
>>
>> In the example, the guest will have vcpus 0 and 1 running on socket 0
>> of the host with 2816 KiB of cache for exclusive use, and vcpus 6 and 7
>> running on socket 1 of the host with 2816 KiB of l3 code cache and
>> 2816 KiB of l3 data cache allocated.
>>
>>
>>
>> If a NUMA cell were to contain multiple CPU sockets (this is rare), we
>> will adjust the NUMA vCPU placement policy to ensure that vCPUs and the
>> cache allocated to them are co-located on the same socket.
>>
>>
>>
>>   * We can define fewer cache banks than NUMA cells on a node with
>>     multiple NUMA cells.
>>   * No cache_vcpus parameter needs to be specified if no reservation is
>>     desired.
>>
>>
>>
>> NOTE: the cache allocation for a guest is in isolated/exclusive mode.
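>>
>> To make item 2 of the proposed Nova changes concrete, here is a
>> minimal, non-authoritative sketch of what a cache-aware scheduler
>> filter could look like. Only the BaseHostFilter interface is existing
>> Nova API; the CacheFilter name, the free_cache_kb host attribute, and
>> the reuse of the parse_cache_request() helper sketched above are all
>> hypothetical.
>>
>> from nova.scheduler import filters
>>
>> class CacheFilter(filters.BaseHostFilter):
>>     """Reject hosts that cannot satisfy the flavor's hw:cache_* request."""
>>
>>     def host_passes(self, host_state, spec_obj):
>>         request = parse_cache_request(spec_obj.flavor.extra_specs)
>>         if not request:
>>             return True  # no cache reservation asked for
>>
>>         # Assume host_state was extended (item 1 above) with the free
>>         # L3 cache per bank, e.g. {0: 45056, 1: 50688} in KiB.
>>         free = dict(getattr(host_state, "free_cache_kb", {}))
>>         for bank in request:
>>             needed = bank["l3"] + bank["l3_code"] + bank["l3_data"]
>>             fit = next((b for b, kb in free.items() if kb >= needed), None)
>>             if fit is None:
>>                 return False
>>             free[fit] -= needed
>>         return True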
>>
>>
>>
>> References
>>
>>
>>
>> [1] http://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html
>>
>> [2] https://blueprints.launchpad.net/nova/+spec/support-perf-event
>>
>> [3] http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
>>
>> [4] https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/tree/Documentation/x86/intel_rdt_ui.txt?h=x86/cache
>>
>> [5] https://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-numa-placement.html
>>
>>
>>
>>
>>
>>
>> Best Regards
>>
>>
>>
>> Eli Qiao (乔立勇), OpenStack Core team, OTC Intel.
>>
>> --
>>
>>
>>
>>
>>