Hi all,

Historically, both AMD EPYC and Intel Xeon had NUMA nodes within one CPU
socket.

Intel Haswell/Broadwell could use two rings to connect the cores within
one socket:

 +----------------------------+    +----------------------------+
 | +---+  +-----+  +----+     |    |     +-----+  +----+  +---+ |
 | +---+  +-----+  +----+ ring|    |ring +-----+  +----+  +---+ |
 | +---+  +-----+  +----+     |    |     +-----+  +----+  +---+ |
 +----------------------------+    +----------------------------+

Because of the extra latency of accessing memory attached to the other
ring, cluster-on-die (COD) NUMA nodes did exist in Haswell/Broadwell.

The first-generation AMD EPYC used to have 4 dies in one CPU socket:

 +-------------------------------------------------+
 |                    AMD EPYC                      |
 |   +-------------+           +-------------+     |
 |   |             |-----------|             |     |
 |   |     die     | \       / |     die     |     |
 |   |             |  \     /  |             |     |
 |   +------|------+   \   /   +------|------+     |
 |          |             X           |            |
 |   +------|------+   /   \   +------|------+     |
 |   |             |  /     \  |             |     |
 |   |     die     | /       \ |     die     |     |
 |   |             |-----------|             |     |
 |   +-------------+           +-------------+     |
 +-------------------------------------------------+

These 4 dies could be configured as 4 different NUMA nodes.

However, as the hardware evolved, Intel and AMD no longer need software
to be aware of NUMA nodes within a single CPU. Intel has moved to a mesh
interconnect since Skylake (server):

 +-------+-------+-------+-------+
 |       |       |       |       |
 +-------+-------+-------+-------+
 |       |       |       |       |
 +-------+-------+-------+-------+
 |       |       |       |       |
 +-------+-------+-------+-------+
 |       |       |       |       |
 +-------+-------+-------+-------+

AMD moved memory and I/O to a separate die, which made the whole CPU
effectively UMA in EPYC II:

 +---------+  +---------+  +---------+  +---------+
 | cpu die |  | cpu die |  | cpu die |  | cpu die |
 +---------+  +---------+  +---------+  +---------+
 +------------------------------------------------+
 |                                                |
 |               memory and I/O die               |
 |                                                |
 +------------------------------------------------+
 +---------+  +---------+  +---------+  +---------+
 | cpu die |  | cpu die |  | cpu die |  | cpu die |
 +---------+  +---------+  +---------+  +---------+

Skylake still offers SNC (sub-NUMA clustering) within a die to split up
its large mesh, but it doesn't really suffer from this kind of internal
sub-NUMA, because the latency difference between enabling and disabling
SNC is small. According to the "Intel 64 and IA-32 Architectures
Optimization Reference Manual"[1], for a typical 2P system, enabling or
disabling SNC makes only a minor difference in memory access latency
within one CPU:

SNC off:

 Using buffer size of 2000.000MB
 Measuring idle latencies (in ns)...
            Numa node
 Numa node       0       1
     0        81.9   153.1
     1       153.7    82.0

SNC on:

 Using buffer size of 2000.000MB
 Measuring idle latencies (in ns)...
            Numa node
 Numa node       0       1       2       3
     0        81.6    89.4   140.4   153.6
     1        86.5    78.5   144.3   162.8
     2       142.3   153.0    81.6    89.3
     3       144.5   162.8    85.5    77.4

As shown above, with sub-NUMA disabled the CPU accesses its local memory
with a latency of 81.9 ns; with sub-NUMA enabled, an SNC node accesses
its own memory at 81.6 ns, a gap of only 0.3 ns. So SNC doesn't really
matter on Xeon.
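By the way, for anyone who wants to see how a particular box exposes
this topology to the OS, the kernel exports the firmware's NUMA
distance matrix through sysfs. Below is a minimal Python sketch,
assuming a Linux host; note these are the ACPI SLIT relative distances
(e.g. 10 for local memory), not measured latencies in ns like the
tables above:

 #!/usr/bin/env python3
 # Print the NUMA distance matrix exposed by the kernel via sysfs.
 # Assumption: a Linux host where /sys/devices/system/node/node*/distance
 # holds the ACPI SLIT relative distances, not nanosecond latencies.
 import glob
 import os

 nodes = sorted(int(os.path.basename(p)[4:])
                for p in glob.glob("/sys/devices/system/node/node[0-9]*"))
 print("node " + " ".join(f"{n:>5}" for n in nodes))
 for n in nodes:
     with open(f"/sys/devices/system/node/node{n}/distance") as f:
         dists = f.read().split()
     print(f"{n:>4} " + " ".join(f"{d:>5}" for d in dists))

On a machine where SNC or DIE interleave is toggled, the number of
nodes and the off-diagonal values in this table change accordingly.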
However, the topology of the Kunpeng 920 has developed differently from
Intel and AMD. Like the first-generation AMD EPYC, the Kunpeng 920 has
two dies in one CPU; but unlike EPYC, which has only 8 cores in each
die, each die of the Kunpeng 920 has 24 or 32 cores. For a typical 2P
system, we can configure it as 2 NUMA nodes or 4 NUMA nodes:

 +------------------------------+  +------------------------------+
 |              CPU             |  |              CPU             |
 |  +----------+  +----------+  |  |  +----------+  +----------+  |
 |  |          |  |          |  |  |  |          |  |          |  |
 |  |   DIE    |  |   DIE    |  |  |  |   DIE    |  |   DIE    |  |
 |  |          |  |          |  |  |  |          |  |          |  |
 |  +----------+  +----------+  |  |  +----------+  +----------+  |
 |                              |  |                              |
 +------------------------------+  +------------------------------+

* 2 NUMA - DIE interleave enabled

In this case, the two dies become one NUMA node. The advantage is that
we get more CPUs in one NUMA node, which reduces CPU fragmentation and
helps deploy more virtual machines on the same host when we apply the
rule of pinning a VM within one NUMA node, compared with disabling DIE
interleave. But this has an obvious disadvantage: since memory is
interleaved across the two dies, the average memory access latency
increases considerably.

Enabling DIE interleave:

            Numa node
 Numa node       0       1
     0       95.68  199.21
     1      199.21   95.68

Disabling DIE interleave:

            Numa node
 Numa node       0       1       2       3
     0       85.79  104.33  189.95  209.00
     1      104.33   85.79  209.00  229.60
     2      189.95  209.00   85.79  104.33
     3      209.00  229.60  104.33   85.79

As shown above, one die can access its local memory with a latency of
85.79 ns, but when DIE interleave is enabled, the latency becomes
95.68 ns. That gap (about 10 ns, or roughly 11%) is not minor.

* 4 NUMA - each DIE becomes one NUMA node

In this way, NUMA-aware software can access its local memory with much
lower latency. Thus, we can gain some performance improvement in
OpenStack if a VM is deployed within one die and only accesses its
local memory. Our testing has shown a 4%+ performance improvement.
However, under the rule that a VM does not cross NUMA nodes, fewer VMs
can be deployed on the same host.

In order to achieve both goals, performance improvement and
anti-fragmentation, we are looking for some way of making the OpenStack
scheduler sub-NUMA aware. That means:

1. We disable DIE interleave, so a 2P system exposes 4 NUMA nodes.

2. We use the rule that a VM shouldn't cross a CPU, treating one CPU as
   a big NUMA node and each die as a smaller sub-NUMA node.

3. For guests, we can define different priorities: for high-priority
   VMs whose performance is a great concern, OpenStack will try to
   deploy them within one sub-NUMA node (a die on the Kunpeng 920); for
   low-priority VMs that can tolerate cross-die latency, they can be
   placed across sub-NUMA nodes, but still within the same CPU.

So basically, we'd like to make OpenStack aware of the sub-NUMA
topology within one CPU and support a rule which prevents crossing a
CPU but allows crossing a die. At the same time, the scheduler should
try its best to deploy important VMs within one sub-NUMA node and allow
relatively unimportant VMs to span two sub-NUMA nodes.

This might be done with customized OpenStack flavors according to [2],
but we are really looking for a flexible and automatic way. If the
OpenStack scheduler could recognize this kind of CPU topology and make
smart decisions accordingly, that would be nice.

Please let me know what you think. Alternative approaches are also
welcome.

[1] https://software.intel.com/content/www/us/en/develop/download/intel-64-and-i...
[2] https://docs.openstack.org/nova/pike/admin/cpu-topologies.html

Thanks
Barry
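P.S. To make the proposed rule a bit more concrete, below is a small,
self-contained Python sketch of the placement policy described above.
It is purely illustrative and not Nova code; all names in it (Die, Cpu,
pick_dies, the "high"/"low" priorities) are made up for this example:

 # Illustrative sketch only -- not Nova code. Types and names are
 # invented here; a real implementation would live in the scheduler.
 from dataclasses import dataclass
 from typing import List, Optional

 @dataclass
 class Die:
     id: int
     free_cores: int

 @dataclass
 class Cpu:                 # one socket = two dies on Kunpeng 920
     id: int
     dies: List[Die]

 def pick_dies(cpus: List[Cpu], vcpus: int, priority: str) -> Optional[List[Die]]:
     """Return the dies to place a VM on, or None if it doesn't fit.

     Rule: never cross a CPU (socket). High-priority VMs must fit in
     a single die (sub-NUMA node); low-priority VMs may span the two
     dies of one CPU.
     """
     # First, try to keep the VM inside one die on any CPU.
     for cpu in cpus:
         for die in cpu.dies:
             if die.free_cores >= vcpus:
                 return [die]
     if priority == "high":
         return None        # important VMs are not allowed to cross dies
     # Low-priority VMs may spread across the dies of a single CPU.
     for cpu in cpus:
         if sum(d.free_cores for d in cpu.dies) >= vcpus:
             return cpu.dies
     return None            # would have to cross CPUs -> reject

 # Example: a 2P Kunpeng 920 with 24-core dies, partially used.
 host = [Cpu(0, [Die(0, 10), Die(1, 12)]),
         Cpu(1, [Die(2, 20), Die(3, 4)])]
 print(pick_dies(host, 16, "high"))   # fits in die 2
 print(pick_dies(host, 22, "high"))   # None: no single die has 22 free
 print(pick_dies(host, 22, "low"))    # dies 0+1 of CPU 0 (10+12 >= 22)

The point is simply that the scheduler first tries to keep any VM
inside one die, rejects high-priority VMs that would have to cross
dies, and lets low-priority VMs spread over the two dies of a single
CPU, but never over two CPUs.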