Hi all,

Historically, both AMD EPYC and Intel Xeon had NUMA nodes within one CPU
socket.

Intel Haswell/Broadwell could use two rings to connect the cores within
one socket:

 +----------------------------+    +----------------------------+
 | +---+  +-----+  +----+     |    |     +-----+  +----+  +---+ |
 | +---+  +-----+  +----+ ring|    |ring +-----+  +----+  +---+ |
 | +---+  +-----+  +----+     |    |     +-----+  +----+  +---+ |
 +----------------------------+    +----------------------------+

Because of the extra latency of accessing memory attached to the other
ring, cluster-on-die (COD) NUMA nodes did exist in Haswell/Broadwell.

The first-generation AMD EPYC used to have 4 dies in one CPU socket:

 +-------------------------------------------------+
 |                    AMD EPYC                      |
 |   +-------------+           +-------------+     |
 |   |             |-----------|             |     |
 |   |     die     | \       / |     die     |     |
 |   |             |  \     /  |             |     |
 |   +------|------+   \   /   +------|------+     |
 |          |             X           |            |
 |   +------|------+   /   \   +------|------+     |
 |   |             |  /     \  |             |     |
 |   |     die     | /       \ |     die     |     |
 |   |             |-----------|             |     |
 |   +-------------+           +-------------+     |
 +-------------------------------------------------+

These 4 dies could be configured as 4 different NUMA nodes.

However, as the hardware evolved, Intel and AMD no longer need software
to be aware of NUMA nodes within a single CPU. Intel has moved to a mesh
interconnect since Skylake (server):

 +-------+-------+-------+-------+
 |       |       |       |       |
 +-------+-------+-------+-------+
 |       |       |       |       |
 +-------+-------+-------+-------+
 |       |       |       |       |
 +-------+-------+-------+-------+
 |       |       |       |       |
 +-------+-------+-------+-------+

AMD moved memory and I/O to a separate die, which made the whole CPU
effectively UMA in EPYC II:

 +---------+  +---------+  +---------+  +---------+
 | cpu die |  | cpu die |  | cpu die |  | cpu die |
 +---------+  +---------+  +---------+  +---------+
 +------------------------------------------------+
 |                                                |
 |               memory and I/O die               |
 |                                                |
 +------------------------------------------------+
 +---------+  +---------+  +---------+  +---------+
 | cpu die |  | cpu die |  | cpu die |  | cpu die |
 +---------+  +---------+  +---------+  +---------+

Skylake still offers SNC (sub-NUMA clustering) within a die to split up
its large mesh, but it doesn't really suffer from this kind of internal
sub-NUMA, because the latency difference between enabling and disabling
SNC is small. According to the "Intel 64 and IA-32 Architectures
Optimization Reference Manual"[1], for a typical 2P system, enabling or
disabling SNC makes only a minor difference in memory access latency
within one CPU:

SNC off:

 Using buffer size of 2000.000MB
 Measuring idle latencies (in ns)...
            Numa node
 Numa node       0       1
     0        81.9   153.1
     1       153.7    82.0

SNC on:

 Using buffer size of 2000.000MB
 Measuring idle latencies (in ns)...
            Numa node
 Numa node       0       1       2       3
     0        81.6    89.4   140.4   153.6
     1        86.5    78.5   144.3   162.8
     2       142.3   153.0    81.6    89.3
     3       144.5   162.8    85.5    77.4

As shown above, with sub-NUMA disabled the CPU accesses its local memory
with a latency of 81.9 ns; with sub-NUMA enabled, an SNC node accesses
its own memory at 81.6 ns, a gap of only 0.3 ns. So SNC doesn't really
matter on Xeon.
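By the way, for anyone who wants to see how a particular box exposes
this topology to the OS, the kernel exports the firmware's NUMA
distance matrix through sysfs. Below is a minimal Python sketch,
assuming a Linux host; note these are the ACPI SLIT relative distances
(e.g. 10 for local memory), not measured latencies in ns like the
tables above:

 #!/usr/bin/env python3
 # Print the NUMA distance matrix exposed by the kernel via sysfs.
 # Assumption: a Linux host where /sys/devices/system/node/node*/distance
 # holds the ACPI SLIT relative distances, not nanosecond latencies.
 import glob
 import os

 nodes = sorted(int(os.path.basename(p)[4:])
                for p in glob.glob("/sys/devices/system/node/node[0-9]*"))
 print("node " + " ".join(f"{n:>5}" for n in nodes))
 for n in nodes:
     with open(f"/sys/devices/system/node/node{n}/distance") as f:
         dists = f.read().split()
     print(f"{n:>4} " + " ".join(f"{d:>5}" for d in dists))

On a machine where SNC or DIE interleave is toggled, the number of
nodes and the off-diagonal values in this table change accordingly.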
However, the topology of the Kunpeng 920 has developed differently from
Intel and AMD. Like the first-generation AMD EPYC, the Kunpeng 920 has
two dies in one CPU; but unlike EPYC, which has only 8 cores in each
die, each die of the Kunpeng 920 has 24 or 32 cores. For a typical 2P
system, we can configure it as 2 NUMA nodes or 4 NUMA nodes:

 +------------------------------+  +------------------------------+
 |              CPU             |  |              CPU             |
 |  +----------+  +----------+  |  |  +----------+  +----------+  |
 |  |          |  |          |  |  |  |          |  |          |  |
 |  |   DIE    |  |   DIE    |  |  |  |   DIE    |  |   DIE    |  |
 |  |          |  |          |  |  |  |          |  |          |  |
 |  +----------+  +----------+  |  |  +----------+  +----------+  |
 |                              |  |                              |
 +------------------------------+  +------------------------------+

* 2 NUMA - DIE interleave enabled

In this case, the two dies become one NUMA node. The advantage is that
we get more CPUs in one NUMA node, which reduces CPU fragmentation and
helps deploy more virtual machines on the same host when we apply the
rule of pinning a VM within one NUMA node, compared with disabling DIE
interleave. But this has an obvious disadvantage: since memory is
interleaved across the two dies, the average memory access latency
increases considerably.

Enabling DIE interleave:

            Numa node
 Numa node       0       1
     0       95.68  199.21
     1      199.21   95.68

Disabling DIE interleave:

            Numa node
 Numa node       0       1       2       3
     0       85.79  104.33  189.95  209.00
     1      104.33   85.79  209.00  229.60
     2      189.95  209.00   85.79  104.33
     3      209.00  229.60  104.33   85.79

As shown above, one die can access its local memory with a latency of
85.79 ns, but when DIE interleave is enabled, the latency becomes
95.68 ns. That gap (about 10 ns, or roughly 11%) is not minor.

* 4 NUMA - each DIE becomes one NUMA node

In this way, NUMA-aware software can access its local memory with much
lower latency. Thus, we can gain some performance improvement in
OpenStack if a VM is deployed within one die and only accesses its
local memory. Our testing has shown a 4%+ performance improvement.
However, under the rule that a VM does not cross NUMA nodes, fewer VMs
can be deployed on the same host.

In order to achieve both goals, performance improvement and
anti-fragmentation, we are looking for some way of making the OpenStack
scheduler sub-NUMA aware. That means:

1. We disable DIE interleave, so a 2P system exposes 4 NUMA nodes.

2. We use the rule that a VM shouldn't cross a CPU, treating one CPU as
   a big NUMA node and each die as a smaller sub-NUMA node.

3. For guests, we can define different priorities: for high-priority
   VMs whose performance is a great concern, OpenStack will try to
   deploy them within one sub-NUMA node (a die on the Kunpeng 920); for
   low-priority VMs that can tolerate cross-die latency, they can be
   placed across sub-NUMA nodes, but still within the same CPU.

So basically, we'd like to make OpenStack aware of the sub-NUMA
topology within one CPU and support a rule which prevents crossing a
CPU but allows crossing a die. At the same time, the scheduler should
try its best to deploy important VMs within one sub-NUMA node and allow
relatively unimportant VMs to span two sub-NUMA nodes.

This might be done with customized OpenStack flavors according to [2],
but we are really looking for a flexible and automatic way. If the
OpenStack scheduler could recognize this kind of CPU topology and make
smart decisions accordingly, that would be nice.

Please let me know what you think. Alternative approaches are also
welcome.

[1] https://software.intel.com/content/www/us/en/develop/download/intel-64-and-i...
[2] https://docs.openstack.org/nova/pike/admin/cpu-topologies.html

Thanks
Barry
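P.S. To make the proposed rule a bit more concrete, below is a small,
self-contained Python sketch of the placement policy described above.
It is purely illustrative and not Nova code; all names in it (Die, Cpu,
pick_dies, the "high"/"low" priorities) are made up for this example:

 # Illustrative sketch only -- not Nova code. Types and names are
 # invented here; a real implementation would live in the scheduler.
 from dataclasses import dataclass
 from typing import List, Optional

 @dataclass
 class Die:
     id: int
     free_cores: int

 @dataclass
 class Cpu:                 # one socket = two dies on Kunpeng 920
     id: int
     dies: List[Die]

 def pick_dies(cpus: List[Cpu], vcpus: int, priority: str) -> Optional[List[Die]]:
     """Return the dies to place a VM on, or None if it doesn't fit.

     Rule: never cross a CPU (socket). High-priority VMs must fit in
     a single die (sub-NUMA node); low-priority VMs may span the two
     dies of one CPU.
     """
     # First, try to keep the VM inside one die on any CPU.
     for cpu in cpus:
         for die in cpu.dies:
             if die.free_cores >= vcpus:
                 return [die]
     if priority == "high":
         return None        # important VMs are not allowed to cross dies
     # Low-priority VMs may spread across the dies of a single CPU.
     for cpu in cpus:
         if sum(d.free_cores for d in cpu.dies) >= vcpus:
             return cpu.dies
     return None            # would have to cross CPUs -> reject

 # Example: a 2P Kunpeng 920 with 24-core dies, partially used.
 host = [Cpu(0, [Die(0, 10), Die(1, 12)]),
         Cpu(1, [Die(2, 20), Die(3, 4)])]
 print(pick_dies(host, 16, "high"))   # fits in die 2
 print(pick_dies(host, 22, "high"))   # None: no single die has 22 free
 print(pick_dies(host, 22, "low"))    # dies 0+1 of CPU 0 (10+12 >= 22)

The point is simply that the scheduler first tries to keep any VM
inside one die, rejects high-priority VMs that would have to cross
dies, and lets low-priority VMs spread over the two dies of a single
CPU, but never over two CPUs.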