NUMATopologyFilter and AMD Epyc Rome
Hi all,

We’re running into an issue with deploying our infrastructure to run high-throughput, low-latency workloads.

Background: we run Lenovo SR635 systems with an AMD EPYC 7502P processor. In the BIOS of this system, we can define the number of NUMA cells per socket (called NPS); we can set 1, 2 or 4. As we also run a 2x 100Gbit/s Mellanox CX5 in this system, we use the preferred-io setting in the BIOS to give preferred I/O throughput to the Mellanox CX5. To get performance as high as possible, we set NPS to 1, resulting in a single NUMA cell with 64 CPU threads available.

Next, in Nova (Train distribution), we demand huge pages. Huge pages, however, require a NUMA topology, but as this is one large NUMA cell, even with cpu=dedicated or requesting a single NUMA domain, we fail:

compute03, compute03 fails NUMA topology requirements. No host NUMA topology while the instance specified one. host_passes /usr/lib/python3/dist-packages/nova/scheduler/filters/numa_topology_filter.py:119

Any idea how to counter this? Setting NPS-2 will create two NUMA domains, but also cut our performance way down.

Thanks!

Regards,
Eyle
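For context, the huge-page and dedicated-CPU demands described above are usually expressed as Nova flavor extra specs; a minimal sketch (the flavor name is illustrative):

```shell
# Request huge pages, pinned CPUs and a single guest NUMA node via
# flavor extra specs (flavor name "hp.large" is just an example).
openstack flavor set hp.large \
  --property hw:mem_page_size=large \
  --property hw:cpu_policy=dedicated \
  --property hw:numa_nodes=1
```

Any of these properties implies a guest NUMA topology, which is why the scheduler then requires a host NUMA topology to pin against.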
On Thu, 2020-11-19 at 12:00 +0000, Eyle Brinkhuis wrote:
Hi all,
We’re running into an issue with deploying our infrastructure to run high throughput, low latency workloads.
Background:
We run Lenovo SR635 systems with an AMD Epyc 7502P processor. In the BIOS of this system, we are able to define the amount of NUMA cells per socket (called NPS). We can set 1, 2 or 4. As we run a 2x 100Gbit/s Mellanox CX5 in this system as well, we use the preferred-io setting in the BIOS to give preferred io throughput to the Mellanox CX5. To make sure we get as high performance as possible, we set the NPS setting to 1, resulting in a single numa cell with 64 CPU threads available.
Next, in Nova (train distribution), we demand huge pages. Hugepages however, demands a NUMAtopology, but as this is one large NUMA cell, even with cpu=dedicated or requesting a single numa domain, we fail:
compute03, compute03 fails NUMA topology requirements. No host NUMA topology while the instance specified one. host_passes /usr/lib/python3/dist-packages/nova/scheduler/filters/numa_topology_filter.py:119
Oh, this is interesting. This would suggest that when NPS is configured to 1, the host is presented as a UMA system and libvirt doesn't present topology information for us to parse. That seems odd and goes against how I thought newer versions of libvirt worked. What do you see when you run, e.g.:

$ virsh capabilities | xmllint --xpath '/capabilities/host/topology' -
Any idea how to counter this? Setting NPS-2 will create two NUMA domains, but also cut our performance way down.
It's worth noting that by setting NPS to 1, you're already cutting your performance. This makes it look like you've got a single NUMA node but of course, that doesn't change the physical design of the chip: there are still multiple memory controllers, some of which will be slower to access from certain cores. You're simply mixing best and worst case performance to provide an average. You said you have two SR-IOV NICs. I assume you're bonding these NICs? If not, you could set NPS to 2 and then ensure the NICs are in PCI slots that correspond to different NUMA nodes. You can validate this configuration using tools like 'lstopo' and 'numactl'.

Stephen
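Beyond lstopo and numactl, the kernel exposes each device's NUMA node directly in sysfs; a small illustrative sketch (the device name and sysfs root are assumptions for the example):

```python
import os


def nic_numa_node(dev, sysfs_root="/sys/class/net"):
    """Return the NUMA node of a NIC as reported by sysfs, or None.

    The kernel reports -1 when no NUMA affinity is known, which may be
    what an NPS=1 host presents.
    """
    path = os.path.join(sysfs_root, dev, "device", "numa_node")
    with open(path) as f:
        node = int(f.read().strip())
    return node if node >= 0 else None
```

For example, `nic_numa_node("enp65s0f0")` on a real host would tell you which node that port's PCI slot hangs off.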
Thanks!
Regards,
Eyle
On Thu, 2020-11-19 at 12:25 +0000, Stephen Finucane wrote:
Oh, this is interesting. This would suggest that when NPS is configured to 1, the host is presented as a UMA system and libvirt doesn't present topology information for us to parse. That seems odd and goes against how I thought newer versions of libvirt worked.
What do you see when you run, e.g.:
$ virsh capabilities | xmllint --xpath '/capabilities/host/topology' -
Also, what version of libvirt are you using? Past investigations [1] led me to believe that libvirt would now always present a NUMA topology for hosts, even if those hosts were in fact UMA. [1] https://github.com/openstack/nova/commit/c619c3b5847de85b21ffcbf750c10421d8b...
Any idea how to counter this? Setting NPS-2 will create two NUMA domains, but also cut our performance way down.
It's worth noting that by setting NPS to 1, you're already cutting your performance. This makes it look like you've got a single NUMA node but of course, that doesn't change the physical design of the chip: there are still multiple memory controllers, some of which will be slower to access from certain cores. You're simply mixing best and worst case performance to provide an average. You said you have two SR-IOV NICs. I assume you're bonding these NICs? If not, you could set NPS to 2 and then ensure the NICs are in PCI slots that correspond to different NUMA nodes. You can validate this configuration using tools like 'lstopo' and 'numactl'.
Stephen
Thanks!
Regards,
Eyle
Hi Stephen,

We run:
Compiled against library: libvirt 5.4.0
Using library: libvirt 5.4.0
Using API: QEMU 5.4.0
Running hypervisor: QEMU 4.0.0

ubuntu@compute02:~$ virsh capabilities | xmllint --xpath '/capabilities/host/topology' -
XPath set is empty
(On a node with NPS-1)

compute03:~$ virsh capabilities | xmllint --xpath '/capabilities/host/topology' -
<topology>
  <cells num="2">
    <cell id="0">
      <memory unit="KiB">65854792</memory>
      <pages unit="KiB" size="4">2383698</pages>
      <pages unit="KiB" size="2048">27500</pages>
      <pages unit="KiB" size="1048576">0</pages>
      <distances>
        <sibling id="0" value="10"/>
        <sibling id="1" value="12"/>
      </distances>
      <cpus num="32">
        <cpu id="0" socket_id="0" core_id="0" siblings="0,32"/>
        <cpu id="1" socket_id="0" core_id="1" siblings="1,33"/>
        <cpu id="2" socket_id="0" core_id="2" siblings="2,34"/>
        <cpu id="3" socket_id="0" core_id="3" siblings="3,35"/>
        <cpu id="4" socket_id="0" core_id="4" siblings="4,36"/>
        <cpu id="5" socket_id="0" core_id="5" siblings="5,37"/>
        <cpu id="6" socket_id="0" core_id="6" siblings="6,38"/>
        <cpu id="7" socket_id="0" core_id="7" siblings="7,39"/>
        <cpu id="8" socket_id="0" core_id="8" siblings="8,40"/>
        <cpu id="9" socket_id="0" core_id="9" siblings="9,41"/>
        <cpu id="10" socket_id="0" core_id="10" siblings="10,42"/>
        <cpu id="11" socket_id="0" core_id="11" siblings="11,43"/>
        <cpu id="12" socket_id="0" core_id="12" siblings="12,44"/>
        <cpu id="13" socket_id="0" core_id="13" siblings="13,45"/>
        <cpu id="14" socket_id="0" core_id="14" siblings="14,46"/>
        <cpu id="15" socket_id="0" core_id="15" siblings="15,47"/>
        <cpu id="32" socket_id="0" core_id="0" siblings="0,32"/>
        <cpu id="33" socket_id="0" core_id="1" siblings="1,33"/>
        <cpu id="34" socket_id="0" core_id="2" siblings="2,34"/>
        <cpu id="35" socket_id="0" core_id="3" siblings="3,35"/>
        <cpu id="36" socket_id="0" core_id="4" siblings="4,36"/>
        <cpu id="37" socket_id="0" core_id="5" siblings="5,37"/>
        <cpu id="38" socket_id="0" core_id="6" siblings="6,38"/>
        <cpu id="39" socket_id="0" core_id="7" siblings="7,39"/>
        <cpu id="40" socket_id="0" core_id="8" siblings="8,40"/>
        <cpu id="41" socket_id="0" core_id="9" siblings="9,41"/>
        <cpu id="42" socket_id="0" core_id="10" siblings="10,42"/>
        <cpu id="43" socket_id="0" core_id="11" siblings="11,43"/>
        <cpu id="44" socket_id="0" core_id="12" siblings="12,44"/>
        <cpu id="45" socket_id="0" core_id="13" siblings="13,45"/>
        <cpu id="46" socket_id="0" core_id="14" siblings="14,46"/>
        <cpu id="47" socket_id="0" core_id="15" siblings="15,47"/>
      </cpus>
    </cell>
    <cell id="1">
      <memory unit="KiB">66014072</memory>
      <pages unit="KiB" size="4">2423518</pages>
      <pages unit="KiB" size="2048">27500</pages>
      <pages unit="KiB" size="1048576">0</pages>
      <distances>
        <sibling id="0" value="12"/>
        <sibling id="1" value="10"/>
      </distances>
      <cpus num="32">
        <cpu id="16" socket_id="0" core_id="16" siblings="16,48"/>
        <cpu id="17" socket_id="0" core_id="17" siblings="17,49"/>
        <cpu id="18" socket_id="0" core_id="18" siblings="18,50"/>
        <cpu id="19" socket_id="0" core_id="19" siblings="19,51"/>
        <cpu id="20" socket_id="0" core_id="20" siblings="20,52"/>
        <cpu id="21" socket_id="0" core_id="21" siblings="21,53"/>
        <cpu id="22" socket_id="0" core_id="22" siblings="22,54"/>
        <cpu id="23" socket_id="0" core_id="23" siblings="23,55"/>
        <cpu id="24" socket_id="0" core_id="24" siblings="24,56"/>
        <cpu id="25" socket_id="0" core_id="25" siblings="25,57"/>
        <cpu id="26" socket_id="0" core_id="26" siblings="26,58"/>
        <cpu id="27" socket_id="0" core_id="27" siblings="27,59"/>
        <cpu id="28" socket_id="0" core_id="28" siblings="28,60"/>
        <cpu id="29" socket_id="0" core_id="29" siblings="29,61"/>
        <cpu id="30" socket_id="0" core_id="30" siblings="30,62"/>
        <cpu id="31" socket_id="0" core_id="31" siblings="31,63"/>
        <cpu id="48" socket_id="0" core_id="16" siblings="16,48"/>
        <cpu id="49" socket_id="0" core_id="17" siblings="17,49"/>
        <cpu id="50" socket_id="0" core_id="18" siblings="18,50"/>
        <cpu id="51" socket_id="0" core_id="19" siblings="19,51"/>
        <cpu id="52" socket_id="0" core_id="20" siblings="20,52"/>
        <cpu id="53" socket_id="0" core_id="21" siblings="21,53"/>
        <cpu id="54" socket_id="0" core_id="22" siblings="22,54"/>
        <cpu id="55" socket_id="0" core_id="23" siblings="23,55"/>
        <cpu id="56" socket_id="0" core_id="24" siblings="24,56"/>
        <cpu id="57" socket_id="0" core_id="25" siblings="25,57"/>
        <cpu id="58" socket_id="0" core_id="26" siblings="26,58"/>
        <cpu id="59" socket_id="0" core_id="27" siblings="27,59"/>
        <cpu id="60" socket_id="0" core_id="28" siblings="28,60"/>
        <cpu id="61" socket_id="0" core_id="29" siblings="29,61"/>
        <cpu id="62" socket_id="0" core_id="30" siblings="30,62"/>
        <cpu id="63" socket_id="0" core_id="31" siblings="31,63"/>
      </cpus>
    </cell>
  </cells>
</topology>
(On a node with NPS-2)
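This topology XML is what Nova's libvirt driver consumes; a minimal illustrative sketch (not Nova's actual parsing code) of extracting the per-cell layout with the standard library:

```python
import xml.etree.ElementTree as ET


def summarize_topology(xml_text):
    """Summarize a libvirt <topology> element: per-cell memory and CPU ids."""
    root = ET.fromstring(xml_text)
    cells = []
    for cell in root.findall("./cells/cell"):
        cells.append({
            "id": int(cell.get("id")),
            "memory_kib": int(cell.find("memory").text),
            "cpu_ids": sorted(int(c.get("id")) for c in cell.findall("./cpus/cpu")),
        })
    return cells
```

Feeding it the NPS-2 output above would yield two cells of 32 threads each; on the NPS-1 node there is no `<topology>` element at all, which matches the empty XPath result.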
It's worth noting that by setting NPS to 1, you're already cutting your performance. This makes it look like you've got a single NUMA node but of course, that doesn't change the physical design of the chip: there are still multiple memory controllers, some of which will be slower to access from certain cores. You're simply mixing best and worst case performance to provide an average. You said you have two SR-IOV NICs. I assume you're bonding these NICs? If not, you could set NPS to 2 and then ensure the NICs are in PCI slots that correspond to different NUMA nodes. You can validate this configuration using tools like 'lstopo' and 'numactl'.

Our setup is a little different. We don't use any OVS or SR-IOV. We use FD.io's VPP, with networking-vpp as the switch, and use VPP's RDMA capabilities to haul packets left and right. Our performance tuning sessions on these machines, without an OpenStack setup (so throughput in VPP), showed that NPS-1 is the best setting for us. We are only using one CX5 by the way, and use both ports (2x100G) in a LACP setup for redundancy.
Thanks for your quick reply! Regards, Eyle
On 19 Nov 2020, at 13:31, Stephen Finucane <stephenfin@redhat.com> wrote:
On Thu, 2020-11-19 at 12:56 +0000, Eyle Brinkhuis wrote:
Interesting. When you set NPS to 4, did you ensure you have 1 PMD per NUMA node? When using DPDK you should normally have 1 PMD per NUMA node. The other thing to note is that you can't assume that the NIC, even if attached to socket 0, will be on NUMA node 0 when you set NPS=4; we have seen it on other NUMA nodes in some tests we have done. So if you only have 1 PMD per socket enabled, you would want to ensure it's on a core in the same NUMA node as the NIC.
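The "one PMD per NUMA node" rule above can be sanity-checked mechanically; a small illustrative helper (the core-to-node mapping would come from e.g. lscpu or sysfs, and is assumed as input here):

```python
def numa_nodes_without_pmd(pmd_cores, core_to_node):
    """Return the NUMA nodes that have no poll-mode-driver core assigned.

    pmd_cores: iterable of core ids running PMD threads.
    core_to_node: mapping of every host core id to its NUMA node.
    """
    all_nodes = set(core_to_node.values())
    covered = {core_to_node[c] for c in pmd_cores}
    return sorted(all_nodes - covered)
```

An empty result means every node has at least one PMD; combine it with the NIC's own node to check Sean's second point about PMD/NIC locality.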
Thanks for your quick reply!
Regards,
Eyle
On Thu, 2020-11-19 at 12:25 +0000, Stephen Finucane wrote:
On Thu, 2020-11-19 at 12:00 +0000, Eyle Brinkhuis wrote:
Hi all,
We’re running into an issue with deploying our infrastructure to run high throughput, low latency workloads.
Background:
We run Lenovo SR635 systems with an AMD Epyc 7502P processor. In the BIOS of this system, we are able to define the amount of NUMA cells per socket (called NPS). We can set 1, 2 or 4. As we run a 2x 100Gbit/s Mellanox CX5 in this system as well, we use the preferred-io setting in the BIOS to give preferred io throughput to the Mellanox CX5. To make sure we get as high performance as possible, we set the NPS setting to 1, resulting in a single numa cell with 64 CPU threads available.
From what data I have personally seen on this topic, this will pessimise your performance, and you should be setting it to 4. If you set it to 1 and place the test application on, for example, CPUs 60-64, you will see a performance reduction in comparison to CPUs 4-8. If you enable 4 NUMA nodes per socket, the 3 that do not have the NIC will have more or less the same performance, which should be better than the performance when it was on 60-64, but the one with the NIC will have better performance, which may actually exceed the performance you see with the VM/application running on cores 4-8. The preliminary data our performance engineers have seen shows that some workloads, like small-packet network I/O, can see a performance improvement of up to 30+% (DPDK's testpmd), and an 8% improvement in less memory-sensitive workloads, when setting NPS=4. I know Mohammed Naser looked into this too for VEXXHOST in the past and was seeing a similar effect. I'm not sure if you can share your general findings, but did you end up with NPS=4 in the end?
Next, in Nova (train distribution), we demand huge pages. Hugepages however, demands a NUMAtopology, but as this is one large NUMA cell, even with cpu=dedicated or requesting a single numa domain, we fail:
compute03, compute03 fails NUMA topology requirements. No host NUMA topology while the instance specified one. host_passes /usr/lib/python3/dist-packages/nova/scheduler/filters/numa_topology_filter.py:119
Oh, this is interesting. This would suggest that when NPS is configured to 1, the host is presented as a UMA system and libvirt doesn't present topology information for us to parse. That seems odd and goes against how I thought newer versions of libvirt worked.
What do you see when you run, e.g.:
$ virsh capabilities | xmllint --xpath '/capabilities/host/topology' -
Also, what version of libvirt are you using? Past investigations [1] led me to believe that libvirt would now always present a NUMA topology for hosts, even if those hosts were in fact UMA.
[1] https://github.com/openstack/nova/commit/c619c3b5847de85b21ffcbf750c10421d8b...
libvirt was broken on AMD systems with NPS=1 due to a workaround implemented for non-x86 architectures (https://bugzilla.redhat.com/show_bug.cgi?id=1860231). That should now be addressed, but only very recently.
Any idea how to counter this? Setting NPS-2 will create two NUMA domains, but also cut our performance way down.
It's worth noting that by setting NPS to 1, you're already cutting your performance. This makes it look like you've got a single NUMA node but of course, that doesn't change the physical design of the chip: there are still multiple memory controllers, some of which will be slower to access from certain cores. You're simply mixing best and worst case performance to provide an average. You said you have two SR-IOV NICs. I assume you're bonding these NICs? If not, you could set NPS to 2 and then ensure the NICs are in PCI slots that correspond to different NUMA nodes. You can validate this configuration using tools like 'lstopo' and 'numactl'.
Yep, setting it to 1 will just disable the reporting of the real NUMA topology, basically tying all the memory controllers in the socket together to act as one, but that generally increases latency and decreases performance. It also does not provide the information needed by the kernel or OpenStack to optimise.

Actually, that (setting NPS-2) should improve performance based on most benchmarks we have seen and the work we have been doing with AMD on this topic. The data I have reviewed so far indicates that the highest memory bandwidth and lowest latency occur when you expose all the NUMA nodes on the host by setting NPS to the largest value for your given CPU.

The main issue we have right now from an OpenStack point of view is SR-IOV: we support NUMA affinity but not socket affinity or NUMA distance. Socket affinity is what you want 80% of the time; NUMA distance is much more complex and is what you actually want, but socket affinity is a very good proxy for it. To use SR-IOV with NUMA guests, such as hugepage guests, you have to disable NUMA affinity for SR-IOV devices today if you have hosts with multiple NUMA nodes per socket and want to be able to use all cores. If not all VMs will use SR-IOV, then you can still use strict/legacy affinity instead of preferred.
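The relaxed affinity Sean describes is configured on the PCI alias; a hedged sketch for nova.conf (the alias name and vendor/product IDs are illustrative, check your own VF IDs with lspci -nn):

```ini
# nova.conf on the compute node: relax NUMA affinity for SR-IOV
# passthrough devices. Alias name and IDs are examples only.
[pci]
alias = {"name": "cx5-vf", "vendor_id": "15b3", "product_id": "1018", "numa_policy": "preferred"}
```

With "preferred", the scheduler will still try to place the guest on the device's NUMA node but will no longer fail when it can't.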
Stephen
Thanks!
Regards,
Eyle
participants (3)
- Eyle Brinkhuis
- Sean Mooney
- Stephen Finucane