[nova] NUMA scheduling
Hi,

I'm at a loss for finding good information about how a VM's vCPUs and Memory are assigned to NUMA nodes within a scheduled physical host. I think Libvirt does this, and the Nova Scheduler simply finds the right physical host to run the VM, and thus Nova has no input on which NUMA node to choose. So this might be a Libvirt question.

We are running Stein and have the issue where VMs launch on NUMA Node 0, and not on NUMA Node 1, in physical hosts with two processors, and are simply looking for a way to tell Libvirt to consider NUMA Node 1 when scheduling a VM, since nearly all of the available memory is on NUMA Node 1.

Our flavors are defined with hw:numa_nodes='1' since we want all vCPUs+Memory to land on a single NUMA Node, and so the guest OS has visibility that a single NUMA Node is being used.

We are "not" looking for a way to pin a VM to a specific NUMA node (such as for SR-IOV purposes).

Any suggestions where to look for the solution?

Thanks!
Eric
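(For reference, one way to check how libvirt actually placed a running guest relative to the host's NUMA nodes, assuming shell access on the compute host; the domain name and PID placeholders below are only examples:)

  virsh numatune instance-00000001     # memory mode/nodeset libvirt applied to the guest, if any
  numastat -p <qemu-kvm-pid>           # per-NUMA-node memory usage of the guest's qemu-kvm process
  numactl --hardware                   # free memory per host NUMA node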
As far as I know, numa_nodes=1 just means --> the resources for that VM should run on one NUMA node (so either NUMA0 or NUMA1). If there is space free on both, then it's probably going to pick one of the two?
I thought the same, but it appears that VMs are never scheduled on NUMA1 even though NUMA0 is full (causing OOM to trigger and kill running VMs). I would have hoped that a NUMA node was treated like a host, and thus "VMs being balanced across nodes".

The discussion on NUMA handling is long, so I was hoping that there might be information about the latest solution to the problem - or to be told that there isn't a good solution other than using huge pages.

Eric
We have been running with NUMA configured for a long time and I don't believe I have seen this behavior. It's important that you configure the flavors / aggregates correctly.

I think this might be what you are looking for:

openstack flavor set m1.large --property hw:cpu_policy=dedicated

https://docs.openstack.org/nova/pike/admin/cpu-topologies.html

Pretty sure we also set this for any flavor that only requires a single NUMA zone:

openstack flavor set m1.large --property hw:numa_nodes=1
> We have been running with NUMA configured for a long time and don't believe I have seen this behavior. It's important that you configure the flavors / aggregates correctly.
We are not looking for pinned CPUs - rather we want shared CPUs within a single NUMA node. Our flavor properties, for one particular flavor, are:

hw:cpu_cores='4', hw:cpu_policy='shared', hw:cpu_sockets='1', hw:numa_nodes='1'

We already have separate aggregates for dedicated and shared cpu_policy flavors.
> Pretty sure we also set this for any flavor that only requires a single NUMA zone:
> openstack flavor set m1.large --property hw:numa_nodes=1
I thought so too, but it doesn't look like the above properties are allowing VMs to be provisioned on the second NUMA node.
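(For reference, the properties listed above would typically be applied to a flavor with something like the following; the flavor name is only an example:)

  openstack flavor set m1.medium.numa \
    --property hw:cpu_policy=shared \
    --property hw:cpu_sockets=1 \
    --property hw:cpu_cores=4 \
    --property hw:numa_nodes=1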
This is very odd. I am running a NUMA-aware OpenStack cloud and my VMs are getting scheduled on both NUMA nodes. The following are my flavor settings. Also, I am using huge pages for performance. (Make sure you have the NUMATopologyFilter filter configured.)

hw:cpu_policy='dedicated', hw:cpu_sockets='2', hw:cpu_threads='2', hw:mem_page_size='large'

What if you remove hw:numa_nodes=1?

~S
Hi Satish,
> This is very odd. I am running a NUMA-aware OpenStack cloud and my VMs are getting scheduled on both NUMA nodes. The following are my flavor settings. Also, I am using huge pages for performance. (Make sure you have the NUMATopologyFilter filter configured.)
> hw:cpu_policy='dedicated', hw:cpu_sockets='2', hw:cpu_threads='2', hw:mem_page_size='large'
> What if you remove hw:numa_nodes=1?
Note that we are using a shared CPU policy (for various hosts). I don't know if this is causing our issue or not, but we definitely do not want to pin CPUs to VMs on these hosts.

Without the hw:numa_nodes property, an individual VM is created with its vCPUs and Memory divided between the two NUMA nodes, which is not what we would prefer. We would prefer, instead, to have all vCPUs and Memory for the VM placed into a single NUMA node so all cores of the VM have access to this NUMA node's memory instead of having one core require cross-NUMA communications. With large core processors and large amounts of memory, it doesn't make much sense to have small VMs (such as 4 core VMs) span two NUMA nodes.

With our current settings, every VM is placed into a single NUMA node (as we wanted), but they always land in NUMA node 0 and never in NUMA node 1. It does, however, appear that QEMU's memory overhead and Linux' buffer/cache is landing in NUMA node 1. Native processes on the hosts are spread between NUMA nodes.

We don't have huge pages enabled, so we have not enabled the NUMATopologyFilter.

Eric
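(For reference, the NUMATopologyFilter that Satish mentions is enabled in nova.conf on the scheduler, roughly as sketched below; the exact filter list should be whatever is already configured locally, with NUMATopologyFilter appended.)

  [filter_scheduler]
  enabled_filters = AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,NUMATopologyFilter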
What is the error thrown by Openstack when NUMA0 is full?
> What is the error thrown by Openstack when NUMA0 is full?
OOM is actually killing the QEMU process, which causes Nova to report:

/var/log/kolla/nova/nova-compute.log.4:2020-08-25 12:31:19.812 6 WARNING nova.compute.manager [req-62bddc53-ca8b-4bdc-bf41-8690fc88076f - - - - -] [instance: 8d8a262a-6e60-4e8a-97f9-14462f09b9e5] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 4

So, there isn't a NUMA or memory-specific error from Nova - Nova is simply scheduling a VM on a node that it thinks has enough memory, and Libvirt (or Nova?) is configuring the VM to use CPU cores on a full NUMA node.

NUMA Node 1 had about 240GiB of free memory with about 100GiB of buffer/cache space used, so plenty of free memory, whereas NUMA Node 0 was pretty tight on free memory.

These are some logs in /var/log/messages (not for the nova-compute.log entry above, but the same condition for a VM that was killed - logs were rolled, so I had to pick a different VM):
Oct 10 15:17:01 <redacted hostname> kernel: CPU 0/KVM invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
Oct 10 15:17:01 <redacted hostname> kernel: CPU: 15 PID: 30468 Comm: CPU 0/KVM Not tainted 5.3.8-1.el7.elrepo.x86_64 #1
Oct 10 15:17:01 <redacted hostname> kernel: Hardware name: <redacted hardware>
Oct 10 15:17:01 <redacted hostname> kernel: Call Trace:
Oct 10 15:17:01 <redacted hostname> kernel: dump_stack+0x63/0x88
Oct 10 15:17:01 <redacted hostname> kernel: dump_header+0x51/0x210
Oct 10 15:17:01 <redacted hostname> kernel: oom_kill_process+0x105/0x130
Oct 10 15:17:01 <redacted hostname> kernel: out_of_memory+0x105/0x4c0
…
…
Oct 10 15:17:01 <redacted hostname> kernel: active_anon:108933472 inactive_anon:174036 isolated_anon:0#012 active_file:21875969 inactive_file:2418794 isolated_file:32#012 unevictable:88113 dirty:0 writeback:4 unstable:0#012 slab_reclaimable:3056118 slab_unreclaimable:432301#012 mapped:71768 shmem:570159 pagetables:258264 bounce:0#012 free:58924792 free_pcp:326 free_cma:0
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 active_anon:382548916kB inactive_anon:173052kB active_file:0kB inactive_file:2272kB unevictable:289840kB isolated(anon):0kB isolated(file):128kB mapped:16696kB dirty:0kB writeback:0kB shmem:578812kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 286420992kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA free:15880kB min:0kB low:12kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15880kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 1589 385604 385604 385604
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA32 free:1535904kB min:180kB low:1780kB high:3380kB active_anon:90448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1717888kB managed:1627512kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:1008kB local_pcp:248kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 0 384015 384015 384015
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 Normal free:720756kB min:818928kB low:1212156kB high:1605384kB active_anon:382458300kB inactive_anon:173052kB active_file:0kB inactive_file:2272kB unevictable:289840kB writepending:0kB present:399507456kB managed:393231952kB mlocked:289840kB kernel_stack:58344kB pagetables:889796kB bounce:0kB free_pcp:296kB local_pcp:0kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 0 0 0 0
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA32: 1*4kB (U) 1*8kB (M) 0*16kB 9*32kB (UM) 11*64kB (UM) 12*128kB (UM) 12*256kB (UM) 11*512kB (UM) 11*1024kB (M) 1*2048kB (U) 369*4096kB (M) = 1535980kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 Normal: 76633*4kB (UME) 30442*8kB (UME) 7998*16kB (UME) 1401*32kB (UE) 6*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 723252kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 10 15:17:01 <redacted hostname> kernel: 24866489 total pagecache pages
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages in swap cache
Oct 10 15:17:01 <redacted hostname> kernel: Swap cache stats: add 0, delete 0, find 0/0
Oct 10 15:17:01 <redacted hostname> kernel: Free swap = 0kB
Oct 10 15:17:01 <redacted hostname> kernel: Total swap = 0kB
Oct 10 15:17:01 <redacted hostname> kernel: 200973631 pages RAM
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages HighMem/MovableOnly
Oct 10 15:17:01 <redacted hostname> kernel: 3165617 pages reserved
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages hwpoisoned
Oct 10 15:17:01 <redacted hostname> kernel: Tasks state (memory values in pages):
Oct 10 15:17:01 <redacted hostname> kernel: [ 2414] 0 2414 33478 20111 315392 0 0 systemd-journal
Oct 10 15:17:01 <redacted hostname> kernel: [ 2438] 0 2438 31851 540 143360 0 0 lvmetad
Oct 10 15:17:01 <redacted hostname> kernel: [ 2453] 0 2453 12284 1141 131072 0 -1000 systemd-udevd
Oct 10 15:17:01 <redacted hostname> kernel: [ 4170] 0 4170 13885 446 131072 0 -1000 auditd
Oct 10 15:17:01 <redacted hostname> kernel: [ 4393] 0 4393 5484 526 86016 0 0 irqbalance
Oct 10 15:17:01 <redacted hostname> kernel: [ 4394] 0 4394 6623 624 102400 0 0 systemd-logind
…
…
Oct 10 15:17:01 <redacted hostname> kernel: oom-kill:constraint=CONSTRAINT_MEMORY_POLICY,nodemask=0,cpuset=vcpu0,mems_allowed=0,global_oom,task_memcg=/machine.slice/machine-qemu\x2d237\x2dinstance\x2d0000fda8.scope,task=qemu-kvm,pid=25496,uid=42436
Oct 10 15:17:01 <redacted hostname> kernel: Out of memory: Killed process 25496 (qemu-kvm) total-vm:67989512kB, anon-rss:66780940kB, file-rss:11052kB, shmem-rss:4kB
Oct 10 15:17:02 <redacted hostname> kernel: oom_reaper: reaped process 25496 (qemu-kvm), now anon-rss:0kB, file-rss:36kB, shmem-rss:4kB
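(For reference, the per-NUMA-node memory situation described above can be confirmed on the compute host with standard tooling, for example:)

  numactl --hardware    # free vs. total memory per NUMA node
  numastat -m           # meminfo-style per-node breakdown, including file/buffer pages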
I would say try without "hw:numa_nodes=1" in flavor properties.
We already tested this long ago. I mentioned previously: Without the hw:numa_nodes property, an individual VM is created with its vCPUs and Memory divided between the two NUMA nodes, which is not what we would prefer. We would prefer, instead, to have all vCPUs and Memory for the VM placed into a single NUMA node so all cores of the VM have access to this NUMA node's memory instead of having one core require cross-NUMA communications.
or "hw:numa_nodes=2" to see if vm vcpu spreads to both zones. On Sat, Oct 17, 2020 at 1:41 PM Satish Patel <satish.txt@gmail.com> wrote:
Sorry to top post, but I was off on Friday. The issue is that hw:mem_page_size has not been set.

If you are using any NUMA feature you always need to set mem_page_size to a value. It does not matter which valid value you set it to, but you need to define it in the flavor or the image. If you do not, then you have not activated per-NUMA-node memory tracking in Nova and your VMs will eventually be killed by the OOM reaper.

The minimum valid NUMA-aware VM to create is hw:mem_page_size=any, which implicitly expands to hw:mem_page_size=any plus hw:numa_nodes=1, since 1 NUMA node is the default if you do not set hw:numa_nodes.

Today, when we generate a NUMA topology for hw:cpu_policy=dedicated, we implicitly set hw:numa_nodes=1 internally, but we do not define hw:mem_page_size=small/any/large. So if you simply define a flavor with hw:numa_nodes=1 or hw:cpu_policy=dedicated and no other extra specs, then technically that is an invalid flavor for the libvirt driver. hw:numa_nodes=1 is valid for the Hyper-V driver on its own, but not for the libvirt driver.

If you are using any NUMA feature with the libvirt driver, hw:mem_page_size in the flavor or hw_mem_page_size in the image must be set for Nova to correctly track and allocate memory for the VM.
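(For reference, a minimal sketch of the change Sean describes, applied to a flavor like the one earlier in the thread; the flavor name is only an example:)

  openstack flavor set m1.medium.numa \
    --property hw:numa_nodes=1 \
    --property hw:mem_page_size=small

hw:mem_page_size=any would also work if images should be allowed to request hugepages.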
or "hw:numa_nodes=2" to see if vm vcpu spreads to both zones.
On Sat, Oct 17, 2020 at 1:41 PM Satish Patel <satish.txt@gmail.com> wrote:
I would say try without "hw:numa_nodes=1" in flavor properties.
~S
On Sat, Oct 17, 2020 at 1:28 PM Eric K. Miller <emiller@genesishosting.com> wrote:
What is the error thrown by Openstack when NUMA0 is full?
OOM is actually killing the QEMU process, which causes Nova to report:
/var/log/kolla/nova/nova-compute.log.4:2020-08-25 12:31:19.812 6 WARNING nova.compute.manager [req-62bddc53-ca8b-4bdc-bf41-8690fc88076f - - - - -] [instance: 8d8a262a-6e60-4e8a-97f9-14462f09b9e5] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 4
So, there isn't a NUMA or memory-specific error from Nova - Nova is simply scheduling a VM on a node that it thinks has enough memory, and Libvirt (or Nova?) is configuring the VM to use CPU cores on a full NUMA node.
NUMA Node 1 had about 240GiB of free memory with about 100GiB of buffer/cache space used, so plenty of free memory, whereas NUMA Node 0 was pretty tight on free memory.
These are some logs in /var/log/messages (not for the nova-compute.log entry above, but the same condition for a VM that was killed - logs were rolled, so I had to pick a different VM):
Oct 10 15:17:01 <redacted hostname> kernel: CPU 0/KVM invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
Oct 10 15:17:01 <redacted hostname> kernel: CPU: 15 PID: 30468 Comm: CPU 0/KVM Not tainted 5.3.8-1.el7.elrepo.x86_64 #1
Oct 10 15:17:01 <redacted hostname> kernel: Hardware name: <redacted hardware>
Oct 10 15:17:01 <redacted hostname> kernel: Call Trace:
Oct 10 15:17:01 <redacted hostname> kernel: dump_stack+0x63/0x88
Oct 10 15:17:01 <redacted hostname> kernel: dump_header+0x51/0x210
Oct 10 15:17:01 <redacted hostname> kernel: oom_kill_process+0x105/0x130
Oct 10 15:17:01 <redacted hostname> kernel: out_of_memory+0x105/0x4c0
…
…
Oct 10 15:17:01 <redacted hostname> kernel: active_anon:108933472 inactive_anon:174036 isolated_anon:0#012 active_file:21875969 inactive_file:2418794 isolated_file:32#012 unevictable:88113 dirty:0 writeback:4 unstable:0#012 slab_reclaimable:3056118 slab_unreclaimable:432301#012 mapped:71768 shmem:570159 pagetables:258264 bounce:0#012 free:58924792 free_pcp:326 free_cma:0
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 active_anon:382548916kB inactive_anon:173052kB active_file:0kB inactive_file:2272kB unevictable:289840kB isolated(anon):0kB isolated(file):128kB mapped:16696kB dirty:0kB writeback:0kB shmem:578812kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 286420992kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA free:15880kB min:0kB low:12kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15880kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 1589 385604 385604 385604
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA32 free:1535904kB min:180kB low:1780kB high:3380kB active_anon:90448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1717888kB managed:1627512kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:1008kB local_pcp:248kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 0 384015 384015 384015
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 Normal free:720756kB min:818928kB low:1212156kB high:1605384kB active_anon:382458300kB inactive_anon:173052kB active_file:0kB inactive_file:2272kB unevictable:289840kB writepending:0kB present:399507456kB managed:393231952kB mlocked:289840kB kernel_stack:58344kB pagetables:889796kB bounce:0kB free_pcp:296kB local_pcp:0kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 0 0 0 0
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA32: 1*4kB (U) 1*8kB (M) 0*16kB 9*32kB (UM) 11*64kB (UM) 12*128kB (UM) 12*256kB (UM) 11*512kB (UM) 11*1024kB (M) 1*2048kB (U) 369*4096kB (M) = 1535980kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 Normal: 76633*4kB (UME) 30442*8kB (UME) 7998*16kB (UME) 1401*32kB (UE) 6*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 723252kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 10 15:17:01 <redacted hostname> kernel: 24866489 total pagecache pages
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages in swap cache
Oct 10 15:17:01 <redacted hostname> kernel: Swap cache stats: add 0, delete 0, find 0/0
Oct 10 15:17:01 <redacted hostname> kernel: Free swap = 0kB
Oct 10 15:17:01 <redacted hostname> kernel: Total swap = 0kB
Oct 10 15:17:01 <redacted hostname> kernel: 200973631 pages RAM
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages HighMem/MovableOnly
Oct 10 15:17:01 <redacted hostname> kernel: 3165617 pages reserved
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages hwpoisoned
Oct 10 15:17:01 <redacted hostname> kernel: Tasks state (memory values in pages):
Oct 10 15:17:01 <redacted hostname> kernel: [ 2414] 0 2414 33478 20111 315392 0 0 systemd-journal
Oct 10 15:17:01 <redacted hostname> kernel: [ 2438] 0 2438 31851 540 143360 0 0 lvmetad
Oct 10 15:17:01 <redacted hostname> kernel: [ 2453] 0 2453 12284 1141 131072 0 -1000 systemd-udevd
Oct 10 15:17:01 <redacted hostname> kernel: [ 4170] 0 4170 13885 446 131072 0 -1000 auditd
Oct 10 15:17:01 <redacted hostname> kernel: [ 4393] 0 4393 5484 526 86016 0 0 irqbalance
Oct 10 15:17:01 <redacted hostname> kernel: [ 4394] 0 4394 6623 624 102400 0 0 systemd-logind
…
…
Oct 10 15:17:01 <redacted hostname> kernel: oom- kill:constraint=CONSTRAINT_MEMORY_POLICY,nodemask=0,cpuset=vcpu0,mems_allowed=0,global_oom,task_memcg=/machine.slice/machine- qemu\x2d237\x2dinstance\x2d0000fda8.scope,task=qemu-kvm,pid=25496,uid=42436
Oct 10 15:17:01 <redacted hostname> kernel: Out of memory: Killed process 25496 (qemu-kvm) total-vm:67989512kB, anon-rss:66780940kB, file- rss:11052kB, shmem-rss:4kB
Oct 10 15:17:02 <redacted hostname> kernel: oom_reaper: reaped process 25496 (qemu-kvm), now anon-rss:0kB, file-rss:36kB, shmem-rss:4kB
Sean,

Awesome write-up! It would be great to have this explanation on the official website here: https://docs.openstack.org/nova/pike/admin/cpu-topologies.html

~S
On Sat, 2020-10-17 at 04:04 +0000, Erik Olof Gunnar Andersson wrote:
> We have been running with NUMA configured for a long time and don't believe I have seen this behavior. It's important that you configure the flavors / aggregates correctly.
> I think this might be what you are looking for:
> openstack flavor set m1.large --property hw:cpu_policy=dedicated
> https://docs.openstack.org/nova/pike/admin/cpu-topologies.html

No, this is not what is needed; that enables CPU pinning, not NUMA-aware memory allocation. It does implicitly create a NUMA topology of 1 NUMA node if you do not override that default by setting hw:numa_nodes.

> Pretty sure we also set this for any flavor that only requires a single NUMA zone:
> openstack flavor set m1.large --property hw:numa_nodes=1

This is how you specify the number of guest NUMA nodes, yes, but hw:mem_page_size is what is missing to enable NUMA-aware memory tracking. Without hw:mem_page_size (or the image equivalent, hw_mem_page_size) Nova will use the globally free memory on the host when determining if it can boot a VM. This has been known behavior since NUMA was introduced, as a result of the design choice to make NUMA affinity opt-in. As a result of customers misconfiguring their systems lately, I had filed https://bugs.launchpad.net/nova/+bug/1893121 to address some of this behavior, but it was determined to be a feature since we discussed this when first adding NUMA and declared it out of scope.

One change I have brought up in the past, and might raise at the PTG again, is the idea of defaulting to hw:mem_page_size=any if you have a NUMA VM and don't otherwise set it. That would stop the behavior being described here, but it would mean you cannot do memory oversubscription with NUMA guests. I have long held the view that using NUMA affinity and memory oversubscription are mutually exclusive and we should just default to making this work out of the box for people, but that is why we have not made this change the last 3-4 times I have raised it at the PTG/design summit. This is a change we will have to make if we track NUMA in Placement in any case.
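(For reference, a pinned-CPU flavor that also enables per-NUMA-node memory tracking, per the explanation above, might look like this; the flavor name is only an example:)

  openstack flavor set m1.large.pinned \
    --property hw:cpu_policy=dedicated \
    --property hw:mem_page_size=small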
On Fri, 2020-10-16 at 22:47 -0500, Eric K. Miller wrote:
> I thought the same, but it appears that VMs are never scheduled on NUMA1 even though NUMA0 is full (causing OOM to trigger and kill running VMs). I would have hoped that a NUMA node was treated like a host, and thus "VMs being balanced across nodes".

hw:numa_nodes=1 does not enable per-NUMA-node memory tracking. To resolve your OOM issue you need to set hw:mem_page_size=small or hw:mem_page_size=any.

The reason that it is always selecting NUMA 0 is that Nova is taking the list of host NUMA nodes and checking each one using itertools.permutations. That always checks the NUMA nodes in a stable order, starting with NUMA node 0. Since you have just set hw:numa_nodes=1 without requesting any NUMA-specific resources, e.g. memory or CPUs, NUMA node 0 will effectively always fit the VM.

When you set hw:numa_nodes=1 and nothing else, the scheduler will only reject a node if the number of CPUs on the NUMA node is less than the number the VM requests. It will not check the memory available on the NUMA node, since you did not ask Nova to do that via hw:mem_page_size. Effectively, if you are using any NUMA feature in Nova and do not set hw:mem_page_size, then your flavor is misconfigured, as it will not request NUMA-local memory tracking to be enabled.

> The discussion on NUMA handling is long, so I was hoping that there might be information about the latest solution to the problem - or to be told that there isn't a good solution other than using huge pages.

You do not need to use hugepages, but you do need to enable per-NUMA-node memory tracking with hw:mem_page_size=small (use non-hugepage, typically 4k, pages) or hw:mem_page_size=any, which is basically the same as small except that the image can request hugepages if it wants to. If you set small in the flavor but large in the image, that is an error. If you set any in the flavor, the image can set any value it likes, such as small or large or an explicit page size, and the scheduler will honour that. If you know you want the flavor to use small pages, then you should just set small explicitly.
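(For reference, the flavor/image combination described above might be set up roughly like this; the flavor and image names are only examples:)

  openstack flavor set m1.medium.numa --property hw:mem_page_size=any
  openstack image set my-hugepage-image --property hw_mem_page_size=large

With "any" in the flavor, the image may request large pages; with "small" in the flavor, a "large" request in the image would be an error, as described above.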
> hw:numa_nodes=1 does not enable per-NUMA-node memory tracking. To resolve your OOM issue you need to set hw:mem_page_size=small or hw:mem_page_size=any.

Ah! That's what I was looking for! :)  Thank you Sean!

> the reason that it is always selecting NUMA 0 is that Nova is taking the list of host NUMA nodes and checking each one using itertools.permutations [...] NUMA node 0 will effectively always fit the VM

Makes sense.

> it will not check the memory available on the NUMA node, since you did not ask Nova to do that via hw:mem_page_size [...] if you are using any NUMA feature in Nova and do not set hw:mem_page_size, then your flavor is misconfigured

Good to know.

So it sounds like by setting the hw:mem_page_size parameter (probably best to choose "small" as a general default), NUMA node 0 will fill up, and then NUMA node 1 will be considered. In other words, VMs will NOT be provisioned in a "round-robin" fashion between NUMA nodes. Do I understand that correctly?

> you do not need to use hugepages, but you do need to enable per-NUMA-node memory tracking with hw:mem_page_size=small [...] if you know you want the flavor to use small pages, then you should just set small explicitly

Also good to know. Thanks again!

Eric
On Mon, 2020-10-19 at 20:38 -0500, Eric K. Miller wrote:
> So it sounds like by setting the hw:mem_page_size parameter (probably best to choose "small" as a general default), NUMA node 0 will fill up, and then NUMA node 1 will be considered. In other words, VMs will NOT be provisioned in a "round-robin" fashion between NUMA nodes. Do I understand that correctly?

Yes, you do. https://bugs.launchpad.net/nova/+bug/1893121 basically tracks this. I fundamentally believe this is a performance bug, not a feature, although others disagree. This is why we have always recommended you set hw:numa_nodes to the number of NUMA nodes on the host if you can. The exception to that is for workloads that don't support NUMA awareness, in which case you should only deviate from this advice if you measure a performance degradation.

With the default behavior you will saturate one NUMA node before using the other, which pessimises your memory bandwidth and CPU performance. tl;dr since all the VMs are being packed on the first NUMA node/socket, it will effectively leave the second socket idle and load up the first socket, causing the processor to not be able to turbo boost as aggressively as if you had spread the VMs between the NUMA nodes evenly. As you get to higher utilisation that is less of an issue, but it's a non-zero effect on a lightly utilised cloud, since we spread across hosts by default, and it is potentially less energy efficient as the thermal load will also not be spread between the CPUs.

> Also good to know. Thanks again!

I have added this topic to the PTG etherpad at https://etherpad.opendev.org/p/nova-wallaby-ptg, around line 170, as part of the NUMA in Placement section.
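(For reference, Sean's recommendation to match hw:numa_nodes to the host would translate, on these two-socket hosts, to something like the following; the flavor name is only an example:)

  openstack flavor set m1.medium.numa \
    --property hw:numa_nodes=2 \
    --property hw:mem_page_size=small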
participants (5)
- Eric K. Miller
- Erik Olof Gunnar Andersson
- Laurent Dumont
- Satish Patel
- Sean Mooney