Sorry to top post, but I was off on Friday.

The issue is that hw:mem_page_size has not been set. If you are using any NUMA feature you always need to set hw:mem_page_size. It does not matter which valid value you set it to, but you need to define it in the flavor or image. If you do not, then you have not activated per-NUMA-node memory tracking in nova and your VMs will eventually be killed by the OOM killer.

The minimum valid NUMA-aware VM to create is hw:mem_page_size=any. That implicitly expands to hw:mem_page_size=any hw:numa_nodes=1, since 1 NUMA node is the default if you do not set hw:numa_nodes.

Today, when we generate a NUMA topology for hw:cpu_policy=dedicated, we effectively set hw:numa_nodes=1 internally, but we do not define hw:mem_page_size=small/any/large. So if you simply define a flavor with hw:numa_nodes=1 or hw:cpu_policy=dedicated and no other extra specs, then technically that is an invalid flavor for the libvirt driver. hw:numa_nodes=1 is valid on its own for the hyperv driver, but not for the libvirt driver.

If you are using any NUMA feature with the libvirt driver, hw:mem_page_size in the flavor or hw_mem_page_size in the image must be set for nova to correctly track and allocate memory for the VM.

On Sat, 2020-10-17 at 13:44 -0400, Satish Patel wrote:
or "hw:numa_nodes=2" to see if vm vcpu spreads to both zones.
On Sat, Oct 17, 2020 at 1:41 PM Satish Patel <satish.txt@gmail.com> wrote:
I would say try without "hw:numa_nodes=1" in flavor properties.
~S
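As a rough illustration of the rule in the reply at the top of this thread (a NUMA feature in the flavor requires hw:mem_page_size in the flavor or hw_mem_page_size in the image when using the libvirt driver), here is a minimal sketch in Python. The function name missing_mem_page_size is hypothetical and is not part of nova, and it only looks at the two NUMA triggers named in this thread (hw:numa_nodes and hw:cpu_policy=dedicated), so treat it as a sanity check under those assumptions rather than a complete validator.

```python
from typing import Mapping, Optional


def missing_mem_page_size(
    extra_specs: Mapping[str, str],
    image_props: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True if a NUMA feature is requested without a page size.

    Hypothetical helper, not nova code. It only knows about the two
    triggers mentioned in this thread.
    """
    image_props = image_props or {}

    # A NUMA topology is requested explicitly (hw:numa_nodes) or
    # implicitly (hw:cpu_policy=dedicated).
    uses_numa = (
        "hw:numa_nodes" in extra_specs
        or extra_specs.get("hw:cpu_policy") == "dedicated"
    )

    # The page size can come from the flavor or from the image.
    has_page_size = (
        "hw:mem_page_size" in extra_specs
        or "hw_mem_page_size" in image_props
    )
    return uses_numa and not has_page_size


print(missing_mem_page_size({"hw:numa_nodes": "1"}))
print(missing_mem_page_size({"hw:numa_nodes": "1", "hw:mem_page_size": "any"}))
```

The first call reports a problem (hw:numa_nodes=1 on its own), the second does not, because hw:mem_page_size=any is present.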
On Sat, Oct 17, 2020 at 1:28 PM Eric K. Miller <emiller@genesishosting.com> wrote:
What is the error thrown by OpenStack when NUMA0 is full?
OOM is actually killing the QEMU process, which causes Nova to report:
/var/log/kolla/nova/nova-compute.log.4:2020-08-25 12:31:19.812 6 WARNING nova.compute.manager [req-62bddc53-ca8b-4bdc-bf41-8690fc88076f - - - - -] [instance: 8d8a262a-6e60-4e8a-97f9-14462f09b9e5] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 4
So, there isn't a NUMA or memory-specific error from Nova - Nova is simply scheduling a VM on a node that it thinks has enough memory, and Libvirt (or Nova?) is configuring the VM to use CPU cores on a full NUMA node.
NUMA Node 1 had about 240GiB of free memory with about 100GiB of buffer/cache space used, so plenty of free memory, whereas NUMA Node 0 was pretty tight on free memory.
These are some logs in /var/log/messages (not for the nova-compute.log entry above, but the same condition for a VM that was killed - logs were rolled, so I had to pick a different VM):
Oct 10 15:17:01 <redacted hostname> kernel: CPU 0/KVM invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
Oct 10 15:17:01 <redacted hostname> kernel: CPU: 15 PID: 30468 Comm: CPU 0/KVM Not tainted 5.3.8-1.el7.elrepo.x86_64 #1
Oct 10 15:17:01 <redacted hostname> kernel: Hardware name: <redacted hardware>
Oct 10 15:17:01 <redacted hostname> kernel: Call Trace:
Oct 10 15:17:01 <redacted hostname> kernel: dump_stack+0x63/0x88
Oct 10 15:17:01 <redacted hostname> kernel: dump_header+0x51/0x210
Oct 10 15:17:01 <redacted hostname> kernel: oom_kill_process+0x105/0x130
Oct 10 15:17:01 <redacted hostname> kernel: out_of_memory+0x105/0x4c0
…
…
Oct 10 15:17:01 <redacted hostname> kernel: active_anon:108933472 inactive_anon:174036 isolated_anon:0#012 active_file:21875969 inactive_file:2418794 isolated_file:32#012 unevictable:88113 dirty:0 writeback:4 unstable:0#012 slab_reclaimable:3056118 slab_unreclaimable:432301#012 mapped:71768 shmem:570159 pagetables:258264 bounce:0#012 free:58924792 free_pcp:326 free_cma:0
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 active_anon:382548916kB inactive_anon:173052kB active_file:0kB inactive_file:2272kB unevictable:289840kB isolated(anon):0kB isolated(file):128kB mapped:16696kB dirty:0kB writeback:0kB shmem:578812kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 286420992kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA free:15880kB min:0kB low:12kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15880kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 1589 385604 385604 385604
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA32 free:1535904kB min:180kB low:1780kB high:3380kB active_anon:90448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1717888kB managed:1627512kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:1008kB local_pcp:248kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 0 384015 384015 384015
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 Normal free:720756kB min:818928kB low:1212156kB high:1605384kB active_anon:382458300kB inactive_anon:173052kB active_file:0kB inactive_file:2272kB unevictable:289840kB writepending:0kB present:399507456kB managed:393231952kB mlocked:289840kB kernel_stack:58344kB pagetables:889796kB bounce:0kB free_pcp:296kB local_pcp:0kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 0 0 0 0
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA32: 1*4kB (U) 1*8kB (M) 0*16kB 9*32kB (UM) 11*64kB (UM) 12*128kB (UM) 12*256kB (UM) 11*512kB (UM) 11*1024kB (M) 1*2048kB (U) 369*4096kB (M) = 1535980kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 Normal: 76633*4kB (UME) 30442*8kB (UME) 7998*16kB (UME) 1401*32kB (UE) 6*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 723252kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 10 15:17:01 <redacted hostname> kernel: 24866489 total pagecache pages
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages in swap cache
Oct 10 15:17:01 <redacted hostname> kernel: Swap cache stats: add 0, delete 0, find 0/0
Oct 10 15:17:01 <redacted hostname> kernel: Free swap = 0kB
Oct 10 15:17:01 <redacted hostname> kernel: Total swap = 0kB
Oct 10 15:17:01 <redacted hostname> kernel: 200973631 pages RAM
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages HighMem/MovableOnly
Oct 10 15:17:01 <redacted hostname> kernel: 3165617 pages reserved
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages hwpoisoned
Oct 10 15:17:01 <redacted hostname> kernel: Tasks state (memory values in pages):
Oct 10 15:17:01 <redacted hostname> kernel: [ 2414] 0 2414 33478 20111 315392 0 0 systemd-journal
Oct 10 15:17:01 <redacted hostname> kernel: [ 2438] 0 2438 31851 540 143360 0 0 lvmetad
Oct 10 15:17:01 <redacted hostname> kernel: [ 2453] 0 2453 12284 1141 131072 0 -1000 systemd-udevd
Oct 10 15:17:01 <redacted hostname> kernel: [ 4170] 0 4170 13885 446 131072 0 -1000 auditd
Oct 10 15:17:01 <redacted hostname> kernel: [ 4393] 0 4393 5484 526 86016 0 0 irqbalance
Oct 10 15:17:01 <redacted hostname> kernel: [ 4394] 0 4394 6623 624 102400 0 0 systemd-logind
…
…
Oct 10 15:17:01 <redacted hostname> kernel: oom-kill:constraint=CONSTRAINT_MEMORY_POLICY,nodemask=0,cpuset=vcpu0,mems_allowed=0,global_oom,task_memcg=/machine.slice/machine-qemu\x2d237\x2dinstance\x2d0000fda8.scope,task=qemu-kvm,pid=25496,uid=42436
Oct 10 15:17:01 <redacted hostname> kernel: Out of memory: Killed process 25496 (qemu-kvm) total-vm:67989512kB, anon-rss:66780940kB, file-rss:11052kB, shmem-rss:4kB
Oct 10 15:17:02 <redacted hostname> kernel: oom_reaper: reaped process 25496 (qemu-kvm), now anon-rss:0kB, file-rss:36kB, shmem-rss:4kB
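The oom-kill line in the log above is the one that shows the kill was confined to NUMA node 0 (constraint=CONSTRAINT_MEMORY_POLICY, nodemask=0). For anyone who wants to pull those fields out of many such lines, a minimal Python sketch follows. The function name parse_oom_kill is hypothetical, and the plain comma split assumes that no value contains a comma, which holds for this line but would not hold for a multi-entry nodemask list.

```python
from typing import Dict


def parse_oom_kill(line: str) -> Dict[str, str]:
    """Split the fields of a kernel 'oom-kill:' log line into a dict.

    Hypothetical helper. Bare flags such as 'global_oom' are stored
    with an empty string as their value.
    """
    _, marker, payload = line.partition("oom-kill:")
    if not marker:
        # Not an oom-kill summary line.
        return {}

    fields: Dict[str, str] = {}
    for item in payload.strip().split(","):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else ""
    return fields


# Sample taken from the oom-kill line in the log above.
sample = (
    r"kernel: oom-kill:constraint=CONSTRAINT_MEMORY_POLICY,nodemask=0,"
    r"cpuset=vcpu0,mems_allowed=0,global_oom,"
    r"task_memcg=/machine.slice/machine-qemu\x2d237\x2dinstance\x2d0000fda8.scope,"
    r"task=qemu-kvm,pid=25496,uid=42436"
)

info = parse_oom_kill(sample)
print(info["constraint"], info["nodemask"], info["task"], info["pid"])
```

A constraint of CONSTRAINT_MEMORY_POLICY together with a single-node nodemask is the hint that the allocation was restricted to that node, rather than the whole host being out of memory.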