[nova] NUMA scheduling
Eric K. Miller
emiller at genesishosting.com
Sat Oct 17 17:18:43 UTC 2020
> What is the error thrown by Openstack when NUMA0 is full?
The kernel's OOM killer is actually killing the QEMU process, which causes Nova to report:
/var/log/kolla/nova/nova-compute.log.4:2020-08-25 12:31:19.812 6 WARNING nova.compute.manager [req-62bddc53-ca8b-4bdc-bf41-8690fc88076f - - - - -] [instance: 8d8a262a-6e60-4e8a-97f9-14462f09b9e5] Instance shutdown by itself. Calling the stop API. Current vm_state: active, current task_state: None, original DB power_state: 1, current VM power_state: 4
So there isn't a NUMA- or memory-specific error from Nova. Nova is simply scheduling the VM onto a compute host that it thinks has enough memory overall, and Libvirt (or Nova?) is configuring the VM to use CPU cores - and, through the memory policy, memory allocations - on a NUMA node that is already full.
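For what it's worth, one way to see how a guest was actually placed is to dump the running domain's XML and look at the <numatune> and <cputune> elements. A minimal sketch using the libvirt Python bindings (assuming they are installed; the instance name is just the one from the oom-kill line further down, used for illustration):

import libvirt
import xml.etree.ElementTree as ET

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("instance-0000fda8")  # illustrative; name taken from the oom-kill log below
root = ET.fromstring(dom.XMLDesc(0))

# <numatune><memory mode='strict' nodeset='0'/></numatune> means guest RAM may
# only come from host node 0, no matter how much is free on node 1.
mem = root.find("numatune/memory")
if mem is not None:
    print("numatune: mode=%s nodeset=%s" % (mem.get("mode"), mem.get("nodeset")))

# <cputune><vcpupin .../> shows which host CPUs each vCPU is pinned to.
for pin in root.findall("cputune/vcpupin"):
    print("vcpu %s -> host cpus %s" % (pin.get("vcpu"), pin.get("cpuset")))

conn.close()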
NUMA Node 1 had about 240GiB of free memory with about 100GiB of buffer/cache space used, so plenty of free memory, whereas NUMA Node 0 was pretty tight on free memory.
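To see that imbalance per NUMA node (rather than host-wide), the per-node meminfo files in sysfs can be read directly. A minimal sketch, assuming a normal Linux host with sysfs mounted:

import glob
import re

# Each node exposes /sys/devices/system/node/nodeN/meminfo with lines like
# "Node 0 MemFree:  241234567 kB".
for path in sorted(glob.glob("/sys/devices/system/node/node*/meminfo")):
    with open(path) as f:
        text = f.read()
    node = re.search(r"Node (\d+)", text).group(1)
    total_kb = int(re.search(r"MemTotal:\s+(\d+) kB", text).group(1))
    free_kb = int(re.search(r"MemFree:\s+(\d+) kB", text).group(1))
    print("node %s: %.1f GiB free of %.1f GiB" % (node, free_kb / 2.0**20, total_kb / 2.0**20))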
These are some logs from /var/log/messages (not for the nova-compute.log entry above, but the same condition for a different VM that was killed - the logs had rolled, so I had to pick another VM):
Oct 10 15:17:01 <redacted hostname> kernel: CPU 0/KVM invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
Oct 10 15:17:01 <redacted hostname> kernel: CPU: 15 PID: 30468 Comm: CPU 0/KVM Not tainted 5.3.8-1.el7.elrepo.x86_64 #1
Oct 10 15:17:01 <redacted hostname> kernel: Hardware name: <redacted hardware>
Oct 10 15:17:01 <redacted hostname> kernel: Call Trace:
Oct 10 15:17:01 <redacted hostname> kernel: dump_stack+0x63/0x88
Oct 10 15:17:01 <redacted hostname> kernel: dump_header+0x51/0x210
Oct 10 15:17:01 <redacted hostname> kernel: oom_kill_process+0x105/0x130
Oct 10 15:17:01 <redacted hostname> kernel: out_of_memory+0x105/0x4c0
…
…
Oct 10 15:17:01 <redacted hostname> kernel: active_anon:108933472 inactive_anon:174036 isolated_anon:0#012 active_file:21875969 inactive_file:2418794 isolated_file:32#012 unevictable:88113 dirty:0 writeback:4 unstable:0#012 slab_reclaimable:3056118 slab_unreclaimable:432301#012 mapped:71768 shmem:570159 pagetables:258264 bounce:0#012 free:58924792 free_pcp:326 free_cma:0
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 active_anon:382548916kB inactive_anon:173052kB active_file:0kB inactive_file:2272kB unevictable:289840kB isolated(anon):0kB isolated(file):128kB mapped:16696kB dirty:0kB writeback:0kB shmem:578812kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 286420992kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA free:15880kB min:0kB low:12kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15880kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 1589 385604 385604 385604
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA32 free:1535904kB min:180kB low:1780kB high:3380kB active_anon:90448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1717888kB managed:1627512kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:1008kB local_pcp:248kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 0 384015 384015 384015
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 Normal free:720756kB min:818928kB low:1212156kB high:1605384kB active_anon:382458300kB inactive_anon:173052kB active_file:0kB inactive_file:2272kB unevictable:289840kB writepending:0kB present:399507456kB managed:393231952kB mlocked:289840kB kernel_stack:58344kB pagetables:889796kB bounce:0kB free_pcp:296kB local_pcp:0kB free_cma:0kB
Oct 10 15:17:01 <redacted hostname> kernel: lowmem_reserve[]: 0 0 0 0 0
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 DMA32: 1*4kB (U) 1*8kB (M) 0*16kB 9*32kB (UM) 11*64kB (UM) 12*128kB (UM) 12*256kB (UM) 11*512kB (UM) 11*1024kB (M) 1*2048kB (U) 369*4096kB (M) = 1535980kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 Normal: 76633*4kB (UME) 30442*8kB (UME) 7998*16kB (UME) 1401*32kB (UE) 6*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 723252kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Oct 10 15:17:01 <redacted hostname> kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Oct 10 15:17:01 <redacted hostname> kernel: 24866489 total pagecache pages
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages in swap cache
Oct 10 15:17:01 <redacted hostname> kernel: Swap cache stats: add 0, delete 0, find 0/0
Oct 10 15:17:01 <redacted hostname> kernel: Free swap = 0kB
Oct 10 15:17:01 <redacted hostname> kernel: Total swap = 0kB
Oct 10 15:17:01 <redacted hostname> kernel: 200973631 pages RAM
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages HighMem/MovableOnly
Oct 10 15:17:01 <redacted hostname> kernel: 3165617 pages reserved
Oct 10 15:17:01 <redacted hostname> kernel: 0 pages hwpoisoned
Oct 10 15:17:01 <redacted hostname> kernel: Tasks state (memory values in pages):
Oct 10 15:17:01 <redacted hostname> kernel: [ 2414] 0 2414 33478 20111 315392 0 0 systemd-journal
Oct 10 15:17:01 <redacted hostname> kernel: [ 2438] 0 2438 31851 540 143360 0 0 lvmetad
Oct 10 15:17:01 <redacted hostname> kernel: [ 2453] 0 2453 12284 1141 131072 0 -1000 systemd-udevd
Oct 10 15:17:01 <redacted hostname> kernel: [ 4170] 0 4170 13885 446 131072 0 -1000 auditd
Oct 10 15:17:01 <redacted hostname> kernel: [ 4393] 0 4393 5484 526 86016 0 0 irqbalance
Oct 10 15:17:01 <redacted hostname> kernel: [ 4394] 0 4394 6623 624 102400 0 0 systemd-logind
…
…
Oct 10 15:17:01 <redacted hostname> kernel: oom-kill:constraint=CONSTRAINT_MEMORY_POLICY,nodemask=0,cpuset=vcpu0,mems_allowed=0,global_oom,task_memcg=/machine.slice/machine-qemu\x2d237\x2dinstance\x2d0000fda8.scope,task=qemu-kvm,pid=25496,uid=42436
Oct 10 15:17:01 <redacted hostname> kernel: Out of memory: Killed process 25496 (qemu-kvm) total-vm:67989512kB, anon-rss:66780940kB, file-rss:11052kB, shmem-rss:4kB
Oct 10 15:17:02 <redacted hostname> kernel: oom_reaper: reaped process 25496 (qemu-kvm), now anon-rss:0kB, file-rss:36kB, shmem-rss:4kB
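The oom-kill line above (constraint=CONSTRAINT_MEMORY_POLICY, nodemask=0, mems_allowed=0) shows that the qemu-kvm process was only allowed to allocate memory from node 0, which is why it was killed even though node 1 had plenty free. The same restriction can be checked at runtime from /proc - a minimal sketch (the PID is just the one from the log above, for illustration):

# Show which host CPUs and NUMA nodes the kernel allows a process to use,
# from the Cpus_allowed_list / Mems_allowed_list fields in /proc/<pid>/status.
pid = 25496  # qemu-kvm PID from the OOM log above, for illustration only

with open("/proc/%d/status" % pid) as f:
    for line in f:
        if line.startswith(("Cpus_allowed_list", "Mems_allowed_list")):
            print(line.strip())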