issues creating a second vm with numa affinity
Dear OpenStack user community,

I have a compute node with 2 NUMA nodes and I would like to create 2 VMs, each one using a different NUMA node through NUMA affinity for CPU, memory and NVMe PCI devices.

OpenStack flavor:

    openstack flavor create --public xlarge.numa.perf.test \
      --ram 200000 --disk 700 --vcpus 20 \
      --property hw:cpu_policy=dedicated \
      --property hw:emulator_threads_policy=isolate \
      --property hw:numa_nodes='1' \
      --property pci_passthrough:alias='nvme:4'

The first VM is created successfully:

    openstack server create --network hpc --flavor xlarge.numa.perf.test \
      --image centos7.6-image \
      --availability-zone nova:zeus-53.localdomain \
      --key-name mykey kudu-1

However, the second VM fails:

    openstack server create --network hpc --flavor xlarge.numa.perf \
      --image centos7.6-kudu-image \
      --availability-zone nova:zeus-53.localdomain \
      --key-name mykey kudu-4

Errors in the nova compute node:

    2019-09-27 16:45:19.785 7 ERROR nova.compute.manager [req-b5a25c73-8c7d-466c-8128-71f29e7ae8aa 91e83343e9834c8ba0172ff369c8acac b91520cff5bd45c59a8de07c38641582 - default default] [instance: ebe4e78c-501e-4535-ae15-948301cbf1ae] Instance failed to spawn: libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-09-27T06:45:19.118089Z qemu-kvm: kvm_init_vcpu failed: Cannot allocate memory
    2019-09-27 16:45:19.785 7 ERROR nova.compute.manager [instance: ebe4e78c-501e-4535-ae15-948301cbf1ae] libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-09-27T06:45:19.118089Z qemu-kvm: kvm_init_vcpu failed: Cannot allocate memory
    2019-09-27 16:45:19.785 7 ERROR nova.compute.manager [instance: ebe4e78c-501e-4535-ae15-948301cbf1ae]

NUMA cell/node 1 (the one assigned to kudu-4) has enough CPU, memory, PCI devices and disk capacity to fit this VM.

NOTE: below is the relevant information I could think of that shows the resources available after creating the second VM.
[root@zeus-53 ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 0 size: 262029 MB
node 0 free: 52787 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 1 size: 262144 MB
node 1 free: 250624 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

NOTE: this is to show that NUMA node/cell 1 has enough resources available (the nova-compute logs also show that kudu-4 is assigned to cell 1).

[root@zeus-53 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      3.7T  9.1G  3.7T   1% /
...

NOTE: VM disk files go to the root (/) partition.

[root@zeus-53 ~]# lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda             8:0    0 59.6G  0 disk
├─sda1          8:1    0    1G  0 part  /boot
└─sda2          8:2    0   16G  0 part  [SWAP]
nvme0n1       259:8    0  1.8T  0 disk
└─nvme0n1p1   259:9    0  1.8T  0 part
  └─md127       9:127  0  3.7T  0 raid0 /
nvme1n1       259:6    0  1.8T  0 disk
└─nvme1n1p1   259:7    0  1.8T  0 part
  └─md127       9:127  0  3.7T  0 raid0 /
nvme2n1       259:2    0  1.8T  0 disk
nvme3n1       259:1    0  1.8T  0 disk
nvme4n1       259:0    0  1.8T  0 disk
nvme5n1       259:3    0  1.8T  0 disk

NOTE: this is to show that there are 4 NVMe disks (nvme2n1, nvme3n1, nvme4n1, nvme5n1) available for the second VM.

What does "qemu-kvm: kvm_init_vcpu failed: Cannot allocate memory" mean in this context?

Thank you very much
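(Editorial sketch, not from the original post: the "node 1 has enough resources" claim above can be sanity-checked with a small script. The sample text mirrors the `numactl -H` output shown; the 200000 MB figure is the flavor's RAM request.)

```shell
# Compare per-node free memory (from `numactl -H`-style output) against
# the flavor's 200000 MB RAM request. On a real host, replace the sample
# text with: numactl -H | grep 'free:'
numactl_out="node 0 size: 262029 MB
node 0 free: 52787 MB
node 1 size: 262144 MB
node 1 free: 250624 MB"
need=200000

printf '%s\n' "$numactl_out" |
awk -v need="$need" '$3 == "free:" {
    print "node " $2 ": " ($4 + 0 >= need + 0 ? "fits" : "does not fit") " " need " MB"
}'
```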
On Fri, Sep 27, 2019 at 07:15:47AM +0000, Manuel Sopena Ballesteros wrote:
Dear Openstack user community,
Hi, Manuel,
I have a compute node with 2 numa nodes and I would like to create 2 vms, each one using a different numa node through numa affinity with cpu, memory and nvme pci devices.
[...]
2019-09-27 16:45:19.785 7 ERROR nova.compute.manager [req-b5a25c73-8c7d-466c-8128-71f29e7ae8aa 91e83343e9834c8ba0172ff369c8acac b91520cff5bd45c59a8de07c38641582 - default default] [instance: ebe4e78c-501e-4535-ae15-948301cbf1ae] Instance failed to spawn: libvirtError: internal error: qemu unexpectedly closed the monitor: 2019-09-27T06:45:19.118089Z qemu-kvm: kvm_init_vcpu failed: Cannot allocate memory
[...]

This is a known issue. (Eerily enough, I've been debugging this issue the last couple of days.)

tl;dr - Using Linux kernel 4.19 or above (with the below commit) should fix this. If using a 4.19 kernel is not possible, ask your Linux vendor to backport this small fix:

    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...

It's an absolutely valid request. It would be great if you can confirm. (Also, can you please file a formal Nova bug here? -- https://bugs.launchpad.net/nova)

Long (and complex) story
------------------------

[The root cause is a complex interaction between libvirt, QEMU/KVM, CGroups, and the kernel. I myself don't understand some of the CGroups interaction.]

Today, Nova hard-codes the 'strict' memory allocation mode (there's no way to configure it in Nova) when tuning NUMA config:

    <numatune>
      <memory mode='strict' nodeset='1'/>
    </numatune>

Where 'strict' means libvirt must prevent QEMU/KVM from allocating memory on all other nodes _except_ node-1.

The consequence of that is when QEMU initializes, KVM needs to allocate some memory from the "DMA32" zone (one of the "zones" into which the kernel divides system memory). If that DMA32 zone is _not_ present on node-1, then the memory allocation fails and in turn the VM fails to start with: "kvm_init_vcpu failed: Cannot allocate memory".

- - -

So, if you are stuck on a kernel older than 4.19 (or a vendor-specific kernel that doesn't have the backported fix), an alternative is to make Nova use the 'preferred' mode, which relaxes the 'strict' + "DMA32 zone must be present" requirement. See the WIP patch here:

    https://review.opendev.org/#/c/684375/ -- "libvirt: Use the `preferred` memory allocation mode for NUMA"

Where 'preferred' means: disable strict NUMA affinity and turn the memory allocation request into a "hint", i.e. "if possible, allocate from the given node-1; otherwise, fall back to other NUMA nodes".
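(Editorial note: with that patch, or an equivalent local change, the guest XML would differ from the 'strict' example above only in the `mode` attribute. A sketch of the relaxed variant, using libvirt's standard `numatune` syntax:)

```xml
<numatune>
  <!-- 'preferred' makes nodeset='1' a hint: allocate from node 1 if
       possible, otherwise fall back to other nodes -->
  <memory mode='preferred' nodeset='1'/>
</numatune>
```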
Additional info --------------- (*) For the kernel fix mentioned earlier, see the exact same problem reported here: https://lkml.org/lkml/2018/7/24/843 -- VM boot failure on nodes not having DMA32 zone. (*) My investigation over the last two days uncovered a longer libvirt story here with regards to memory allocation and honoring NUMA config. But I won't get into it here for brevity's sake. If you're interested, just ask, I can point to the relevant libvirt Git history and mailing list posts. [...]
NOTE: this is to show that numa node/cell 1 has enough resources available (also nova-compute logs shows that kudu-4 is assigned to cell 1)
As you have guessed, the problem is _not_ that "there is not enough memory", but that the guest's memory is not allocated on the _correct_ NUMA node, i.e. one with a "DMA32" region. Can you also get:

- The versions of your host kernel, libvirt and QEMU
- The output of: `grep DMA /proc/zoneinfo`

(I am almost certain that in your output only one of the two nodes has a "DMA32" region.)

[...]
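(Editorial sketch, not from the original post: that check can be scripted. The sample text below mimics `/proc/zoneinfo` headers on a machine where only node 0 has the DMA32 zone, which is the assumed layout; on a real host, pipe in `grep 'zone' /proc/zoneinfo` instead.)

```shell
# Flag NUMA nodes that lack a DMA32 zone -- the condition that makes
# mode='strict' guest memory allocation fail on those nodes.
zoneinfo="Node 0, zone      DMA
Node 0, zone    DMA32
Node 0, zone   Normal
Node 1, zone   Normal"

printf '%s\n' "$zoneinfo" |
awk '$3 == "zone" { n = $2; sub(/,/, "", n); zones[n] = zones[n] " " $4 }
     END { for (n in zones)
               if (zones[n] !~ /DMA32/)
                   print "node " n " has no DMA32 zone" }'
```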
What does "qemu-kvm: kvm_init_vcpu failed: Cannot allocate memory" mean in this context?
Hope my earlier explanation answers it, even if not entirely satisfactory :-) -- /kashyap
participants (2)
-
Kashyap Chamarthy
-
Manuel Sopena Ballesteros