issues creating a second vm with numa affinity

Kashyap Chamarthy kchamart at redhat.com
Fri Sep 27 14:14:04 UTC 2019


On Fri, Sep 27, 2019 at 07:15:47AM +0000, Manuel Sopena Ballesteros wrote:
> Dear Openstack user community,

Hi, Manuel,

> I have a compute node with 2 numa nodes and I would like to create 2
> vms, each one using a different numa node through numa affinity with
> cpu, memory and nvme pci devices.

[...]

> 2019-09-27 16:45:19.785 7 ERROR nova.compute.manager
> [req-b5a25c73-8c7d-466c-8128-71f29e7ae8aa
> 91e83343e9834c8ba0172ff369c8acac b91520cff5bd45c59a8de07c38641582 -
> default default] [instance: ebe4e78c-501e-4535-ae15-948301cbf1ae]
> Instance failed to spawn: libvirtError: internal error: qemu
> unexpectedly closed the monitor: 2019-09-27T06:45:19.118089Z qemu-kvm:
> kvm_init_vcpu failed: Cannot allocate memory

[...]

This is a known issue.  (Eerily enough, I've been debugging this issue
for the last couple of days.)

tl;dr - Using Linux kernel 4.19 or above (which contains the commit
        below) should fix this.  If using a 4.19 kernel is not possible,
        ask your Linux vendor to backport this small fix:
        https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ee6268ba3a68

        It's an absolutely valid request.

It would be great if you could confirm.

(Also, can you please file a formal Nova bug here? --
https://bugs.launchpad.net/nova)

Long (and complex) story
------------------------

[The root cause is a complex interaction between libvirt, QEMU/KVM,
CGroups, and the kernel.  I myself don't understand some of the CGroups
interaction.]

Today, Nova hard-codes the 'strict' memory allocation mode (there's no
way to configure it in Nova) when tuning NUMA config:

    <numatune>
      <memory mode='strict' nodeset='1'/>
    </numatune>

Where 'strict' means libvirt must prevent QEMU/KVM from allocating
memory from all other nodes _except_ node-1.  The consequence is that
when QEMU initializes, KVM needs to allocate some memory from the
"DMA32" zone (one of the "zones" into which the kernel divides system
memory).  If that DMA32 zone is _not_ present on node-1, the memory
allocation fails and in turn the VM fails to start with:
"kvm_init_vcpu failed: Cannot allocate memory".

            - - -

So, if you cannot use an upstream 4.19+ kernel (and your vendor-specific
kernel doesn't have the backported fix), then an alternative is to make
Nova use the 'preferred' mode, which relaxes the 'strict' + "DMA32 zone
must be present" requirement.  See the WIP patch here:

    https://review.opendev.org/#/c/684375/ -- "libvirt: Use the
    `preferred` memory allocation mode for NUMA"

Where 'preferred' means: disable NUMA affinity; and turn the memory
allocation request into a "hint", i.e. "if possible, allocate from the
given node-1; otherwise, fall back to other NUMA nodes".
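
For illustration, with that change the generated guest XML would carry
something along these lines (a sketch of the intended result, not exact
Nova output; note that libvirt accepts only a single node in the nodeset
for 'preferred'):

    <numatune>
      <memory mode='preferred' nodeset='1'/>
    </numatune>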


Additional info
---------------

(*) For the kernel fix mentioned earlier, see the exact same problem
    reported here: https://lkml.org/lkml/2018/7/24/843 -- VM boot
    failure on nodes not having DMA32 zone.

(*) My investigation over the last two days uncovered a longer libvirt
    story here with regard to memory allocation and honoring NUMA
    config.  But I won't get into it here, for brevity's sake.  If
    you're interested, just ask and I can point you to the relevant
    libvirt Git history and mailing list posts.

[...]

> NOTE: this is to show that numa node/cell 1 has enough resources
> available (also nova-compute logs shows that kudu-4 is assigned to
> cell 1)

As you have guessed, the problem is _not_ that "there is not enough
memory", but that the guest's memory cannot be allocated from the
_correct_ NUMA node, i.e. one that has a "DMA32" region.

Can you also get the following (example commands sketched below):

  - The versions of your host kernel, libvirt and QEMU

  - The output of: `grep DMA /proc/zoneinfo`

    (I am almost certain that in your output only one of the two nodes
    has a "DMA32" region.)

[...]

> 
> What "emu-kvm: kvm_init_vcpu failed: Cannot allocate memory" means in
> this context?

Hope my earlier explanation answers it, even if it's not entirely
satisfactory :-)


-- 
/kashyap


