[openstack-dev] [nova] NUMA, huge pages, and scheduling

Paul Michali pc at michali.net
Fri Jun 10 12:58:29 UTC 2016


I'll try to reproduce and collect logs for a bug report.

Thanks for the info.

PCM


On Thu, Jun 9, 2016 at 9:43 AM Matt Riedemann <mriedem at linux.vnet.ibm.com>
wrote:

> On 6/9/2016 6:15 AM, Paul Michali wrote:
> >
> >
> > On Wed, Jun 8, 2016 at 11:21 PM Chris Friesen
> > <chris.friesen at windriver.com> wrote:
> >
> >     On 06/03/2016 12:03 PM, Paul Michali wrote:
> >     > Thanks for the link Tim!
> >     >
> >     > Right now, I have two things I'm unsure about...
> >     >
> >     > One is that I had 1945 huge pages left (of size 2048k) and tried
> >     to create a VM
> >     > with a small flavor (2GB), which should need 1024 pages, but Nova
> >     indicated that
> >     > it wasn't able to find a host (and QEMU reported an allocation
> >     > issue).
> >     >
> >     > The other is that VMs are not being evenly distributed on my two
> >     NUMA nodes, and
> >     > instead, are getting created all on one NUMA node. Not sure if
> >     that is expected
> >     > (and setting mem_page_size to 2048 is the proper way).
> >
> >
> >     Just in case you haven't figured out the problem...
> >
> >     Have you checked the per-host-numa-node 2MB huge page availability
> >     on your host?
> >       If it's uneven then that might explain what you're seeing.
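
Side note, in case it helps: per-NUMA-node huge page availability can be
checked on the host via sysfs, something like the following (assuming 2MB
pages on nodes 0 and 1):

    # per-node HugePages_Total / HugePages_Free counters
    grep -i huge /sys/devices/system/node/node*/meminfo
    # free 2MB pages on each node
    cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
    cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
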
> >
> >
> > These are the observations/questions I have:
> >
> > 1) On the host, I was seeing 32768 huge pages of 2MB size. When I
> > created VMs (Cirros) using the small flavor, each VM was getting
> > created on NUMA node 0. Once half of the available pages were consumed,
> > I could no longer create any VMs (QEMU reported no space). I'd like to
> > understand why the assignment always went to node 0, and to confirm
> > that the huge pages are divided among the available NUMA nodes.
> >
> > 2) I changed mem_page_size from 1024 to 2048 in the flavor, and the
> > VMs created after that were evenly assigned to the two NUMA nodes,
> > each VM using 1024 huge pages. At this point I could get past the
> > halfway mark, but when there were 1945 pages left, creating a VM
> > failed. Did it fail because mem_page_size was 2048 and only 1945 pages
> > were available, even though we were only requesting 1024 pages?
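
For reference, the page size is controlled by the flavor extra spec
hw:mem_page_size; a bare numeric value is interpreted as KiB, so 2048
corresponds to the 2MB pages discussed here. Setting it looks something
like this (the flavor name is just an example):

    nova flavor-key m1.small set hw:mem_page_size=2048
    # or, with the unified CLI:
    openstack flavor set --property hw:mem_page_size=2048 m1.small
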
> >
> > 3) Related to #2, is there a relationship between mem_page_size, the
> > allocation of VMs to NUMA nodes, and the flavor size? IOW, if I use
> > the medium flavor (4GB), will I need a larger mem_page_size? (I'll
> > play with this variation as soon as I can.) This gets back to
> > understanding how the scheduler decides where to place the VMs.
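
Rough arithmetic for that case, assuming the guest is confined to a single
NUMA node: a 4GB flavor with 2MB pages needs 4096MB / 2MB = 2048 huge pages
from that node's pool. The page size itself doesn't have to grow with the
flavor; only the number of pages consumed does.
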
> >
> > 4) When the VM create failed due to QEMU failing allocation, the VM went
> > to error state. I deleted the VM, but the neutron port was still there,
> > and there were no log messages indicating that a request was made to
> > delete the port. Is this expected (that the user would have to manually
> > clean up the port)?
>
> When you hit this case, can you check whether instance.host is set in
> the database before deleting the instance? I'm guessing the instance
> never got assigned a host since it eventually ended up with
> NoValidHost. In that case, when you go to delete it there's no compute
> to send the delete to, so it's deleted from the compute API, and we
> don't have the host binding details needed to delete the port.
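
For reference, one way to check whether instance.host got set is via the
admin view of the server, or directly in the nova database; something like:

    # admin view includes the OS-EXT-SRV-ATTR:host field
    nova show <instance-uuid> | grep -i host
    # or query the nova DB directly
    mysql -e "SELECT host, node FROM instances WHERE uuid='<instance-uuid>';" nova
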
>
> Although, when the spawn failed in the compute to begin with, we should
> have deallocated any networking that was created before kicking back to
> the scheduler - unless we don't go back to the scheduler when the
> instance is set to ERROR state.
>
> A bug report with a stacktrace of the failure scenario when the
> instance goes to ERROR state, plus the n-cpu logs, would probably help.
>
> >
> > 5) A coworker had hit the problem mentioned in #1, with exhaustion at
> > the halfway point. If she deletes a VM and then changes the flavor's
> > mem_page_size to 2048, will Nova start assigning all new VMs to the
> > other NUMA node until its free huge page count drops to match NUMA
> > node 0's, or will it alternate between the available NUMA nodes (and
> > run out when node 0's pool is exhausted)?
> >
> > Thanks in advance!
> >
> > PCM
> >
> >
> >
> >
> >     Chris
> >
> >
>
>
> --
>
> Thanks,
>
> Matt Riedemann
>
>