[nova] Mempage fun

Stephen Finucane sfinucan at redhat.com
Tue Jan 8 18:38:49 UTC 2019

On Tue, 2019-01-08 at 08:54 +0000, Balázs Gibizer wrote:
> On Mon, Jan 7, 2019 at 6:32 PM, Stephen Finucane <sfinucan at redhat.com> wrote:
> > We've been looking at a patch that landed some months ago and have
> > spotted some issues:
> > 
> > https://review.openstack.org/#/c/532168
> > 
> > In summary, that patch is intended to make the memory check for
> > instances memory pagesize aware. The logic it introduces looks
> > something like this:
> > 
> >    If the instance requests a specific pagesize
> >       (#1) Check if each host cell can provide enough memory of the
> >       pagesize requested for each instance cell
> >    Otherwise
> >       If the host has hugepages
> >          (#2) Check if each host cell can provide enough memory of the
> >          smallest pagesize available on the host for each instance cell
> >       Otherwise
> >          (#3) Check if each host cell can provide enough memory for
> >          each instance cell, ignoring pagesizes
> > 
> > This also has the side-effect of allowing instances with hugepages and
> > instances with a NUMA topology but no hugepages to co-exist on the same
> > host, because the latter will now be aware of hugepages and won't
> > consume them. However, there are a couple of issues with this:
> > 
> >    1. It breaks overcommit for instances without pagesize request
> >       running on hosts with different pagesizes. This is because we don't
> >       allow overcommit for hugepages, but case (#2) above means we are now
> >       reusing the same functions previously used for actual hugepage
> >       checks to check for regular 4k pages
> >    2. It doesn't fix the issue when non-NUMA instances exist on the same
> >       host as NUMA instances with hugepages. The non-NUMA instances don't
> >       run through any of the code above, meaning they're still not
> >       pagesize aware
> > 
> > We could probably fix issue (1) by modifying those hugepage functions
> > we're using to allow overcommit via a flag that we pass for case (#2).
> > We can mitigate issue (2) by advising operators to split hosts into
> > aggregates for 'hw:mem_page_size' set or unset (in addition to
> > 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but
> > I think this may be the case in some docs (sean-k-mooney said Intel
> > used to do this. I don't know about Red Hat's docs or upstream). In
> > addition, we did actually call that out in the original spec:
> > 
> > https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-driver-large-pages.html#other-deployer-impact
> > 
> > However, if we're doing that for non-NUMA instances, one would have to
> > question why the patch is necessary/acceptable for NUMA instances. For
> > what it's worth, a longer fix would be to start tracking hugepages in 
> > a non-NUMA aware way too but that's a lot more work and doesn't fix the
> > issue now.
> > 
> > As such, my question is this: should we look at fixing issue (1) and
> > documenting issue (2), or should we revert the thing wholesale until
> > we work on a solution that could e.g. let us track hugepages via
> > placement and resolve issue (2) too?
> If you feel that fixing (1) is pretty simple then I suggest doing that
> and document the limitation of (2) while we think about a proper 
> solution.
> gibi
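
To make the quoted logic concrete, here is a rough sketch of the three
cases, with overcommit allowed for case (#2) but not for case (#1). It is
an illustration only and not nova's actual code: HostCell, cell_fits and
the way the allocation ratio is applied are all made up for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class HostCell:
    """Hypothetical stand-in for one host NUMA cell.

    Memory is tracked in KiB and keyed by page size in KiB,
    e.g. {4: ..., 2048: ...}.
    """
    total_kb: Dict[int, int]
    used_kb: Dict[int, int] = field(default_factory=dict)

    def free_kb(self, pagesize_kb: int, ratio: float = 1.0) -> float:
        # ratio > 1.0 models overcommit for this page size.
        total = self.total_kb.get(pagesize_kb, 0)
        return total * ratio - self.used_kb.get(pagesize_kb, 0)

    def has_hugepages(self) -> bool:
        # Assumption: anything larger than 4k counts as a hugepage.
        return any(size > 4 for size in self.total_kb)


def cell_fits(host: HostCell, mem_kb: int,
              requested_pagesize_kb: Optional[int],
              allocation_ratio: float) -> bool:
    """Check whether one instance cell fits in one host cell."""
    if requested_pagesize_kb is not None:
        # Case (#1): explicit pagesize request, never overcommitted.
        return host.free_kb(requested_pagesize_kb) >= mem_kb
    if host.has_hugepages():
        # Case (#2): use the smallest pagesize on the host, but keep
        # overcommit since these are regular pages, not hugepages.
        smallest = min(host.total_kb)
        return host.free_kb(smallest, allocation_ratio) >= mem_kb
    # Case (#3): no hugepages on the host, ignore pagesizes entirely.
    total = sum(host.total_kb.values())
    used = sum(host.used_kb.values())
    return total * allocation_ratio - used >= mem_kb
```

With a ratio of 1.0 in case (#2) this behaves like the strict hugepage
check, which is the overcommit breakage described in issue (1).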

I have (1) fixed here:


That said, I'm not sure if it's the best thing to do. From what I'm
hearing, it seems the advice we should be giving is to not mix
instances with/without NUMA topologies, with/without hugepages and
with/without CPU pinning. We've only documented the latter, as
discussed on this related bug by cfriesen:


Given that we should be advising folks not to mix these (something I
wasn't aware of until now), what does the original patch actually give
us? If you're not mixing instances with/without hugepages, then the
only use case it would fix is booting an instance with a NUMA
topology but no hugepages on a host that had hugepages (because the
instance would be limited to CPUs and memory from one NUMA node, but
it's conceivable all available memory could be on another NUMA node).
That seems like a very esoteric use case that might be better solved by
perhaps making the reserved memory configuration option optionally NUMA
specific. This would allow us to mark this hugepage memory, which is
clearly not intended for consumption by nova (remember: this host only
handles non-hugepage instances), as reserved on a per-node basis. I'm
not sure how we would map this to placement, though I'm sure it could
be figured out.
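
As a rough sketch of what a per-node reservation could look like (none of
these names exist in nova; the per-node dicts and both helper functions
are invented for illustration):

```python
from typing import Dict, List


def usable_mb_per_node(total_mb: Dict[int, int],
                       reserved_mb: Dict[int, int]) -> Dict[int, int]:
    """Memory that could be handed out on each NUMA node.

    reserved_mb is a hypothetical per-node reservation, keyed by NUMA
    node id. Nodes missing from it have nothing reserved.
    """
    return {
        node: max(total - reserved_mb.get(node, 0), 0)
        for node, total in total_mb.items()
    }


def nodes_that_fit(total_mb: Dict[int, int],
                   reserved_mb: Dict[int, int],
                   used_mb: Dict[int, int],
                   request_mb: int) -> List[int]:
    """Return the NUMA nodes that can still hold request_mb."""
    usable = usable_mb_per_node(total_mb, reserved_mb)
    return [
        node for node in sorted(usable)
        if usable[node] - used_mb.get(node, 0) >= request_mb
    ]
```

For example, with two 16384 MB nodes where node 0 is nearly full and
12288 MB of node 1 is set aside as hugepages, an 8192 MB request fits
nowhere once the reservation is applied, whereas without it node 1 would
wrongly look like it had room.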

jaypipes is going to have so much fun mapping all this in placement :D

