[nova] Mempage fun

Sean Mooney smooney at redhat.com
Wed Jan 9 06:11:54 UTC 2019

On Tue, 2019-01-08 at 18:38 +0000, Stephen Finucane wrote:
> On Tue, 2019-01-08 at 08:54 +0000, Balázs Gibizer wrote:
> > On Mon, Jan 7, 2019 at 6:32 PM, Stephen Finucane <sfinucan at redhat.com> wrote:
> > > We've been looking at a patch that landed some months ago and have
> > > spotted some issues:
> > > 
> > > https://review.openstack.org/#/c/532168
> > > 
> > > In summary, that patch is intended to make the memory check for
> > > instances memory pagesize aware. The logic it introduces looks
> > > something like this:
> > > 
> > >    If the instance requests a specific pagesize
> > >       (#1) Check if each host cell can provide enough memory of the
> > >       pagesize requested for each instance cell
> > >    Otherwise
> > >       If the host has hugepages
> > >          (#2) Check if each host cell can provide enough memory of the
> > >          smallest pagesize available on the host for each instance cell
> > >       Otherwise
> > >          (#3) Check if each host cell can provide enough memory for
> > >          each instance cell, ignoring pagesizes
> > > 
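For anyone following along, the three branches above can be sketched roughly as below. This is only an illustration of the decision flow, not the code from the patch: `HostCell`, `cell_fits` and `SMALL_PAGE_KB` are made-up names, and "the host has hugepages" is assumed here to mean that the cell reports any page size larger than 4 KiB.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

SMALL_PAGE_KB = 4  # assumption: regular pages are 4 KiB


@dataclass
class HostCell:
    """Hypothetical stand-in for one host NUMA cell."""
    # page size in KiB -> number of free pages of that size
    free_pages: Dict[int, int] = field(default_factory=dict)
    # free memory used only when page sizes are ignored (case #3)
    free_memory_kb: int = 0


def cell_fits(cell: HostCell, wanted_kb: int,
              requested_pagesize_kb: Optional[int] = None) -> bool:
    """Return True if the host cell can back wanted_kb for one instance cell."""
    if requested_pagesize_kb is not None:
        # (#1) the instance asked for a specific page size
        pagesize_kb = requested_pagesize_kb
    elif any(size > SMALL_PAGE_KB for size in cell.free_pages):
        # (#2) no request, but the host has hugepages:
        # check against the smallest page size available
        pagesize_kb = min(cell.free_pages)
    else:
        # (#3) no request and no hugepages: ignore page sizes
        return cell.free_memory_kb >= wanted_kb
    return cell.free_pages.get(pagesize_kb, 0) * pagesize_kb >= wanted_kb
```

Note that in cases (#1) and (#2) the check is a strict page count with no allocation ratio applied, which is where the overcommit problem described next comes from.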
> > > This also has the side-effect of allowing instances with hugepages and
> > > instances with a NUMA topology but no hugepages to co-exist on the same
> > > host, because the latter will now be aware of hugepages and won't
> > > consume them. However, there are a couple of issues with this:
> > > 
> > >    1. It breaks overcommit for instances without pagesize request
> > >       running on hosts with different pagesizes. This is because we don't
> > >       allow overcommit for hugepages, but case (#2) above means we are now
> > >       reusing the same functions previously used for actual hugepage
> > >       checks to check for regular 4k pages
> > >    2. It doesn't fix the issue when non-NUMA instances exist on the same
> > >       host as NUMA instances with hugepages. The non-NUMA instances don't
> > >       run through any of the code above, meaning they're still not
> > >       pagesize aware
> > > 
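To make issue (1) concrete, here is a small illustration with made-up numbers. The helper and its `allow_overcommit` flag are hypothetical (the flag mirrors the fix floated in the next paragraph, it is not an existing nova parameter); `ram_allocation_ratio` is the usual nova option.

```python
def fits_small_pages(total_kb: int, used_kb: int, wanted_kb: int,
                     ram_allocation_ratio: float = 1.0,
                     allow_overcommit: bool = False) -> bool:
    """Check a 4k-page request against a host cell.

    With allow_overcommit=False this behaves like a hugepage check:
    the allocation ratio is ignored and memory is never oversubscribed.
    """
    limit_kb = total_kb * ram_allocation_ratio if allow_overcommit else total_kb
    return used_kb + wanted_kb <= limit_kb


GIB_KB = 1024 * 1024

# 16 GiB cell, 14 GiB already used, 4 GiB requested, ratio 1.5
strict = fits_small_pages(16 * GIB_KB, 14 * GIB_KB, 4 * GIB_KB,
                          ram_allocation_ratio=1.5)
relaxed = fits_small_pages(16 * GIB_KB, 14 * GIB_KB, 4 * GIB_KB,
                           ram_allocation_ratio=1.5, allow_overcommit=True)
```

Here `strict` is False (18 GiB does not fit in 16 GiB) while `relaxed` is True (18 GiB fits within 16 GiB * 1.5 = 24 GiB), so an instance that the operator's allocation ratio should admit is rejected when the hugepage-style check is reused for 4k pages.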
> > > We could probably fix issue (1) by modifying those hugepage functions
> > > we're using to allow overcommit via a flag that we pass for case (#2).
> > > We can mitigate issue (2) by advising operators to split hosts into
> > > aggregates for 'hw:mem_page_size' set or unset (in addition to
> > > 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but
> > > I think this may be the case in some docs (sean-k-mooney said Intel
> > > used to do this. I don't know about Red Hat's docs or upstream). In
> > > addition, we did actually call that out in the original spec:
> > > 
> > > 
> > > 
> > > However, if we're doing that for non-NUMA instances, one would have to
> > > question why the patch is necessary/acceptable for NUMA instances. For
> > > what it's worth, a longer fix would be to start tracking hugepages in 
> > > a non-NUMA aware way too but that's a lot more work and doesn't fix the
> > > issue now.
> > > 
> > > As such, my question is this: should we look at fixing issue (1) and
> > > documenting issue (2), or should we revert the thing wholesale until
> > > we work on a solution that could e.g. let us track hugepages via
> > > placement and resolve issue (2) too?
> > 
> > If you feel that fixing (1) is pretty simple then I suggest doing that
> > and documenting the limitation of (2) while we think about a proper
> > solution.
> > 
> > gibi
> I have (1) fixed here:
>   https://review.openstack.org/#/c/629281/
> That said, I'm not sure if it's the best thing to do. From what I'm
> hearing, it seems the advice we should be giving is to not mix
> instances with/without NUMA topologies, with/without hugepages and
it should be with and without hw:mem_page_size. guests with that set should not
be mixed with guests without that set on the same host. with sahid's patch and
your patch this now becomes safe if the guest without hw:mem_page_size has a numa topology.
mixing hugepage and non-hugepage guests is fine provided the non-hugepage guest has an
implicit or explicit numa topology, such as a guest that is using cpu pinning.
> with/without CPU pinning. We've only documented the latter, as
> discussed on this related bug by cfriesen:
>   https://bugs.launchpad.net/nova/+bug/1792985
> Given that we should be advising folks not to mix these (something I
> wasn't aware of until now), what does the original patch actually give
> us? If you're not mixing instances with/without hugepages, then the
> only use case that would fix is booting an instance with a NUMA
> topology but no hugepages on a host that had hugepages (because the
> instance would be limited to CPUs and memory from one NUMA node, but
> it's conceivable all available memory could be on another NUMA node).
> That seems like a very esoteric use case that might be better solved by
this is not that esoteric. one simple example is an operator who has configured
some number of hugepages on the hypervisor and wants to run pinned instances,
some of which have hugepages and some that don't. this works fine today; however,
oversubscription of memory in the non-hugepage case is broken as per the bug.
> perhaps making the reserved memory configuration option optionally NUMA
> specific.
well i have been asking for that for 2-3 releases. i would like to do that independently
of this issue and i think it will be a requirement if we ever model mempages per numa node
in placement.
>  This would allow us to mark this hugepage memory, which is
> clearly not intended for consumption by nova (remember: this host only
> handles non-hugepage instances)
again it is safe to mix hugepage instances with non-hugepage instances if hw:mem_page_size is
set in the non-hugepage case. but with your scenario in mind we can already reserve the hugepage memory
for host use by setting reserved_huge_pages in the default section of nova.conf
> , as reserved on a per-node basis. I'm
> not sure how we would map this to placement, though I'm sure it could
> be figured out.
that is simple. the placement inventory would just have the reserved value set to the value for 
the reserved_huge_pages config option.
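
as a rough sketch of that mapping (nova does not report mempage inventories today, so the helper names here are made up for illustration): the entries use the `node:N,size:S,count:C` form that reserved_huge_pages takes in nova.conf, restricted to plain KiB sizes for simplicity, and the returned dict uses the standard placement inventory fields.

```python
from typing import Dict, List


def parse_reserved(entry: str) -> Dict[str, int]:
    """Parse 'node:0,size:2048,count:64' into {'node': 0, 'size': 2048, 'count': 64}.

    Only integer KiB sizes are handled in this sketch; suffixed sizes
    such as '1GB' are not.
    """
    fields = dict(item.split(":", 1) for item in entry.split(","))
    return {key: int(value) for key, value in fields.items()}


def mempage_inventory(total_pages: int, node: int, size_kb: int,
                      reserved_entries: List[str]) -> Dict[str, float]:
    """Build a hypothetical per-NUMA-node inventory for one page size."""
    reserved = sum(
        parsed["count"]
        for parsed in map(parse_reserved, reserved_entries)
        if parsed["node"] == node and parsed["size"] == size_kb
    )
    return {
        "total": total_pages,
        "reserved": reserved,
        "min_unit": 1,
        "max_unit": total_pages,
        "step_size": 1,
        "allocation_ratio": 1.0,  # hugepages are never overcommitted
    }


entries = ["node:0,size:2048,count:64", "node:1,size:2048,count:32"]
node0_inventory = mempage_inventory(1024, 0, 2048, entries)
```

with these inputs `node0_inventory["reserved"]` is 64, so placement would only ever hand out the remaining 960 pages on node 0.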
> jaypipes is going to have so much fun mapping all this in placement :D
we have discussed this at length before, so placement can already model this quite well if nova
created the RPs and inventories for mempages. the main question is can we stop modeling memory_mb
inventories in the root compute node RP entirely. i personally would like to make
all instances numa affined by default. e.g. we would start treating all instances as if
hw:numa_nodes=1 was set and preferably hw:mem_page_size=small.
this would significantly simplify our lives in placement but it has a downside: if you want to create really large
instances they must be multi numa. e.g. if the guest is larger than will fit in a single host numa node it must
have hw:numa_nodes>1 to be scheduled. the simple fact is that such an instance is already spanning host numa nodes
but we are not telling the guest that. actually telling the guest it has multiple numa nodes will improve
guest performance, but it's a behavior change that not everyone will like.
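
to put a number on the large-instance case, a tiny sketch (the function is hypothetical, and it assumes guest memory is split evenly across guest numa nodes, that every host numa node has the same capacity, and that each guest node must fit within one host node):

```python
import math


def min_guest_numa_nodes(guest_mb: int, host_node_mb: int) -> int:
    """Smallest hw:numa_nodes value that lets each guest node fit in one host node."""
    if host_node_mb <= 0:
        raise ValueError("host_node_mb must be positive")
    return max(1, math.ceil(guest_mb / host_node_mb))


# 192 GiB guest on a host with 128 GiB per numa node
needed = min_guest_numa_nodes(192 * 1024, 128 * 1024)
```

here `needed` is 2, i.e. the flavor would have to carry hw:numa_nodes=2 before the instance could be scheduled under a numa-by-default model.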

Our current practice of tracking memory and cpus both per numa node and per host is tech debt that we need to clean
up at some point or live with the fact that numa will never be modeled in placement. we already
have numa affinity for vswitch, pci/sriov devices and we will/should have it for vgpus and pmem in the future.
long term i think we would only track things per numa node but i know sylvain has a detailed spec on this
which has more context than we can reasonably discuss here.

> Stephen
