[nova] Mempage fun

Sahid Orentino Ferdjaoui sahid.ferdjaoui at canonical.com
Tue Jan 8 11:50:27 UTC 2019


On Tue, Jan 08, 2019 at 10:47:47AM +0000, Stephen Finucane wrote:
> On Tue, 2019-01-08 at 11:06 +0100, Sahid Orentino Ferdjaoui wrote:
> > On Mon, Jan 07, 2019 at 05:32:32PM +0000, Stephen Finucane wrote:
> > > We've been looking at a patch that landed some months ago and have
> > > spotted some issues:
> > > 
> > > https://review.openstack.org/#/c/532168
> > > 
> > > In summary, that patch is intended to make the memory check for
> > > instances memory pagesize aware. The logic it introduces looks
> > > something like this:
> > > 
> > >    If the instance requests a specific pagesize
> > >       (#1) Check if each host cell can provide enough memory of the
> > >       pagesize requested for each instance cell
> > >    Otherwise
> > >       If the host has hugepages
> > >          (#2) Check if each host cell can provide enough memory of the
> > >          smallest pagesize available on the host for each instance cell
> > >       Otherwise
> > >          (#3) Check if each host cell can provide enough memory for
> > >          each instance cell, ignoring pagesizes
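
To make sure we are reading it the same way, here is a rough sketch of
that branching in Python; the helper and field names are made up for
illustration and are not nova's actual code:

    SMALL_PAGE_KB = 4

    def free_kb(host_cell, pagesize_kb):
        # made-up helper: free memory available as pages of pagesize_kb
        page = next(p for p in host_cell['mempages']
                    if p['size_kb'] == pagesize_kb)
        return (page['total'] - page['used']) * pagesize_kb

    def fits(host_cell, mem_kb, requested_pagesize_kb=None):
        # (#1) explicit pagesize request: check that exact pagesize
        if requested_pagesize_kb:
            return free_kb(host_cell, requested_pagesize_kb) >= mem_kb
        sizes = [p['size_kb'] for p in host_cell['mempages']]
        # (#2) host has hugepages: check the smallest pagesize available
        if any(s > SMALL_PAGE_KB for s in sizes):
            return free_kb(host_cell, min(sizes)) >= mem_kb
        # (#3) no hugepages: plain memory check, pagesizes ignored
        return host_cell['free_mem_kb'] >= mem_kb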
> > > 
> > > This also has the side-effect of allowing instances with hugepages and
> > > instances with a NUMA topology but no hugepages to co-exist on the same
> > > host, because the latter will now be aware of hugepages and won't
> > > consume them. However, there are a couple of issues with this:
> > > 
> > >    1. It breaks overcommit for instances without pagesize request
> > >       running on hosts with different pagesizes. This is because we don't
> > >       allow overcommit for hugepages, but case (#2) above means we are now
> > >       reusing the same functions previously used for actual hugepage
> > >       checks to check for regular 4k pages
> > 
> > I think we should not accept any overcommit here. Only instances
> > with an InstanceNUMATopology associated reach this part of the
> > check. Such instances want features like a guest NUMA topology, so
> > that their memory is mapped to host NUMA nodes, or CPU pinning. Both
> > are used for performance reasons and to avoid cross-NUMA memory
> > latency.
> 
> The issue with this is that we had previously designed everything *to*
> allow overcommit:
> 
> https://github.com/openstack/nova/blob/18.0.0/nova/virt/hardware.py#L1047-L1065

This code never worked, Stephen; it was only there to satisfy the
related unit tests. I would not recommend using it as a reference.

> The only time this doesn't apply is if CPU pinning is also in action
> (remembering that CPU pinning and NUMA topologies are tightly bound and
> CPU pinning implies a NUMA topology, much to Jay's consternation). As
> noted below, our previous advice was not to mix hugepage instances and
> non-hugepage instances, meaning hosts handling non-hugepage instances
> should not have hugepages (or should mark the memory consumed by them
> as reserved for host). We have in effect broken previous behaviour in
> the name of solving a bug that didn't necessarily have to be fixed yet.
> 
> > >    2. It doesn't fix the issue when non-NUMA instances exist on the same
> > >       host as NUMA instances with hugepages. The non-NUMA instances don't
> > >       run through any of the code above, meaning they're still not
> > >       pagesize aware
> > 
> > That is another issue. We report all the physical memory (small
> > pages + allocated hugepages) to the resource tracker. The difficulty
> > is that we can't just change the virt driver to report only small
> > pages, because some instances would then not be able to get
> > scheduled. We basically need to change the resource tracker so it
> > can take the different kinds of page memory into account.
> 
> Agreed (likely via move tracking of this resource to placement, I
> assume). It's a longer term fix though.
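
If that moves to placement, I could imagine tracking per-pagesize
inventories with custom resource classes; purely illustrative, nothing
like this exists today:

    $ openstack resource class create CUSTOM_MEMORY_PAGE_2MB
    $ openstack resource provider inventory set $RP_UUID \
        --resource MEMORY_MB=16384 \
        --resource CUSTOM_MEMORY_PAGE_2MB=4096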
> 
> > But it's not really an issue, since instances that use "NUMA
> > features" (in the Nova world) should be isolated in an aggregate and
> > not mixed with non-NUMA instances. The reason is simple: non-NUMA
> > instances have no boundaries and break the rules of NUMA instances.
> 
> Again, we have to be careful not to mix up NUMA and CPU pinning. It's
> perfectly fine to have NUMA without CPU pinning, though not the other
> way around. For example:
> 
>     $ openstack flavor set --property hw:numa_nodes=2 FLAVOR
> 
> From what I can tell, there are three reasons that an instance will
> have a NUMA topology: the user explicitly requested one, the user
> requested CPU pinning and got one implicitly, or the user requested a
> specific pagesize and, again, got one implicitly. We handle the latter
> two with the advice given below, but I don't think anyone has ever said
> we must separate instances that had a user-specified NUMA topology from
> those that had no NUMA topology. If we're going down this path, we need
> clear docs.
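
For reference, the three cases map to flavor extra specs along these
lines:

    # explicit guest NUMA topology
    $ openstack flavor set --property hw:numa_nodes=2 FLAVOR
    # implicit, via CPU pinning
    $ openstack flavor set --property hw:cpu_policy=dedicated FLAVOR
    # implicit, via an explicit pagesize
    $ openstack flavor set --property hw:mem_page_size=large FLAVOR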

The implementation is pretty old and it was a first design from
scratch; not all the situations have been taken into account or
documented. If we want to create specific behaviors we are going to
add more complexity to something which is already complex, and which
is not completely stable, as shown by the patch you mentioned that
was merged last release.

I agree documenting is probably the way to go: don't try to mix
instances with an InstanceNUMATopology and ones without, since Nova
computes their resources differently, and don't try to overcommit
such instances.

We already basically recommend using aggregates for pinning, realtime
and hugepages, so it looks reasonable to add guest NUMA topology to
that list.
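
Something like the following, for example; the aggregate property name
here is only an example and it assumes the
AggregateInstanceExtraSpecsFilter is enabled:

    $ openstack aggregate create --property numa=true numa-hosts
    $ openstack aggregate add host numa-hosts compute-0
    $ openstack flavor set \
        --property aggregate_instance_extra_specs:numa=true \
        --property hw:numa_nodes=2 FLAVOR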

> Stephen
> 
> > > We could probably fix issue (1) by modifying those hugepage functions
> > > we're using to allow overcommit via a flag that we pass for case (#2).
> > > We can mitigate issue (2) by advising operators to split hosts into
> > > aggregates for 'hw:mem_page_size' set or unset (in addition to
> > > 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but
> > > I think this may be the case in some docs (sean-k-mooney said Intel
> > > used to do this. I don't know about Red Hat's docs or upstream). In
> > > addition, we did actually call that out in the original spec:
> > > 
> > > https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-driver-large-pages.html#other-deployer-impact
> > > 
> > > However, if we're doing that for non-NUMA instances, one would have to
> > > question why the patch is necessary/acceptable for NUMA instances. For
> > > what it's worth, a longer-term fix would be to start tracking hugepages in a
> > > non-NUMA aware way too but that's a lot more work and doesn't fix the
> > > issue now.
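
For what it's worth, I read the flag idea for case (#2) as something
like this hypothetical sketch, not nova's actual code:

    def enough_pages(total_pages, used_pages, requested_kb, pagesize_kb,
                     allow_overcommit=False, ram_ratio=1.5):
        # hypothetical: case (#2) would pass allow_overcommit=True so
        # plain small pages keep honouring ram_allocation_ratio, while
        # an explicit pagesize request in case (#1) keeps it False
        requested_pages = -(-requested_kb // pagesize_kb)  # ceiling division
        limit = total_pages * ram_ratio if allow_overcommit else total_pages
        return requested_pages <= limit - used_pages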
> > > 
> > > As such, my question is this: should we look at fixing issue (1) and
> > > documenting issue (2), or should we revert the thing wholesale until we
> > > work on a solution that could e.g. let us track hugepages via placement
> > > and resolve issue (2) too.
> > > 
> > > Thoughts?
> > > Stephen
> > > 
> 


