[nova] Mempage fun

Stephen Finucane sfinucan at redhat.com
Tue Jan 8 18:29:03 UTC 2019


On Tue, 2019-01-08 at 15:17 +0100, Sahid Orentino Ferdjaoui wrote:
> On Tue, Jan 08, 2019 at 12:50:27PM +0100, Sahid Orentino Ferdjaoui wrote:
> > On Tue, Jan 08, 2019 at 10:47:47AM +0000, Stephen Finucane wrote:
> > > On Tue, 2019-01-08 at 11:06 +0100, Sahid Orentino Ferdjaoui wrote:
> > > > On Mon, Jan 07, 2019 at 05:32:32PM +0000, Stephen Finucane wrote:
> > > > > We've been looking at a patch that landed some months ago and have
> > > > > spotted some issues:
> > > > > 
> > > > > https://review.openstack.org/#/c/532168
> > > > > 
> > > > > In summary, that patch is intended to make the memory check for
> > > > > instances pagesize-aware. The logic it introduces looks something
> > > > > like this:
> > > > > 
> > > > >    If the instance requests a specific pagesize
> > > > >       (#1) Check if each host cell can provide enough memory of the
> > > > >       pagesize requested for each instance cell
> > > > >    Otherwise
> > > > >       If the host has hugepages
> > > > >          (#2) Check if each host cell can provide enough memory of the
> > > > >          smallest pagesize available on the host for each instance cell
> > > > >       Otherwise
> > > > >          (#3) Check if each host cell can provide enough memory for
> > > > >          each instance cell, ignoring pagesizes
> > > > > 
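
(For reference, here is that outline in rough Python; the function,
argument and dict key names are made up for illustration and are not the
actual nova.virt.hardware API or data structures.)

    def cell_fits(host_cell, inst_cell_mem_mb, requested_pagesize_kb=None):
        # host_cell is assumed to look like:
        #   {'memory_mb': 16384, 'mempages': {4: n_free_4k, 2048: n_free_2M}}
        # with pagesizes in KiB and memory in MiB.
        pages = host_cell['mempages']  # pagesize_kb -> free page count
        if requested_pagesize_kb:
            # (#1) explicit pagesize request: check that pagesize only,
            # with no overcommit
            free_mb = pages.get(requested_pagesize_kb, 0) * requested_pagesize_kb // 1024
            return free_mb >= inst_cell_mem_mb
        if any(size > 4 for size in pages):
            # (#2) no request, but the host has hugepages: fall back to the
            # smallest pagesize on the host, still using the no-overcommit,
            # hugepage-style check
            smallest = min(pages)
            return pages[smallest] * smallest // 1024 >= inst_cell_mem_mb
        # (#3) no hugepages on the host: plain memory check, pagesizes ignored
        return host_cell['memory_mb'] >= inst_cell_mem_mb
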
> > > > > This also has the side-effect of allowing instances with hugepages and
> > > > > instances with a NUMA topology but no hugepages to co-exist on the same
> > > > > host, because the latter will now be aware of hugepages and won't
> > > > > consume them. However, there are a couple of issues with this:
> > > > > 
> > > > >    1. It breaks overcommit for instances without a pagesize request
> > > > >       running on hosts with different pagesizes. This is because we don't
> > > > >       allow overcommit for hugepages, but case (#2) above means we are now
> > > > >       reusing the same functions previously used for actual hugepage
> > > > >       checks to check for regular 4k pages
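
(To make issue (1) concrete, some illustrative numbers; the 1.5 ratio is
just an example overcommit ratio, not necessarily what is configured.)

    host_cell_mem_mb = 16384      # 4k-backed memory in one host NUMA cell
    ram_allocation_ratio = 1.5    # example overcommit ratio for RAM
    requested_mb = 3 * 8192       # three 8G instance cells on that host cell

    # Old behaviour (plain, case #3 style check): overcommit applies
    fits_old = requested_mb <= host_cell_mem_mb * ram_allocation_ratio  # True

    # New behaviour (case #2 reuses the hugepage check): no overcommit
    fits_new = requested_mb <= host_cell_mem_mb                         # False
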
> > > > 
> > > > I think that we should not accept any overcommit. Only instances with
> > > > an associated InstanceNUMATopology reach this part of the check. Such
> > > > instances want to use features like a guest NUMA topology, so that
> > > > their memory is mapped to host NUMA nodes, or CPU pinning. Both cases
> > > > are used for performance reasons and to avoid any cross-NUMA memory
> > > > latency.
> > > 
> > > The issue with this is that we had previously designed everything *to*
> > > allow overcommit:
> > > 
> > > https://github.com/openstack/nova/blob/18.0.0/nova/virt/hardware.py#L1047-L1065
> > 
> > This code never worked, Stephen; it was only there to satisfy the
> > related unit tests. I would not recommend using it as a reference.
> > 
> > > The only time this doesn't apply is if CPU pinning is also in action
> > > (remembering that CPU pinning and NUMA topologies are tightly bound and
> > > CPU pinning implies a NUMA topology, much to Jay's consternation). As
> > > noted below, our previous advice was not to mix hugepage instances and
> > > non-hugepage instances, meaning hosts handling non-hugepage instances
> > > should not have hugepages (or should mark the memory consumed by them
> > > as reserved for host). We have in effect broken previous behaviour in
> > > the name of solving a bug that didn't necessarily have to be fixed yet.
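
(On the "reserved for host" point: that should be the reserved_huge_pages
option in nova.conf; if I remember the syntax correctly it is something
like the following, reserving 64 x 2M pages on host NUMA node 0.)

    [DEFAULT]
    reserved_huge_pages = node:0,size:2048,count:64
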
> > > 
> > > > >    2. It doesn't fix the issue when non-NUMA instances exist on the same
> > > > >       host as NUMA instances with hugepages. The non-NUMA instances don't
> > > > >       run through any of the code above, meaning they're still not
> > > > >       pagesize aware
> > > > 
> > > > That is another issue. We report to the resource tracker all the
> > > > physical memory (small pages + hugepages allocated). The difficulty is
> > > > that we can't just change the virt driver to report only small
> > > > pages: some instances won't be able to get scheduled. We should
> > > > basically change the resource tracker so it can take into account the
> > > > different kinds of page memory.
> > > 
> > > Agreed (likely by moving tracking of this resource to placement, I
> > > assume). It's a longer-term fix though.
> > > 
> > > > But it's not really an issue since instances that use "NUMA features"
> > > > (in the Nova world) should be isolated to an aggregate and not be
> > > > mixed with non-NUMA instances. The reason is simple: non-NUMA
> > > > instances do not have boundaries and break the rules of NUMA
> > > > instances.
> > > 
> > > Again, we have to be careful not to mix up NUMA and CPU pinning. It's
> > > perfectly fine to have NUMA without CPU pinning, though not the other
> > > way around. For example:
> > > 
> > >     $ openstack flavor set --property hw:numa_nodes=2 FLAVOR
> > > 
> > > From what I can tell, there are three reasons that an instance will
> > > have a NUMA topology: the user explicitly requested one, the user
> > > requested CPU pinning and got one implicitly, or the user requested a
> > > specific pagesize and, again, got one implicitly. We handle the latter
> > > two with the advice given below, but I don't think anyone has ever said
> > > we must separate instances that had a user-specified NUMA topology from
> > > those that had no NUMA topology. If we're going down this path, we need
> > > clear docs.
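
For concreteness, the three routes look something like:

    # explicit NUMA topology
    $ openstack flavor set --property hw:numa_nodes=2 FLAVOR
    # implicit, via CPU pinning
    $ openstack flavor set --property hw:cpu_policy=dedicated FLAVOR
    # implicit, via a pagesize request
    $ openstack flavor set --property hw:mem_page_size=large FLAVOR
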
> 
> Now I remember why we can't support it. When defining a guest NUMA
> topology (hw:numa_nodes) the memory is mapped to the assigned host NUMA
> nodes, meaning that the guest memory can't be swapped out. If a non-NUMA
> instance starts using memory from host NUMA nodes used by a guest with
> NUMA, the guest with NUMA can run out of memory and be killed.

Based on my minimal test, it seems to work just fine?

https://bugs.launchpad.net/nova/+bug/1810977

The instances boot with the patch reverted. Is there something I've
missed?

> > The implementation is pretty old and it was a first design from
> > scratch; not all the situations have been taken into account or
> > documented. If we want to create specific behaviors we are going to
> > add more complexity to something which is already complex and not
> > completely stable; the patch you have mentioned, which was merged
> > last release, is an example.
> > 
> > I agree documenting is probably where we should go: don't try to mix
> > instances with an InstanceNUMATopology and instances without one,
> > since Nova uses a different way to compute their resources, and don't
> > try to overcommit such instances.
> > 
> > We basically recommend using aggregates for pinning, realtime and
> > hugepages, so it looks reasonable to add guest NUMA topology to that
> > list.
> > 
> > > Stephen
> > > 
> > > > > We could probably fix issue (1) by modifying those hugepage functions
> > > > > we're using to allow overcommit via a flag that we pass for case (#2).
> > > > > We can mitigate issue (2) by advising operators to split hosts into
> > > > > aggregates for 'hw:mem_page_size' set or unset (in addition to
> > > > > 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but
> > > > > I think this may be the case in some docs (sean-k-mooney said Intel
> > > > > used to do this. I don't know about Red Hat's docs or upstream). In
> > > > > addition, we did actually call that out in the original spec:
> > > > > 
> > > > > https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-driver-large-pages.html#other-deployer-impact
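
(A rough sketch of that aggregate split, assuming the
AggregateInstanceExtraSpecsFilter is enabled; the 'hugepages' property
name below is made up.)

    $ openstack aggregate create hugepage-hosts
    $ openstack aggregate set --property hugepages=true hugepage-hosts
    $ openstack aggregate add host hugepage-hosts compute-1
    $ openstack flavor set \
        --property aggregate_instance_extra_specs:hugepages=true \
        --property hw:mem_page_size=large FLAVOR
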
> > > > > 
> > > > > However, if we're doing that for non-NUMA instances, one would have to
> > > > > question why the patch is necessary/acceptable for NUMA instances. For
> > > > > what it's worth, a longer-term fix would be to start tracking hugepages
> > > > > in a non-NUMA-aware way too, but that's a lot more work and doesn't fix
> > > > > the issue now.
> > > > > 
> > > > > As such, my question is this: should we look at fixing issue (1) and
> > > > > documenting issue (2), or should we revert the thing wholesale until we
> > > > > work on a solution that could e.g. let us track hugepages via placement
> > > > > and resolve issue (2) too.
> > > > > 
> > > > > Thoughts?
> > > > > Stephen
> > > > > 
