[nova] Mempage fun
Sahid Orentino Ferdjaoui
sahid.ferdjaoui at canonical.com
Tue Jan 8 14:17:55 UTC 2019
On Tue, Jan 08, 2019 at 12:50:27PM +0100, Sahid Orentino Ferdjaoui wrote:
> On Tue, Jan 08, 2019 at 10:47:47AM +0000, Stephen Finucane wrote:
> > On Tue, 2019-01-08 at 11:06 +0100, Sahid Orentino Ferdjaoui wrote:
> > > On Mon, Jan 07, 2019 at 05:32:32PM +0000, Stephen Finucane wrote:
> > > > We've been looking at a patch that landed some months ago and have
> > > > spotted some issues:
> > > >
> > > > https://review.openstack.org/#/c/532168
> > > >
> > > > In summary, that patch is intended to make the memory check for
> > > > instances aware of memory pagesizes. The logic it introduces looks
> > > > something like this:
> > > >
> > > > If the instance requests a specific pagesize
> > > > (#1) Check if each host cell can provide enough memory of the
> > > > pagesize requested for each instance cell
> > > > Otherwise
> > > > If the host has hugepages
> > > > (#2) Check if each host cell can provide enough memory of the
> > > > smallest pagesize available on the host for each instance cell
> > > > Otherwise
> > > > (#3) Check if each host cell can provide enough memory for
> > > > each instance cell, ignoring pagesizes
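> > > >
> > > > For illustration only, the flow above restated as a small Python
> > > > sketch (toy structures and names, not the actual nova.virt.hardware
> > > > code):
> > > >
> > > >     from collections import namedtuple
> > > >
> > > >     # free_pages maps page size in KiB -> number of free pages
> > > >     HostCell = namedtuple('HostCell', ['free_pages'])
> > > >     InstCell = namedtuple('InstCell', ['memory_kb', 'pagesize'])
> > > >
> > > >     def cell_can_fit(host, inst):
> > > >         if inst.pagesize:
> > > >             # (#1) explicit pagesize request: exact fit, no overcommit
> > > >             free_kb = host.free_pages.get(inst.pagesize, 0) * inst.pagesize
> > > >             return free_kb >= inst.memory_kb
> > > >         if any(size > 4 for size in host.free_pages):
> > > >             # (#2) host has hugepages: reuse the no-overcommit check,
> > > >             # but against the smallest pagesize on the host
> > > >             small = min(host.free_pages)
> > > >             return host.free_pages[small] * small >= inst.memory_kb
> > > >         # (#3) host has no hugepages: plain memory comparison
> > > >         free_kb = sum(s * n for s, n in host.free_pages.items())
> > > >         return free_kb >= inst.memory_kb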
> > > >
> > > > This also has the side-effect of allowing instances with hugepages and
> > > > instances with a NUMA topology but no hugepages to co-exist on the same
> > > > host, because the latter will now be aware of hugepages and won't
> > > > consume them. However, there are a couple of issues with this:
> > > >
> > > > 1. It breaks overcommit for instances without a pagesize request
> > > > running on hosts with different pagesizes. This is because we don't
> > > > allow overcommit for hugepages, but case (#2) above means we are now
> > > > reusing the same functions previously used for actual hugepage
> > > > checks to check for regular 4k pages
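> > > >
> > > > With made-up numbers, issue (1) is: a host cell with 16 GB of free
> > > > 4k pages and ram_allocation_ratio=1.5 should be able to take 20 GB
> > > > of such instances under overcommit, but the hugepage-style check
> > > > used for case (#2) refuses it:
> > > >
> > > >     free_small_mb = 16384    # free 4k-page memory in the cell
> > > >     ratio = 1.5              # ram_allocation_ratio
> > > >     requested_mb = 20480     # instances without a pagesize request
> > > >
> > > >     overcommit_ok = requested_mb <= free_small_mb * ratio   # True
> > > >     pagesize_ok = requested_mb <= free_small_mb             # False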
> > >
> > > I think that we should not accept any overcommit here. Only instances
> > > with an InstanceNUMATopology associated reach this part of the check.
> > > Such instances want features like a guest NUMA topology (so their
> > > memory is mapped to specific host NUMA nodes) or CPU pinning. Both are
> > > used for performance reasons and to avoid cross-NUMA memory latency.
> >
> > The issue with this is that we had previously designed everything *to*
> > allow overcommit:
> >
> > https://github.com/openstack/nova/blob/18.0.0/nova/virt/hardware.py#L1047-L1065
>
> This code never worked, Stephen; it was there to satisfy the related
> unit tests rather than anything else. I would not recommend using it as
> a reference.
>
> > The only time this doesn't apply is if CPU pinning is also in action
> > (remembering that CPU pinning and NUMA topologies are tightly bound and
> > CPU pinning implies a NUMA topology, much to Jay's consternation). As
> > noted below, our previous advice was not to mix hugepage instances and
> > non-hugepage instances, meaning hosts handling non-hugepage instances
> > should not have hugepages (or should mark the memory consumed by them
> > as reserved for host). We have in effect broken previous behaviour in
> > the name of solving a bug that didn't necessarily have to be fixed yet.
> >
> > > > 2. It doesn't fix the issue when non-NUMA instances exist on the same
> > > > host as NUMA instances with hugepages. The non-NUMA instances don't
> > > > run through any of the code above, meaning they're still not
> > > > pagesize aware
> > >
> > > That is another issue. We report all of the physical memory (small
> > > pages + allocated hugepages) to the resource tracker. The difficulty
> > > is that we can't simply change the virt driver to report only small
> > > pages, since some instances would then not be able to be scheduled.
> > > We should basically change the resource tracker so it can take the
> > > different kinds of page memory into account.
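> > >
> > > A made-up illustration of why reporting only small pages would break
> > > scheduling (numbers invented):
> > >
> > >     total_mb = 131072        # 128G host
> > >     hugepages_mb = 65536     # 64G allocated as 1G hugepages
> > >     small_mb = total_mb - hugepages_mb
> > >
> > >     # Today the driver reports total_mb. If it reported only
> > >     # small_mb, an 80G hugepage-backed flavor could never pass the
> > >     # RAM check even though the hugepages are free, hence the need
> > >     # for the tracker to understand the different kinds of pages.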
> >
> > Agreed (likely by moving tracking of this resource to placement, I
> > assume). It's a longer-term fix though.
> >
> > > But it's not really an issue, since instances that use "NUMA features"
> > > (in the Nova world) should be isolated in an aggregate and not mixed
> > > with non-NUMA instances. The reason is simple: non-NUMA instances have
> > > no boundaries and break the rules that NUMA instances rely on.
> >
> > Again, we have to be careful not to mix up NUMA and CPU pinning. It's
> > perfectly fine to have NUMA without CPU pinning, though not the other
> > way around. For example:
> >
> > $ openstack flavor set --property hw:numa_nodes=2 FLAVOR
> >
> > From what I can tell, there are three reasons that an instance will
> > have a NUMA topology: the user explicitly requested one, the user
> > requested CPU pinning and got one implicitly, or the user requested a
> > specific pagesize and, again, got one implicitly. We handle the latter
> > two with the advice given below, but I don't think anyone has ever said
> > we must separate instances that had a user-specified NUMA topology from
> > those that had no NUMA topology. If we're going down this path, we need
> > clear docs.
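> >
> > For reference, each of those three routes maps to flavor properties
> > like the one above:
> >
> > $ openstack flavor set --property hw:numa_nodes=2 FLAVOR
> > $ openstack flavor set --property hw:cpu_policy=dedicated FLAVOR
> > $ openstack flavor set --property hw:mem_page_size=large FLAVOR
> >
> > The first requests a NUMA topology explicitly; the other two result in
> > one implicitly.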
Now I remember why we can't support it. When defining a guest NUMA
topology (hw:numa_nodes), the memory is mapped to the assigned host NUMA
nodes, meaning the guest memory can't be swapped out. If a non-NUMA
instance starts using memory from the host NUMA nodes used by a guest
with a NUMA topology, the guest with the NUMA topology can run out of
memory and be killed.
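
A toy example of that failure mode (numbers invented):

    node0_mb = 32768       # memory on host NUMA node 0
    pinned_mb = 24576      # NUMA guest, strictly mapped to node 0,
                           # cannot be swapped out or moved to node 1
    floating_mb = 16384    # non-NUMA guest, kernel places it anywhere

    # If enough of the floating guest's memory lands on node 0, the node
    # is oversubscribed (24576 + ~16384 > 32768); since the pinned memory
    # cannot move, the kernel OOM killer ends up killing a guest.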
>
> The implementation is pretty old and it was a first design from
> scratch; not all situations have been taken into account or documented.
> If we want to create specific behaviours we are going to add more
> complexity to something which is already complex and not completely
> stable; the patch you mentioned, which was merged last release, is an
> example.
>
> I agree documenting is probably where we should go: don't try to mix
> instances with an InstanceNUMATopology and instances without one, since
> Nova computes their resources differently, and don't try to overcommit
> such instances.
>
> We basically recommend using aggregates for pinning, realtime and
> hugepages, so it looks reasonable to add guest NUMA topology to that
> list.
>
> > Stephen
> >
> > > > We could probably fix issue (1) by modifying those hugepage functions
> > > > we're using to allow overcommit via a flag that we pass for case (#2).
> > > > We can mitigate issue (2) by advising operators to split hosts into
> > > > aggregates for 'hw:mem_page_size' set or unset (in addition to
> > > > 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but
> > > > I think this may be the case in some docs (sean-k-mooney said Intel
> > > > used to do this. I don't know about Red Hat's docs or upstream). In
> > > > addition, we did actually call that out in the original spec:
> > > >
> > > > https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-driver-large-pages.html#other-deployer-impact
> > > >
> > > > However, if we're doing that for non-NUMA instances, one would have to
> > > > question why the patch is necessary/acceptable for NUMA instances. For
> > > > what it's worth, a longer fix would be to start tracking hugepages in a
> > > > non-NUMA aware way too but that's a lot more work and doesn't fix the
> > > > issue now.
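> > > >
> > > > A rough sketch of the flag idea in (1), reusing the toy structures
> > > > from the sketch above (illustrative names, not the actual
> > > > nova.virt.hardware signatures):
> > > >
> > > >     def cell_has_pages(host, inst, pagesize, allow_overcommit=False,
> > > >                        ratio=1.0):
> > > >         free_kb = host.free_pages.get(pagesize, 0) * pagesize
> > > >         if allow_overcommit:
> > > >             # only for case (#2), i.e. implicitly-selected small pages
> > > >             free_kb *= ratio
> > > >         return free_kb >= inst.memory_kb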
> > > >
> > > > As such, my question is this: should we look at fixing issue (1) and
> > > > documenting issue (2), or should we revert the thing wholesale until we
> > > > work on a solution that could e.g. let us track hugepages via placement
> > > > and resolve issue (2) too?
> > > >
> > > > Thoughts?
> > > > Stephen
> > > >
> >