On Mon, Jan 07, 2019 at 05:32:32PM +0000, Stephen Finucane wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances pagesize-aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
    (#1) Check if each host cell can provide enough memory of the
    pagesize requested for each instance cell
Otherwise
    If the host has hugepages
        (#2) Check if each host cell can provide enough memory of the
        smallest pagesize available on the host for each instance cell
    Otherwise
        (#3) Check if each host cell can provide enough memory for each
        instance cell, ignoring pagesizes
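A rough Python sketch of that branching, purely for illustration (the object and helper names here are simplified stand-ins I've invented, not the actual code in the patch):

    from collections import namedtuple

    # Simplified stand-ins for Nova's NUMACell / InstanceNUMACell objects.
    PageInfo = namedtuple('PageInfo', 'size_kb total used')
    HostCell = namedtuple('HostCell', 'mempages avail_memory_mb')
    InstanceCell = namedtuple('InstanceCell', 'pagesize_kb memory_mb')

    def _has_enough_pages(host_cell, pagesize_kb, memory_mb):
        # Hypothetical helper: check free pages of one size against the
        # requested amount of memory.
        pages = next((p for p in host_cell.mempages
                      if p.size_kb == pagesize_kb), None)
        if pages is None:
            return False
        avail_mb = (pages.total - pages.used) * pagesize_kb // 1024
        return memory_mb <= avail_mb

    def cell_can_fit(host_cell, instance_cell):
        if instance_cell.pagesize_kb:
            # (#1) Explicit pagesize requested: check that exact pagesize.
            return _has_enough_pages(host_cell, instance_cell.pagesize_kb,
                                     instance_cell.memory_mb)
        elif host_cell.mempages:
            # (#2) No pagesize requested but the host exposes page info:
            # reuse the pages-based check with the smallest pagesize (4K).
            smallest = min(p.size_kb for p in host_cell.mempages)
            return _has_enough_pages(host_cell, smallest,
                                     instance_cell.memory_mb)
        else:
            # (#3) No pagesize information at all: plain memory comparison.
            return instance_cell.memory_mb <= host_cell.avail_memory_mb

Note that case (#2) runs the same pages-based check as case (#1), which is what leads to issue 1 below.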
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without a pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages.
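To make that concrete with toy numbers (the figures below are made up for illustration; this isn't the actual code path, just the shape of the problem):

    # Host cell with 16G of 4K pages, ram_allocation_ratio of 1.5 and
    # 20G already claimed by other instances (i.e. legally overcommitted).
    total_small_pages_mb = 16 * 1024
    used_mb = 20 * 1024
    request_mb = 2 * 1024
    ram_allocation_ratio = 1.5

    # Previously, an instance with a NUMA topology but no pagesize request
    # was subject to the overcommit-aware check, so this request fits.
    fits_with_overcommit = (
        used_mb + request_mb <= total_small_pages_mb * ram_allocation_ratio)

    # After the patch, case (#2) routes it through the pages-based check,
    # which never applies the ratio, so the same request is rejected.
    fits_like_hugepages = used_mb + request_mb <= total_small_pages_mb

    print(fits_with_overcommit)   # True
    print(fits_like_hugepages)    # False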
I think that we should not accept any overcommit. Only instances with an InstanceNUMATopology associated pass through this part of the check. Such instances want to use features like a guest NUMA topology, so that their memory is mapped to host NUMA nodes, or CPU pinning. Both cases are used for performance reasons and to avoid any cross-node memory latency.
On Tue, 2019-01-08 at 11:06 +0100, Sahid Orentino Ferdjaoui wrote:

That is not necessarily correct. If I request CPU pinning, that does not imply that I don't want the ability to oversubscribe memory; that is an artifact of how we chose to implement pinning in the libvirt driver. For the case of CPU pinning specifically, I have always felt it is wrong that we implicitly create a NUMA topology for the guest. In the case of hw:numa_nodes=1 in the absence of any other extra spec or image metadata, I also do not think it is correct to disable oversubscription retroactively after supporting it for several years. Requesting a NUMA topology without explicitly requesting hugepages should never have disabled oversubscription, and changing that behaviour should have required both a microversion and a nova spec. https://review.openstack.org/#/c/532168 was simply a bug fix and therefore should not have changed the meaning of requesting a NUMA topology.
2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize-aware.
That is another issue. We report all of the physical memory (small pages + allocated hugepages) to the resource tracker. The difficulty is that we can't just change the virt driver to report only small pages, as some instances would no longer be able to get scheduled. We should basically change the resource tracker so it can take the different kinds of page memory into account.
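To put rough numbers on that (made up for illustration; the real figures come from the virt driver's host report):

    # A host with 32G of RAM, of which 8G is reserved as 2M hugepages.
    hugepage_2m_count = 4096
    small_pages_mb = 24 * 1024

    # What gets reported to the resource tracker today: everything.
    reported_memory_mb = small_pages_mb + hugepage_2m_count * 2    # 32768

    # Reporting only small pages would shrink what the scheduler sees to
    # 24576 MB, and instances that currently fit would fail to schedule,
    # hence the need to track the different kinds of pages separately.
    small_pages_only_mb = small_pages_mb                           # 24576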
But it's not really an issue, since instances that use "NUMA features" (in the Nova world) should be isolated in an aggregate and not be mixed with non-NUMA instances. The reason is simple: non-NUMA instances have no boundaries and break the rules of NUMA instances.
It is true that today we should partition deployments using host aggregates, separating hosts for NUMA instances from hosts for non-NUMA instances. The reason issue 2 was raised is that the commit message implied the patch addressed mixing NUMA and non-NUMA guests on the same host: "Also when no pagesize is requested we should consider to compute memory usage based on small pages since the amount of physical memory available may also include some large pages." But the logic in the patch does not actually get triggered when the guest does not have a NUMA topology, so it does not actually consider the total number of small pages in that case. This was linked to a downstream bugzilla you filed, https://bugzilla.redhat.com/show_bug.cgi?id=1625119, and another for OSP 10, https://bugzilla.redhat.com/show_bug.cgi?id=1519540, which has 3 customer cases associated with it. On closer inspection, the patch does not address the downstream bug at all: that bug expressly states that nova does not consider small pages when mem_page_size is not set, but since we don't execute this code for non-NUMA guests we don't actually resolve the issue.

The 3 customer cases the downstream bug claims to resolve are: 1) a scheduler race where two pinned instances get scheduled to the same set of resources (this can only be fixed with placement); 2) mixing hugepage and non-hugepage guests resulted in OOM events; 3) instances with a NUMA topology no longer respect the RAM allocation ratio. The third customer issue was directly caused by backporting this patch, the second would be resolved by using host aggregates to segregate hugepage hosts from non-NUMA hosts, and the first can't be addressed without preemptively claiming CPUs/hugepages in the scheduler/placement.
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2) (a rough sketch of this follows the spec link below). We can mitigate issue (2) by advising operators to split hosts into aggregates based on whether 'hw:mem_page_size' is set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check, but I think this may already be the case in some docs (sean-k-mooney said Intel used to do this; I don't know about Red Hat's docs or upstream). In addition, we did actually call that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
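For issue (1), the flag could look something like this (an illustrative sketch only; the real functions in nova/virt/hardware.py have different signatures, and the flag name is invented):

    def has_enough_pages(total_pages, used_pages, pagesize_kb, request_mb,
                         ram_allocation_ratio=1.0, allow_overcommit=False):
        # Hypothetical pages-based check that can optionally honour the
        # ram_allocation_ratio; 'allow_overcommit' is an invented flag.
        capacity_mb = total_pages * pagesize_kb // 1024
        if allow_overcommit:
            capacity_mb = int(capacity_mb * ram_allocation_ratio)
        used_mb = used_pages * pagesize_kb // 1024
        return used_mb + request_mb <= capacity_mb

    # Case (#1), explicit hugepage request: overcommit stays disabled.
    has_enough_pages(total_pages=4096, used_pages=3584, pagesize_kb=2048,
                     request_mb=2048)

    # Case (#2), no pagesize requested: pass the flag so the existing
    # ram_allocation_ratio still applies to regular 4K pages.
    has_enough_pages(total_pages=4 * 1024 * 1024, used_pages=5 * 1024 * 1024,
                     pagesize_kb=4, request_mb=2048,
                     ram_allocation_ratio=1.5, allow_overcommit=True)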
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer-term fix would be to start tracking hugepages in a non-NUMA-aware way too, but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too?
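If we did go the placement route, per-pagesize tracking might look something like the inventory below (purely speculative; the CUSTOM_* resource class names are invented for illustration, and only MEMORY_MB is a standard class today):

    # Hypothetical compute node inventory if hugepages were tracked as
    # their own resource classes in placement.
    inventory = {
        'MEMORY_MB': {                 # 4K-backed memory, overcommit allowed
            'total': 24576,
            'reserved': 512,
            'allocation_ratio': 1.5,
        },
        'CUSTOM_MEMORY_PAGE_2MB': {    # 2M hugepages, no overcommit
            'total': 4096,
            'reserved': 0,
            'allocation_ratio': 1.0,
        },
        'CUSTOM_MEMORY_PAGE_1GB': {    # 1G hugepages, no overcommit
            'total': 8,
            'reserved': 0,
            'allocation_ratio': 1.0,
        },
    }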
Thoughts?

Stephen