We've been looking at a patch that landed some months ago and have spotted some issues:

https://review.openstack.org/#/c/532168

In summary, that patch is intended to make the memory check for instances pagesize aware. The logic it introduces looks something like this:

    If the instance requests a specific pagesize
        (#1) Check if each host cell can provide enough memory of the
             pagesize requested for each instance cell
    Otherwise
        If the host has hugepages
            (#2) Check if each host cell can provide enough memory of the
                 smallest pagesize available on the host for each instance cell
        Otherwise
            (#3) Check if each host cell can provide enough memory for each
                 instance cell, ignoring pagesizes

This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:

1. It breaks overcommit for instances without a pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages.

2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware.

We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually call that out in the original spec:

https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...

However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer-term fix would be to start tracking hugepages in a non-NUMA-aware way too, but that's a lot more work and doesn't fix the issue now.

As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.

Thoughts?
Stephen
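To make the branching above concrete, here is a rough pseudo-Python sketch of the described check. All names and data structures are illustrative only, not nova's actual internals:

    def can_fit(host_cell, instance_cell):
        # free memory on this host cell, keyed by pagesize in KiB,
        # e.g. {4: ..., 2048: ...}; values are free KiB of that size
        free = host_cell['free_kb_by_pagesize']
        want_kb = instance_cell['memory_kb']
        if instance_cell['pagesize_kb'] is not None:
            # (#1) explicit pagesize requested: check only that size,
            #      with no overcommit
            return free.get(instance_cell['pagesize_kb'], 0) >= want_kb
        if any(size > 4 for size in free):
            # (#2) no pagesize requested but the host has hugepages:
            #      fall back to the smallest pagesize on the host
            return free[min(free)] >= want_kb
        # (#3) neither side cares about pagesizes: plain memory check,
        #      where the usual overcommit rules apply elsewhere
        return host_cell['free_memory_kb'] >= want_kb

    host = {'free_kb_by_pagesize': {4: 8 * 1024 * 1024, 2048: 4 * 1024 * 1024},
            'free_memory_kb': 12 * 1024 * 1024}
    # requests 2 GiB of 2M pages: case (#1), fits
    print(can_fit(host, {'memory_kb': 2 * 1024 * 1024, 'pagesize_kb': 2048}))
    # requests 2 GiB with no pagesize: case (#2), checked against 4K pages only
    print(can_fit(host, {'memory_kb': 2 * 1024 * 1024, 'pagesize_kb': None}))

Issue (1) above falls out of case (#2): the same no-overcommit comparison used for real hugepage requests is now applied to instances that never asked for one.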
On Mon, 2019-01-07 at 17:32 +0000, Stephen Finucane wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances memory pagesize aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
    (#1) Check if each host cell can provide enough memory of the
         pagesize requested for each instance cell
Otherwise
    If the host has hugepages
        (#2) Check if each host cell can provide enough memory of the
             smallest pagesize available on the host for each instance cell
    Otherwise
        (#3) Check if each host cell can provide enough memory for each
             instance cell, ignoring pagesizes
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages 2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually called that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer fix would be to start tracking hugepages in a non-NUMA aware way too but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.
For what it's worth, the review in question https://review.openstack.org/#/c/532168 actually attempts to implement option 1 from https://bugs.launchpad.net/nova/+bug/1439247

The first time I tried to fix issue 2 was with my proposal for the AggregateTypeExtraSpecsAffinityFilter https://review.openstack.org/#/c/183876/4/specs/liberty/approved/aggregate-f... which became the out-of-tree AggregateInstanceTypeFilter after 3 cycles of trying to get it upstream. https://github.com/openstack/nfv-filters/blob/master/nfv_filters/nova/schedu...

The AggregateTypeExtraSpecsAffinityFilter or AggregateInstanceTypeFilter was a filter we developed specifically to enforce separation of instances that use explicit memory pages from those that do not, and also to cater for DPDK's hugepage requirement and to enforce separation of pinned and unpinned guests. We finally got approval to publish a blog on the topic in January of 2017 https://software.intel.com/en-us/blogs/2017/01/04/filter-by-host-aggregate-m... based on the content in the second version of the spec https://review.openstack.org/#/c/314097/12/specs/newton/approved/aggregate-i...

This filter was used in a semi-production 4G trial deployment, in addition to lab use with some partners I was working with at the time, but we decided to stop supporting it on the assumption that placement would solve it :)

A lot of the capabilities of the out-of-tree filter could likely be achieved with some extensions to placement, but they are not supported by placement today. I have raised the topic in the past of required traits on a resource provider that need to be present in the request for an allocation to be made against that resource provider. Similarly, I have raised the idea of forbidden traits on a resource provider that eliminate the resource provider as a candidate if present in the request. This is an inverse relationship of the required and forbidden traits we have today, but it is what the filter we implemented in 2015 did before placement, using aggregate metadata. I think there is a generalised problem statement here that would be a legitimate use case for placement outside of simply tracking hugepages (or, preferably, memory of all page sizes) in placement.

I would be in favour of fixing oversubscription (which is issue 1) this cycle, as that is clearly a bug, as a short-term solution which we could backport, and exploring addressing both issues 1 and 2 with placement, or by re-proposing the out-of-tree filter if placement is deemed out of scope. That said, I too am interested to hear what others think, especially the placement folks. You can just use host aggregates and the existing filter to address issue 2, but it's really easy to get wrong and it's not very well documented that it is required.
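To illustrate the aggregate-based separation, here is a minimal sketch. It assumes the AggregateInstanceExtraSpecsFilter is enabled; the 'hugepages' metadata key and the aggregate, flavor and host names are illustrative only, and a mirror-image aggregate and key would be needed for the non-hugepage hosts and flavors:

    # hosts dedicated to instances that request an explicit page size
    $ openstack aggregate create hugepage-hosts
    $ openstack aggregate set --property hugepages=true hugepage-hosts
    $ openstack aggregate add host hugepage-hosts compute0

    # steer hugepage flavors onto that aggregate
    $ openstack flavor set FLAVOR \
        --property hw:mem_page_size=large \
        --property aggregate_instance_extra_specs:hugepages=true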
Thoughts? Stephen
On Mon, Jan 7, 2019 at 6:32 PM, Stephen Finucane <sfinucan@redhat.com> wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances memory pagesize aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
    (#1) Check if each host cell can provide enough memory of the
         pagesize requested for each instance cell
Otherwise
    If the host has hugepages
        (#2) Check if each host cell can provide enough memory of the
             smallest pagesize available on the host for each instance cell
    Otherwise
        (#3) Check if each host cell can provide enough memory for each
             instance cell, ignoring pagesizes
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages 2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually called that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer fix would be to start tracking hugepages in a non-NUMA aware way too but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.
If you feel that fixing (1) is pretty simple then I suggest to do that and document the limitation of (2) while we think about a proper solution. gibi
Thoughts? Stephen
On Tue, 2019-01-08 at 08:54 +0000, Balázs Gibizer wrote:
On Mon, Jan 7, 2019 at 6:32 PM, Stephen Finucane <sfinucan@redhat.com> wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances memory pagesize aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
    (#1) Check if each host cell can provide enough memory of the
         pagesize requested for each instance cell
Otherwise
    If the host has hugepages
        (#2) Check if each host cell can provide enough memory of the
             smallest pagesize available on the host for each instance cell
    Otherwise
        (#3) Check if each host cell can provide enough memory for each
             instance cell, ignoring pagesizes
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages 2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually called that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer fix would be to start tracking hugepages in a non-NUMA aware way too but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.
If you feel that fixing (1) is pretty simple then I suggest to do that and document the limitation of (2) while we think about a proper solution.
gibi
I have (1) fixed here:

https://review.openstack.org/#/c/629281/

That said, I'm not sure if it's the best thing to do. From what I'm hearing, it seems the advice we should be giving is to not mix instances with/without NUMA topologies, with/without hugepages and with/without CPU pinning. We've only documented the latter, as discussed on this related bug by cfriesen:

https://bugs.launchpad.net/nova/+bug/1792985

Given that we should be advising folks not to mix these (something I wasn't aware of until now), what does the original patch actually give us? If you're not mixing instances with/without hugepages, then the only use case that would fix is booting an instance with a NUMA topology but no hugepages on a host that had hugepages (because the instance would be limited to CPUs and memory from one NUMA node, but it's conceivable all available memory could be on another NUMA node). That seems like a very esoteric use case that might be better solved by perhaps making the reserved memory configuration option optionally NUMA-specific. This would allow us to mark this hugepage memory, which is clearly not intended for consumption by nova (remember: this host only handles non-hugepage instances), as reserved on a per-node basis. I'm not sure how we would map this to placement, though I'm sure it could be figured out.

jaypipes is going to have so much fun mapping all this in placement :D

Stephen
On Tue, 2019-01-08 at 08:54 +0000, Balázs Gibizer wrote:
On Mon, Jan 7, 2019 at 6:32 PM, Stephen Finucane <sfinucan@redhat.com> wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances memory pagesize aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
    (#1) Check if each host cell can provide enough memory of the
         pagesize requested for each instance cell
Otherwise
    If the host has hugepages
        (#2) Check if each host cell can provide enough memory of the
             smallest pagesize available on the host for each instance cell
    Otherwise
        (#3) Check if each host cell can provide enough memory for each
             instance cell, ignoring pagesizes
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages 2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually called that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer fix would be to start tracking hugepages in a non-NUMA aware way too but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.
If you feel that fixing (1) is pretty simple then I suggest to do that and document the limitation of (2) while we think about a proper solution.
gibi
I have (1) fixed here:
https://review.openstack.org/#/c/629281/
That said, I'm not sure if it's the best thing to do. From what I'm hearing, it seems the advice we should be giving is to not mix instances with/without NUMA topologies, with/without hugepages and

It should be with and without hw:mem_page_size. Guests with that set should not be mixed with guests without that set on the same host, and with sahid's patch and your patch this now becomes safe if the guest without hw:mem_page_size has a NUMA topology. Mixing hugepage and non-hugepage guests is fine provided the non-hugepage guest has an implicit or explicit NUMA topology, such as a guest that is using CPU pinning.

with/without CPU pinning. We've only documented the latter, as discussed on this related bug by cfriesen:
https://bugs.launchpad.net/nova/+bug/1792985
Given that we should be advising folks not to mix these (something I wasn't aware of until now), what does the original patch actually give us? If you're not mixing instances with/without hugepages, then the only use case that would fix is booting an instance with a NUMA topology but no hugepages on a host that had hugepages (because the instance would be limited to CPUs and memory from one NUMA nodes, but it's conceivable all available memory could be on another NUMA node). That seems like a very esoteric use case that might be better solved by
perhaps making the reserved memory configuration option optionally NUMA specific.

Well, I have been asking for that for 2-3 releases. I would like to do that independently of this issue, and I think it will be a requirement if we ever model mempages per NUMA node in placement.

This would allow us to mark this hugepage memory, which is clearly not intended for consumption by nova (remember: this host only handles non-hugepage instances)

Again, it is safe to mix hugepage instances with non-hugepage instances if hw:mem_page_size is set in the non-hugepage case. But with your scenario in mind, we can already reserve the hugepage memory for host use by setting reserved_huge_pages in the [DEFAULT] section of nova.conf (see the example below): https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.res...

, as reserved on a per-node basis. I'm not sure how we would map this to placement, though I'm sure it could be figured out.
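For example, a nova.conf fragment along these lines; the node numbers, sizes and counts are illustrative only, see the config reference linked above for the exact syntax:

    [DEFAULT]
    # reserve 64 x 2M pages on NUMA node 0 and one 1G page on NUMA node 1
    # for host (non-nova) use
    reserved_huge_pages = node:0,size:2048,count:64
    reserved_huge_pages = node:1,size:1GB,count:1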
On Tue, 2019-01-08 at 18:38 +0000, Stephen Finucane wrote:

This is not that esoteric. One simple example is an operator who has configured some number of hugepages on the hypervisor and wants to run pinned instances, some of which have hugepages and some that don't. This works fine today, however oversubscription of memory in the non-hugepage case is broken as per the bug.

That part is simple: the placement inventory would just have the reserved value set to the value of the reserved_huge_pages config option.
jaypipes is going to have so much fun mapping all this in placement :D
We have discussed this at length before, so placement can already model this quite well if nova created the RPs and inventories for mempages. The main question is: can we stop modeling MEMORY_MB inventories on the root compute node RP entirely?

I personally would like to make all instances NUMA-affined by default, e.g. we would start treating all instances as if hw:numa_nodes=1 was set and preferably hw:mem_page_size=small. This would significantly simplify our lives in placement, but it has a downside: if you want to create really large instances they must be multi-NUMA. E.g. if the guest will be larger than will fit in a single host NUMA node, it must have hw:numa_nodes>1 to be scheduled. The simple fact is that such an instance is already spanning host NUMA nodes, but we are not telling the guest that. By actually telling the guest it has multiple NUMA nodes we would improve the guest performance, but it's a behaviour change that not everyone will like.

Our current practice of tracking memory and CPUs both per NUMA node and per host is tech debt that we need to clean up at some point, or live with the fact that NUMA will never be modeled in placement. We already have NUMA affinity for vswitch and PCI/SR-IOV devices, and we will/should have it for vGPUs and pmem in the future. Long term, I think we would only track things per NUMA node, but I know sylvain has a detailed spec on this which has more context than we can reasonably discuss here.
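As a purely hypothetical illustration of that kind of modeling (none of this exists in nova or placement today, and the provider names and custom resource classes are made up), per-NUMA-node child providers could carry the memory inventories instead of the root provider:

    compute0                      (root provider, no MEMORY_MB inventory)
    +-- compute0_numa0
    |     MEMORY_MB:          total=65536  reserved=4096  allocation_ratio=1.5
    |     CUSTOM_MEMPAGE_2M:  total=16384  reserved=64    allocation_ratio=1.0
    +-- compute0_numa1
          MEMORY_MB:          total=65536  reserved=0     allocation_ratio=1.5
          CUSTOM_MEMPAGE_1G:  total=16     reserved=0     allocation_ratio=1.0

Small pages would stay overcommittable via the allocation ratio, while the hugepage inventories would not be.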
Stephen
On 1/8/2019 12:38 PM, Stephen Finucane wrote:
I have (1) fixed here:
https://review.openstack.org/#/c/629281/
That said, I'm not sure if it's the best thing to do. From what I'm hearing, it seems the advice we should be giving is to not mix instances with/without NUMA topologies, with/without hugepages and with/without CPU pinning. We've only documented the latter, as discussed on this related bug by cfriesen:
https://bugs.launchpad.net/nova/+bug/1792985
Given that we should be advising folks not to mix these (something I wasn't aware of until now), what does the original patch actually give us?
I think we should look at it from the other direction...what is the ultimate *desired* behaviour?

Personally, I'm coming at it from a "small-cloud" perspective where we may only have one or two compute nodes. As such, the host-aggregate solution doesn't really work.

I would like to be able to run cpu-pinned and cpu-shared instances on the same node. I would like to run small-page (with overcommit) and huge-page (without overcommit) instances on the same node. I would like to run cpu-shared/small-page instances (which float over the whole host) on the same host as a cpu-pinned/small-page instance (which is pinned to specific NUMA nodes).

We have a warning in the docs currently that is specifically for separating CPU-pinned and CPU-shared instances, but we also have a spec that plans to specifically support that case. The way the code is currently written we also need to separate NUMA-affined small-page instances from non-NUMA-affined small-page instances, but I think that's a bug, not a sensible design.

Chris
On Mon, Jan 07, 2019 at 05:32:32PM +0000, Stephen Finucane wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances memory pagesize aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
    (#1) Check if each host cell can provide enough memory of the
         pagesize requested for each instance cell
Otherwise
    If the host has hugepages
        (#2) Check if each host cell can provide enough memory of the
             smallest pagesize available on the host for each instance cell
    Otherwise
        (#3) Check if each host cell can provide enough memory for each
             instance cell, ignoring pagesizes
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages
I think that we should not accept any overcommit here. Only instances with an InstanceNUMATopology associated pass through this part of the check. Such instances want to use features like a guest NUMA topology, so that their memory is mapped onto host NUMA nodes, or CPU pinning. Both cases are used for performance reasons and to avoid any cross-node memory latency.
2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware
That is another issue. We report all of the physical memory (small pages + allocated hugepages) to the resource tracker. The difficulty is that we can't just change the virt driver to report only small pages, since some instances would then not be able to get scheduled. We should basically change the resource tracker so it can take into account the different kinds of page memory. But it's not really an issue, since instances that use "NUMA features" (in the Nova world) should be isolated to an aggregate and not mixed with non-NUMA instances. The reason is simple: non-NUMA instances have no boundaries and break the rules of NUMA instances.
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually called that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer fix would be to start tracking hugepages in a non-NUMA aware way too but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.
Thoughts? Stephen
On Tue, 2019-01-08 at 11:06 +0100, Sahid Orentino Ferdjaoui wrote:
On Mon, Jan 07, 2019 at 05:32:32PM +0000, Stephen Finucane wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances memory pagesize aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
    (#1) Check if each host cell can provide enough memory of the
         pagesize requested for each instance cell
Otherwise
    If the host has hugepages
        (#2) Check if each host cell can provide enough memory of the
             smallest pagesize available on the host for each instance cell
    Otherwise
        (#3) Check if each host cell can provide enough memory for each
             instance cell, ignoring pagesizes
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages
I think that we should not accept any overcommit. Only instances with an InstanceNUMATopology associated pass to this part of check. Such instances want to use features like guest NUMA topology so their memory mapped on host NUMA nodes or CPU pinning. Both cases are used for performance reason and to avoid any cross memory latency.
The issue with this is that we had previously designed everything *to* allow overcommit:

https://github.com/openstack/nova/blob/18.0.0/nova/virt/hardware.py#L1047-L1...

The only time this doesn't apply is if CPU pinning is also in action (remembering that CPU pinning and NUMA topologies are tightly bound and CPU pinning implies a NUMA topology, much to Jay's consternation). As noted below, our previous advice was not to mix hugepage instances and non-hugepage instances, meaning hosts handling non-hugepage instances should not have hugepages (or should mark the memory consumed by them as reserved for the host). We have in effect broken previous behaviour in the name of solving a bug that didn't necessarily have to be fixed yet.
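For context, the check that link points at is essentially an allocation-ratio-scaled limit. A minimal sketch of the idea, with illustrative names rather than nova's actual code:

    def memory_fits(total_kb, used_kb, requested_kb, ram_allocation_ratio=1.5):
        # with overcommit, the usable limit is physical memory scaled by the
        # allocation ratio, so used + requested may exceed physical memory
        return used_kb + requested_kb <= total_kb * ram_allocation_ratio

This is exactly what the hugepage-style check in case (#2) does not do: it compares against the free pages of a given size with no ratio applied, which is why overcommit breaks for instances that never asked for hugepages.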
2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware
That is an other issue. We report to the resource tracker all the physical memory (small pages + hugepages allocated). The difficulty is that we can't just change the virt driver to report only small pages. Some instances wont be able to get scheduled. We should basically change the resource tracker so it can take into account the different kind of page memory.
Agreed (likely by moving tracking of this resource to placement, I assume). It's a longer-term fix though.
But it's not really an issue since instances that use "NUMA features" (in Nova world) should be isolated to an aggregate and not be mixed with no-NUMA instances. The reason is simple no-NUMA instances do not have boundaries and break rules of NUMA instances.
Again, we have to be careful not to mix up NUMA and CPU pinning. It's perfectly fine to have NUMA without CPU pinning, though not the other way around. For example:

    $ openstack flavor set --property hw:numa_nodes=2 FLAVOR

From what I can tell, there are three reasons that an instance will have a NUMA topology: the user explicitly requested one, the user requested CPU pinning and got one implicitly, or the user requested a specific pagesize and, again, got one implicitly. We handle the latter two with the advice given below, but I don't think anyone has ever said we must separate instances that had a user-specified NUMA topology from those that had no NUMA topology. If we're going down this path, we need clear docs.

Stephen
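In flavor extra spec terms, those three paths are, respectively (FLAVOR is a placeholder):

    # explicit NUMA topology
    $ openstack flavor set --property hw:numa_nodes=2 FLAVOR
    # implicit, via CPU pinning
    $ openstack flavor set --property hw:cpu_policy=dedicated FLAVOR
    # implicit, via an explicit pagesize
    $ openstack flavor set --property hw:mem_page_size=large FLAVOR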
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually called that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer fix would be to start tracking hugepages in a non-NUMA aware way too but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.
Thoughts? Stephen
On Tue, Jan 08, 2019 at 10:47:47AM +0000, Stephen Finucane wrote:
On Tue, 2019-01-08 at 11:06 +0100, Sahid Orentino Ferdjaoui wrote:
On Mon, Jan 07, 2019 at 05:32:32PM +0000, Stephen Finucane wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances memory pagesize aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
    (#1) Check if each host cell can provide enough memory of the
         pagesize requested for each instance cell
Otherwise
    If the host has hugepages
        (#2) Check if each host cell can provide enough memory of the
             smallest pagesize available on the host for each instance cell
    Otherwise
        (#3) Check if each host cell can provide enough memory for each
             instance cell, ignoring pagesizes
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages
I think that we should not accept any overcommit. Only instances with an InstanceNUMATopology associated pass to this part of check. Such instances want to use features like guest NUMA topology so their memory mapped on host NUMA nodes or CPU pinning. Both cases are used for performance reason and to avoid any cross memory latency.
This issue with this is that we had previously designed everything *to* allow overcommit:
https://github.com/openstack/nova/blob/18.0.0/nova/virt/hardware.py#L1047-L1...
This code never worked, Stephen; it is only there to please the related unit tests. I would not recommend using it as a reference.
The only time this doesn't apply is if CPU pinning is also in action (remembering that CPU pinning and NUMA topologies are tightly bound and CPU pinning implies a NUMA topology, much to Jay's consternation). As noted below, our previous advice was not to mix hugepage instances and non-hugepage instances, meaning hosts handling non-hugepage instances should not have hugepages (or should mark the memory consumed by them as reserved for host). We have in effect broken previous behaviour in the name of solving a bug that didn't necessarily have to be fixed yet.
2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware
That is an other issue. We report to the resource tracker all the physical memory (small pages + hugepages allocated). The difficulty is that we can't just change the virt driver to report only small pages. Some instances wont be able to get scheduled. We should basically change the resource tracker so it can take into account the different kind of page memory.
Agreed (likely via move tracking of this resource to placement, I assume). It's a longer term fix though.
But it's not really an issue since instances that use "NUMA features" (in Nova world) should be isolated to an aggregate and not be mixed with no-NUMA instances. The reason is simple no-NUMA instances do not have boundaries and break rules of NUMA instances.
Again, we have to be careful not to mix up NUMA and CPU pinning. It's perfectly fine to have NUMA without CPU pinning, though not the other way around. For example:
$ openstack flavor set --property hw:numa_nodes=2 FLAVOR
From what I can tell, there are three reasons that an instance will have a NUMA topology: the user explicitly requested one, the user requested CPU pinning and got one implicitly, or the user requested a specific pagesize and, again, got one implicitly. We handle the latter two with the advice given below, but I don't think anyone has ever said we must separate instances that had a user-specified NUMA topology from those that had no NUMA topology. If we're going down this path, we need clear docs.
The implementation is pretty old and it was a first design from scratch; not all the situations have been taken into account or documented. If we want to create specific behaviours we are going to add more complexity to something which is already complex, and which is not completely stable, as an example the patch you have mentioned which was merged last release.

I agree documenting is probably where we should go: don't try to mix instances with an InstanceNUMATopology and without, since Nova uses a different way to compute their resources, and likewise don't try to overcommit such instances. We basically recommend using aggregates for pinning, realtime and hugepages, so it looks reasonable to add guest NUMA topology to that list.
Stephen
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually called that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer fix would be to start tracking hugepages in a non-NUMA aware way too but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.
Thoughts? Stephen
On Tue, Jan 08, 2019 at 12:50:27PM +0100, Sahid Orentino Ferdjaoui wrote:
On Tue, Jan 08, 2019 at 10:47:47AM +0000, Stephen Finucane wrote:
On Tue, 2019-01-08 at 11:06 +0100, Sahid Orentino Ferdjaoui wrote:
On Mon, Jan 07, 2019 at 05:32:32PM +0000, Stephen Finucane wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances memory pagesize aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
    (#1) Check if each host cell can provide enough memory of the
         pagesize requested for each instance cell
Otherwise
    If the host has hugepages
        (#2) Check if each host cell can provide enough memory of the
             smallest pagesize available on the host for each instance cell
    Otherwise
        (#3) Check if each host cell can provide enough memory for each
             instance cell, ignoring pagesizes
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages
I think that we should not accept any overcommit. Only instances with an InstanceNUMATopology associated pass to this part of check. Such instances want to use features like guest NUMA topology so their memory mapped on host NUMA nodes or CPU pinning. Both cases are used for performance reason and to avoid any cross memory latency.
This issue with this is that we had previously designed everything *to* allow overcommit:
https://github.com/openstack/nova/blob/18.0.0/nova/virt/hardware.py#L1047-L1...
This code never worked Stephen, that instead of to please unit tests related. I would not recommend to use it as a reference.
The only time this doesn't apply is if CPU pinning is also in action (remembering that CPU pinning and NUMA topologies are tightly bound and CPU pinning implies a NUMA topology, much to Jay's consternation). As noted below, our previous advice was not to mix hugepage instances and non-hugepage instances, meaning hosts handling non-hugepage instances should not have hugepages (or should mark the memory consumed by them as reserved for host). We have in effect broken previous behaviour in the name of solving a bug that didn't necessarily have to be fixed yet.
2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware
That is an other issue. We report to the resource tracker all the physical memory (small pages + hugepages allocated). The difficulty is that we can't just change the virt driver to report only small pages. Some instances wont be able to get scheduled. We should basically change the resource tracker so it can take into account the different kind of page memory.
Agreed (likely via move tracking of this resource to placement, I assume). It's a longer term fix though.
But it's not really an issue since instances that use "NUMA features" (in Nova world) should be isolated to an aggregate and not be mixed with no-NUMA instances. The reason is simple no-NUMA instances do not have boundaries and break rules of NUMA instances.
Again, we have to be careful not to mix up NUMA and CPU pinning. It's perfectly fine to have NUMA without CPU pinning, though not the other way around. For example:
$ openstack flavor set --property hw:numa_nodes=2 FLAVOR
From what I can tell, there are three reasons that an instance will have a NUMA topology: the user explicitly requested one, the user requested CPU pinning and got one implicitly, or the user requested a specific pagesize and, again, got one implicitly. We handle the latter two with the advice given below, but I don't think anyone has ever said we must separate instances that had a user-specified NUMA topology from those that had no NUMA topology. If we're going down this path, we need clear docs.
Now I remember why we can't support it. When defining a guest NUMA topology (hw:numa_nodes), the memory is mapped to the assigned host NUMA nodes, meaning that the guest memory can't be swapped out. If a non-NUMA instance starts using memory from host NUMA nodes used by a guest with a NUMA topology, the guest with NUMA can run out of memory and be killed.
The implementation is pretty old and it was a first design from scratch, all the situations have not been take into account or been documented. If we want create specific behaviors we are going to add more complexity on something which is already, and which is not completely stable, as an example the patch you have mentioned which has been merged last release.
I agree documenting is probably where we should go; don't try to mix instances with InstanceNUMATopology and without, Nova uses a different way to compute their resources, like don't try to overcommit such instances.
We basically recommend to use aggregate for pinning, realtime, hugepages, so it looks reasonable to add guest NUMA topology to that list.
Stephen
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually called that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer fix would be to start tracking hugepages in a non-NUMA aware way too but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.
Thoughts? Stephen
On Tue, Jan 08, 2019 at 12:50:27PM +0100, Sahid Orentino Ferdjaoui wrote:
On Tue, Jan 08, 2019 at 10:47:47AM +0000, Stephen Finucane wrote:
On Tue, 2019-01-08 at 11:06 +0100, Sahid Orentino Ferdjaoui wrote:
On Mon, Jan 07, 2019 at 05:32:32PM +0000, Stephen Finucane wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances memory pagesize aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
    (#1) Check if each host cell can provide enough memory of the
         pagesize requested for each instance cell
Otherwise
    If the host has hugepages
        (#2) Check if each host cell can provide enough memory of the
             smallest pagesize available on the host for each instance cell
    Otherwise
        (#3) Check if each host cell can provide enough memory for each
             instance cell, ignoring pagesizes
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages
I think that we should not accept any overcommit. Only instances with an InstanceNUMATopology associated pass to this part of check. Such instances want to use features like guest NUMA topology so their memory mapped on host NUMA nodes or CPU pinning. Both cases are used for performance reason and to avoid any cross memory latency.
This issue with this is that we had previously designed everything *to* allow overcommit:
https://github.com/openstack/nova/blob/18.0.0/nova/virt/hardware.py#L1047-L1...
This code never worked Stephen, that instead of to please unit tests related. I would not recommend to use it as a reference.
The only time this doesn't apply is if CPU pinning is also in action (remembering that CPU pinning and NUMA topologies are tightly bound and CPU pinning implies a NUMA topology, much to Jay's consternation). As noted below, our previous advice was not to mix hugepage instances and non-hugepage instances, meaning hosts handling non-hugepage instances should not have hugepages (or should mark the memory consumed by them as reserved for host). We have in effect broken previous behaviour in the name of solving a bug that didn't necessarily have to be fixed yet.
2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware
That is an other issue. We report to the resource tracker all the physical memory (small pages + hugepages allocated). The difficulty is that we can't just change the virt driver to report only small pages. Some instances wont be able to get scheduled. We should basically change the resource tracker so it can take into account the different kind of page memory.
Agreed (likely via move tracking of this resource to placement, I assume). It's a longer term fix though.
But it's not really an issue since instances that use "NUMA features" (in Nova world) should be isolated to an aggregate and not be mixed with no-NUMA instances. The reason is simple no-NUMA instances do not have boundaries and break rules of NUMA instances.
Again, we have to be careful not to mix up NUMA and CPU pinning. It's perfectly fine to have NUMA without CPU pinning, though not the other way around. For example:
$ openstack flavor set --property hw:numa_nodes=2 FLAVOR
From what I can tell, there are three reasons that an instance will
have a NUMA topology: the user explicitly requested one, the user requested CPU pinning and got one implicitly, or the user requested a specific pagesize and, again, got one implicitly. We handle the latter two with the advice given below, but I don't think anyone has ever said we must separate instances that had a user-specified NUMA topology from those that had no NUMA topology. If we're going down this path, we need clear docs.
Now I remember why we can't support it. When defining guest NUMA topology (hw:numa_node) the memory is mapped to the assigned host NUMA nodes meaning that the guest memory can't swap out.
If a non-NUMA instance starts using memory from host NUMA nodes used by a guest with NUMA it can result that the guest with NUMA run out of memory and be killed.
On Tue, 2019-01-08 at 15:17 +0100, Sahid Orentino Ferdjaoui wrote:

The guest memory should still be able to be swapped out. We do not memlock the pages when we set hw:numa_nodes; we only do that for realtime instances. It's done implicitly for hugepages, but if you taskset/memtune cores/RAM to a host NUMA node it does not prevent the kernel from paging that memory out to swap space. locked defaults to false in the memory backing https://github.com/openstack/nova/blob/88951ca98e1b286b58aa1ad94f9af40b8260c... and we only set it to true here https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L4... if we take the wants-realtime branch.

That is unrelated to this. That happens because the OOM killer works per NUMA node, the memory pressure value for VMs tends to be higher than for other processes, and the kernel will prefer to kill them over other options. An instance live migration, or any other process that adds extra memory pressure, can trigger the same effect. Part of the issue is that the host reserved memory config option is not per NUMA node, but the OOM killer in the kernel is run per NUMA node.
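For reference, in libvirt domain XML terms the distinction looks roughly like this. The elements are real libvirt ones, but the values and the mapping to nova code paths are only a sketch of what is described above:

    <!-- memory bound to a host NUMA node, but still swappable -->
    <numatune>
      <memory mode='strict' nodeset='0'/>
    </numatune>

    <!-- memory locked into RAM so it cannot be swapped out; only added on
         the realtime path (hugepage-backed memory is effectively
         unswappable anyway) -->
    <memoryBacking>
      <locked/>
    </memoryBacking>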
The implementation is pretty old and it was a first design from scratch, all the situations have not been take into account or been documented. If we want create specific behaviors we are going to add more complexity on something which is already, and which is not completely stable, as an example the patch you have mentioned which has been merged last release.
I agree documenting is probably where we should go; don't try to mix instances with InstanceNUMATopology and without, Nova uses a different way to compute their resources, like don't try to overcommit such instances.
We basically recommend to use aggregate for pinning, realtime, hugepages, so it looks reasonable to add guest NUMA topology to that list.
Stephen
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually called that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer fix would be to start tracking hugepages in a non-NUMA aware way too but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.
Thoughts? Stephen
On Tue, 2019-01-08 at 15:17 +0100, Sahid Orentino Ferdjaoui wrote:
On Tue, Jan 08, 2019 at 12:50:27PM +0100, Sahid Orentino Ferdjaoui wrote:
On Tue, Jan 08, 2019 at 10:47:47AM +0000, Stephen Finucane wrote:
On Tue, 2019-01-08 at 11:06 +0100, Sahid Orentino Ferdjaoui wrote:
On Mon, Jan 07, 2019 at 05:32:32PM +0000, Stephen Finucane wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances memory pagesize aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
    (#1) Check if each host cell can provide enough memory of the
         pagesize requested for each instance cell
Otherwise
    If the host has hugepages
        (#2) Check if each host cell can provide enough memory of the
             smallest pagesize available on the host for each instance cell
    Otherwise
        (#3) Check if each host cell can provide enough memory for each
             instance cell, ignoring pagesizes
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages
I think that we should not accept any overcommit. Only instances with an InstanceNUMATopology associated pass to this part of check. Such instances want to use features like guest NUMA topology so their memory mapped on host NUMA nodes or CPU pinning. Both cases are used for performance reason and to avoid any cross memory latency.
This issue with this is that we had previously designed everything *to* allow overcommit:
https://github.com/openstack/nova/blob/18.0.0/nova/virt/hardware.py#L1047-L1...
This code never worked Stephen, that instead of to please unit tests related. I would not recommend to use it as a reference.
The only time this doesn't apply is if CPU pinning is also in action (remembering that CPU pinning and NUMA topologies are tightly bound and CPU pinning implies a NUMA topology, much to Jay's consternation). As noted below, our previous advice was not to mix hugepage instances and non-hugepage instances, meaning hosts handling non-hugepage instances should not have hugepages (or should mark the memory consumed by them as reserved for host). We have in effect broken previous behaviour in the name of solving a bug that didn't necessarily have to be fixed yet.
2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware
That is an other issue. We report to the resource tracker all the physical memory (small pages + hugepages allocated). The difficulty is that we can't just change the virt driver to report only small pages. Some instances wont be able to get scheduled. We should basically change the resource tracker so it can take into account the different kind of page memory.
Agreed (likely by moving tracking of this resource to placement, I assume). It's a longer-term fix though.
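Purely as a straw man (none of these resource classes exist today; the names are made up), per-pagesize tracking in placement could mean each compute node, or each NUMA-cell provider, exposing a separate inventory per page size, something like:

    # Hypothetical inventories on a compute (or per-NUMA-cell) resource
    # provider; the CUSTOM_* classes below are invented for illustration.
    inventories = {
        'MEMORY_MB': {                # small (4k) pages, overcommit allowed
            'total': 65536,
            'reserved': 2048,
            'allocation_ratio': 1.5,
        },
        'CUSTOM_MEMORY_PAGE_2MB': {   # 2M hugepages, counted in pages
            'total': 4096,
            'reserved': 0,
            'allocation_ratio': 1.0,  # no overcommit for hugepages
        },
        'CUSTOM_MEMORY_PAGE_1GB': {   # 1G hugepages
            'total': 8,
            'reserved': 0,
            'allocation_ratio': 1.0,
        },
    }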
But it's not really an issue, since instances that use "NUMA features" (in the Nova world) should be isolated to an aggregate and not mixed with non-NUMA instances. The reason is simple: non-NUMA instances do not have boundaries and break the rules of NUMA instances.
Again, we have to be careful not to mix up NUMA and CPU pinning. It's perfectly fine to have NUMA without CPU pinning, though not the other way around. For example:
$ openstack flavor set --property hw:numa_nodes=2 FLAVOR
From what I can tell, there are three reasons that an instance will have a NUMA topology: the user explicitly requested one, the user requested CPU pinning and got one implicitly, or the user requested a specific pagesize and, again, got one implicitly. We handle the latter two with the advice given below, but I don't think anyone has ever said we must separate instances that had a user-specified NUMA topology from those that had no NUMA topology. If we're going down this path, we need clear docs.
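To spell those three paths out, a simplified sketch (this is not nova's numa_get_constraints, just the decision it effectively makes from flavor extra specs):

    def why_numa_topology(extra_specs):
        # Return the reason this flavor ends up with an InstanceNUMATopology,
        # or None if it stays non-NUMA.
        if 'hw:numa_nodes' in extra_specs:
            return 'explicitly requested'
        if extra_specs.get('hw:cpu_policy') == 'dedicated':
            return 'implicit, via CPU pinning'
        if 'hw:mem_page_size' in extra_specs:
            return 'implicit, via an explicit page size'
        return None

    assert why_numa_topology({'hw:numa_nodes': '2'}) == 'explicitly requested'
    assert why_numa_topology({'hw:cpu_policy': 'dedicated'}) is not None
    assert why_numa_topology({'hw:mem_page_size': 'large'}) is not None
    assert why_numa_topology({}) is None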
Now I remember why we can't support it. When defining a guest NUMA topology (hw:numa_nodes), the memory is mapped to the assigned host NUMA nodes, meaning that the guest memory can't be swapped out. If a non-NUMA instance starts using memory from host NUMA nodes used by a guest with NUMA, the guest with NUMA can run out of memory and be killed.
Based on my minimal test, it seems to work just fine? https://bugs.launchpad.net/nova/+bug/1810977 The instances boot with the patch reverted. Is there something I've missed?
The implementation is pretty old and was a first design from scratch; not all situations have been taken into account or documented. If we want to create specific behaviours we are going to add more complexity to something which is already complex, and which is not completely stable; the patch you have mentioned, which was merged last release, is an example.
I agree documenting is probably where we should go: don't try to mix instances with an InstanceNUMATopology and instances without one, since Nova uses a different way to compute their resources, and don't try to overcommit such instances.
We basically recommend using aggregates for pinning, realtime and hugepages, so it looks reasonable to add guest NUMA topology to that list.
Stephen
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually call that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer fix would be to start tracking hugepages in a non-NUMA aware way too but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.
Thoughts? Stephen
On Mon, Jan 07, 2019 at 05:32:32PM +0000, Stephen Finucane wrote:
We've been looking at a patch that landed some months ago and have spotted some issues:
https://review.openstack.org/#/c/532168
In summary, that patch is intended to make the memory check for instances memory pagesize aware. The logic it introduces looks something like this:
If the instance requests a specific pagesize
  (#1) Check if each host cell can provide enough memory of the pagesize requested for each instance cell
Otherwise
  If the host has hugepages
    (#2) Check if each host cell can provide enough memory of the smallest pagesize available on the host for each instance cell
  Otherwise
    (#3) Check if each host cell can provide enough memory for each instance cell, ignoring pagesizes
This also has the side-effect of allowing instances with hugepages and instances with a NUMA topology but no hugepages to co-exist on the same host, because the latter will now be aware of hugepages and won't consume them. However, there are a couple of issues with this:
1. It breaks overcommit for instances without pagesize request running on hosts with different pagesizes. This is because we don't allow overcommit for hugepages, but case (#2) above means we are now reusing the same functions previously used for actual hugepage checks to check for regular 4k pages
I think that we should not accept any overcommit. Only instances with an InstanceNUMATopology associated pass through this part of the check. Such instances want to use features like a guest NUMA topology, so that their memory is mapped to host NUMA nodes, or CPU pinning. Both cases are used for performance reasons and to avoid any cross-node memory latency.
On Tue, 2019-01-08 at 11:06 +0100, Sahid Orentino Ferdjaoui wrote:
That is not necessarily correct. If I request CPU pinning, that does not imply that I don't want the ability to oversubscribe memory; that is an artifact of how we chose to implement pinning in the libvirt driver. For the case of CPU pinning specifically, I have always felt it is wrong that we create a NUMA topology for the guest implicitly. In the case of hw:numa_nodes=1, in the absence of any other extra spec or image metadata, I also do not think it is correct to disable oversubscription retroactively after supporting it for several years. Requesting a NUMA topology, outside of explicitly requesting hugepages, should never have disabled oversubscription, and changing that behaviour should have required both a microversion and a nova spec. https://review.openstack.org/#/c/532168 was simply a bug fix and therefore should not have changed the meaning of requesting a NUMA topology.
2. It doesn't fix the issue when non-NUMA instances exist on the same host as NUMA instances with hugepages. The non-NUMA instances don't run through any of the code above, meaning they're still not pagesize aware
That is another issue. We report all of the physical memory (small pages + allocated hugepages) to the resource tracker. The difficulty is that we can't just change the virt driver to report only small pages, because some instances won't be able to get scheduled. We should basically change the resource tracker so it can take the different kinds of page memory into account.
But it's not really an issue, since instances that use "NUMA features" (in the Nova world) should be isolated to an aggregate and not mixed with non-NUMA instances. The reason is simple: non-NUMA instances do not have boundaries and break the rules of NUMA instances.
It is true that today we should partition deployments using host aggregates, separating hosts with NUMA instances from hosts with non-NUMA instances. The reason issue 2 was raised is that the commit message implied the patch addressed mixing NUMA and non-NUMA guests on the same host: "Also when no pagesize is requested we should consider to compute memory usage based on small pages since the amount of physical memory available may also include some large pages." But the logic in the patch does not actually get triggered when the guest does not have a NUMA topology, so it does not actually consider the total number of small pages in that case. This was linked to a downstream bugzilla you filed, https://bugzilla.redhat.com/show_bug.cgi?id=1625119, and another for OSP 10, https://bugzilla.redhat.com/show_bug.cgi?id=1519540, which has three customer cases associated with it. On closer inspection, the patch does not address the downstream bug at all: that bug expressly states that nova does not consider small pages when mem_page_size is not set, but since we don't execute this code for non-NUMA guests we don't actually resolve the issue. The three customer issues the downstream bug claims to resolve are: 1) a scheduler race where two pinned instances get scheduled to the same set of resources (this can only be fixed with placement); 2) mixing hugepage and non-hugepage guests resulted in OOM events; 3) instances with a NUMA topology no longer respect the RAM allocation ratio. The third customer issue was directly caused by backporting this patch, the second would be resolved by using host aggregates to segregate hugepage hosts from non-NUMA hosts, and the first can't be addressed without preemptively claiming CPUs/hugepages in the scheduler/placement.
We could probably fix issue (1) by modifying those hugepage functions we're using to allow overcommit via a flag that we pass for case (#2). We can mitigate issue (2) by advising operators to split hosts into aggregates for 'hw:mem_page_size' set or unset (in addition to 'hw:cpu_policy' set to dedicated or shared/unset). I need to check but I think this may be the case in some docs (sean-k-mooney said Intel used to do this. I don't know about Red Hat's docs or upstream). In addition, we did actually call that out in the original spec:
https://specs.openstack.org/openstack/nova-specs/specs/juno/approved/virt-dr...
However, if we're doing that for non-NUMA instances, one would have to question why the patch is necessary/acceptable for NUMA instances. For what it's worth, a longer fix would be to start tracking hugepages in a non-NUMA aware way too but that's a lot more work and doesn't fix the issue now.
As such, my question is this: should we look at fixing issue (1) and documenting issue (2), or should we revert the thing wholesale until we work on a solution that could e.g. let us track hugepages via placement and resolve issue (2) too.
Thoughts? Stephen
participants (5):
- Balázs Gibizer
- Chris Friesen
- Sahid Orentino Ferdjaoui
- Sean Mooney
- Stephen Finucane