[ops][nova][placement] NUMA topology vs non-NUMA workloads
This message is primarily addressed at operators, and of those, operators who are interested in effectively managing and mixing workloads that care about NUMA with workloads that do not. There are some questions within, after some background to explain the issue.

At the PTG, Nova and Placement developers made a commitment to manage NUMA topologies more effectively within Nova and Placement. On the placement side this resulted in a spec proposing several features that would enable more expressive queries when requesting allocation candidates (places for workloads to go), resulting in fewer late scheduling failures.

At first there was one spec that discussed all the features. This morning it was split in two because one of the features is proving hard to resolve. The two specs can be found at:

* https://review.opendev.org/658510 (has all the original discussion)
* https://review.opendev.org/662191 (the less contentious features split out)

After much discussion, we would prefer not to do the feature discussed in 658510. Called 'can_split', it would allow specified classes of resource (notably VCPU and memory) to be split across multiple NUMA nodes when each node can only contribute a portion of the required resources and where those resources are modelled as inventory on the NUMA nodes, not the host at large.

While this is a good idea in principle, it turns out (see the spec) to cause many issues that require changes throughout the ecosystem, for example enforcing pinned CPUs for workloads that would normally float. It is possible to make the changes, but it would require additional contributors to join the effort, both in terms of writing the code and understanding the many issues.

So the questions:

* How important, in your cloud, is it to co-locate guests needing a NUMA topology with guests that do not? A review of documentation (upstream and vendor) shows differing levels of recommendation on this, but in many cases the recommendation is not to do it.

* If your answer to the above is "we must be able to do that": How important is it that your cloud be able to pack workloads as tightly as possible? That is: if there are two NUMA nodes and each has 2 VCPUs free, should a non-NUMA workload demanding 4 VCPUs be able to land there? Or would you prefer that not happen? (See the sketch following this message.)

* If the answer to the first question is "we can get by without that": is it satisfactory to be able to configure some hosts as NUMA aware and others as not, as described in the "NUMA topology with RPs" spec [1]? In this setup some non-NUMA workloads could end up on a NUMA host (unless otherwise excluded by traits or aggregates), but only when there was contiguous resource available.

This latter question articulates the current plan, unless responses to this message indicate it simply can't work or legions of assistance show up. Note that even if we don't do can_split, we'll still be enabling significant progress with the other features described in the second spec [2].

Thanks for your help in moving us in the right direction.

[1] https://review.opendev.org/552924
[2] https://review.opendev.org/662191

--
Chris Dent
٩◔̯◔۶ https://anticdent.org/
freenode: cdent
Chris,

From the CERN setup, I think there are dedicated cells for NUMA-optimised configurations (but maybe one of the engineers on the team could confirm to be sure).

Q: How important, in your cloud, is it to co-locate guests needing a NUMA topology with guests that do not? A review of documentation (upstream and vendor) shows differing levels of recommendation on this, but in many cases the recommendation is not to do it.
A: No co-location currently.

Q: If your answer to the above is "we must be able to do that": How important is it that your cloud be able to pack workloads as tightly as possible? That is: if there are two NUMA nodes and each has 2 VCPUs free, should a non-NUMA workload demanding 4 VCPUs be able to land there? Or would you prefer that not happen?
A: Not applicable.

Q: If the answer to the first question is "we can get by without that": is it satisfactory to be able to configure some hosts as NUMA aware and others as not, as described in the "NUMA topology with RPs" spec [1]? In this setup some non-NUMA workloads could end up on a NUMA host (unless otherwise excluded by traits or aggregates), but only when there was contiguous resource available.
A: I think this would be OK.

Tim
On 31.05.19 09:51, Tim Bell wrote:
> Chris,
>
> From the CERN setup, I think there are dedicated cells for NUMA-optimised configurations (but maybe one of the engineers on the team could confirm to be sure).
This is correct: we have dedicated cells for NUMA-aware guests (and hence do not mix NUMA-aware and NUMA-unaware guests on the same set of hosts).

Arne