[nova] NUMA scheduling

Sean Mooney smooney at redhat.com
Tue Oct 20 04:53:51 UTC 2020


On Mon, 2020-10-19 at 20:38 -0500, Eric K. Miller wrote:
> > hw:numa_nodes=1 does not enable per-NUMA-node memory tracking.
> > To resolve your OOM issue you need to set hw:mem_page_size=small or
> > hw:mem_page_size=any.
> 
> Ah!  That's what I was looking for! :)  Thank you Sean!
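
For reference, that is just a flavor extra spec, so something like the following
should do it (the flavor name is only a placeholder):

    openstack flavor set my.numa.flavor --property hw:mem_page_size=small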
> 
> > The reason that it is always selecting NUMA node 0 is that nova takes the list of
> > host NUMA nodes and checks each one using itertools.permutations, which
> > always checks the NUMA nodes in a stable order starting with NUMA node 0.
> > 
> > Since you have just set hw:numa_nodes=1 without requesting any NUMA-specific
> > resources, e.g. memory or CPUs, NUMA node 0 will effectively always fit the VM.
> 
> Makes sense.
> 
> > When you set hw:numa_nodes=1 and nothing else, the scheduler will only
> > reject a node if the number of CPUs on the NUMA node is less than the number
> > the VM requests. It will not check the memory available on the NUMA node
> > since you did not ask nova to do that via hw:mem_page_size.
> > 
> > Effectively, if you are using any NUMA feature in nova and do not set
> > hw:mem_page_size then your flavor is misconfigured, as it will not request
> > NUMA-local memory tracking to be enabled.
> 
> Good to know.
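
To make that concrete, a NUMA flavor that gets both its CPUs and its memory
checked per node would carry both properties, e.g. (flavor name is only a
placeholder):

    openstack flavor set my.numa.flavor \
      --property hw:numa_nodes=1 \
      --property hw:mem_page_size=small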
> 
> So it sounds like by setting the hw:mem_page_size parameter (probably best to choose "small" as a general default), NUMA node 0 will fill up, and
> then NUMA node 1 will be considered.  In other words, VMs will NOT be provisioned in a "round-robin" fashion between NUMA nodes.  Do I understand
> that correctly?

Yes, you do. https://bugs.launchpad.net/nova/+bug/1893121 basically tracks this. I fundamentally believe this is a performance bug,
not a feature, although others disagree. This is why we have always recommended you set hw:numa_nodes to the number of NUMA nodes on the host if
you can. The exception to that is workloads that don't support NUMA awareness, in which case you should only deviate from this advice if you
measure a performance degradation. With the default behavior you will saturate one NUMA node before using the other, which pessimises your memory
bandwidth and CPU performance. tl;dr since all the VMs are being packed onto the first NUMA node/socket, the second socket is effectively left idle
while the first socket is loaded up, so the processor cannot turbo boost as aggressively as it could if the VMs were spread evenly between the NUMA
nodes.
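
For example, on a typical dual-socket compute host with two NUMA nodes, that
recommendation works out to something like (flavor name is only a placeholder):

    openstack flavor set my.numa.flavor \
      --property hw:numa_nodes=2 \
      --property hw:mem_page_size=small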

As you get to higher utilisation that is less of an issue, but it is a non-zero effect on a lightly utilised cloud, since we spread across hosts by
default, and it is potentially less energy efficient, as the thermal load will also not be spread between the CPUs.
> 
> > You do not need to use hugepages, but you do need to enable per-NUMA-node
> > memory tracking with hw:mem_page_size=small (use non-hugepage, typically 4k,
> > pages) or hw:mem_page_size=any, which is basically the same as small except
> > the image can request hugepages if it wants to. If you set small in the
> > flavor but large in the image, that is an error. If you set any in the flavor,
> > the image can set any value it likes, such as small, large, or an explicit
> > page size, and the scheduler will honour that.
> > 
> > If you know you want the flavor to use small pages then you should just set
> > small explicitly.
> 
> Also good to know.  Thanks again!
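
As a concrete example of the "any" case: the flavor allows any page size, and
the image can then ask for hugepages (flavor and image names are only
placeholders):

    openstack flavor set my.numa.flavor --property hw:mem_page_size=any
    openstack image set my-guest-image --property hw_mem_page_size=large
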
I have added this topic to the PTG etherpad at https://etherpad.opendev.org/p/nova-wallaby-ptg, around line 170, as part of the NUMA-in-placement section.

> 
> Eric




