[Caracal] Optimizing Nova NUMA Scheduling Behavior (Single vs Dual NUMA Socket Allocation)
Hi Team,

We are using Kolla Ansible for our OpenStack Caracal (2024.1) deployment across two production sites. We have observed that the Nova scheduler is currently allocating virtual machines (VMs) only on a single NUMA socket of the compute host. Since our flavors do not include the property hw:numa_nodes='2', the scheduler places all vCPUs and memory for a VM on a single NUMA node, leaving the other NUMA socket(s) underutilized.

In our testing, we added the hw:numa_nodes='2' property to the flavor and confirmed that the scheduler can now distribute vCPUs across both NUMA sockets. However, this configuration led to 20–30% memory performance degradation, likely due to cross-NUMA memory access. We also noticed that once this property is set, the scheduler always attempts to spread vCPUs across two NUMA sockets — even when there are sufficient resources available on a single NUMA node.

We are looking for a way to achieve conditional NUMA placement — ideally:
- If a single NUMA node has enough capacity, schedule the VM entirely on that node.
- Otherwise, allow the scheduler to spread the VM across multiple NUMA nodes.

We would appreciate any suggestions or best practices to:
- Improve NUMA-aware scheduling efficiency,
- Better utilize both NUMA sockets without significant memory performance penalties, and
- Avoid forcing cross-NUMA allocation unless necessary.

Has anyone in the community implemented or tuned Nova in a way that balances these requirements effectively?

Thanks in advance for your insights and recommendations.

Thanks
On 17/10/2025 19:42, Vish Mudemela wrote:
Hi Team,
We are using Kolla Ansible for our OpenStack Caracal (2024.1) deployment across two production sites. We have observed that the Nova scheduler is currently allocating virtual machines (VMs) only on a single NUMA socket of the compute host.
That's not quite how this works. The scheduler only selects a host where the VM can fit; the selection of the cores and memory is done entirely by the nova-compute agent on the compute node.
Since our flavors do not include the property hw:numa_nodes='2', the scheduler places all vCPUs and memory for a VM on a single NUMA node, leaving the other NUMA socket(s) underutilized. In our testing, we added the hw:numa_nodes='2' property to the flavor and confirmed that the scheduler can now distribute vCPUs across both NUMA sockets. However, this configuration led to 20–30% memory performance degradation, likely due to cross-NUMA memory access.
So it sounds like you have not properly defined your flavors. First, if a VM ever has a NUMA topology, either via hw:numa_nodes directly or indirectly via hw:cpu_policy=dedicated, you must set hw:mem_page_size to any supported value. If you do not, the VM will be pinned to a NUMA node but NUMA-aware memory tracking will not be enabled. This will result in your VMs being OOM-killed, and it is unsupported in nova. Note that we do not support memory oversubscription when using NUMA-affined VMs; CPU can still be oversubscribed.
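As a concrete sketch (the flavor names here are just examples), the extra specs above are applied with the standard flavor commands:

```
# NUMA topology requested via hw:numa_nodes: hw:mem_page_size must be set as well
openstack flavor set m1.numa2 \
  --property hw:numa_nodes=2 \
  --property hw:mem_page_size=small

# NUMA topology implied by CPU pinning: the same rule applies
openstack flavor set m1.pinned \
  --property hw:cpu_policy=dedicated \
  --property hw:mem_page_size=small
```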
We also noticed that once this property is set, the scheduler always attempts to spread vCPUs across two NUMA sockets — even when there are sufficient resources available on a single NUMA node.
The rules for NUMA in nova are as follows:
1. Each guest virtual NUMA node will be mapped to exactly one host NUMA node.
2. Each guest virtual NUMA node will be mapped to a non-overlapping host NUMA node; that means a VM with 2 virtual NUMA nodes will always be mapped to 2 host NUMA nodes.
3. The order of guest-to-host NUMA node mapping is not guaranteed, and the mapping can change when the VM is moved.
4. If you have a passthrough PCI device and the device reports a NUMA affinity, we will require by default that it is colocated with one of the guest NUMA nodes.
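To make rules 1 and 2 concrete, here is an illustrative flavor (the name and sizes are made up) whose two virtual NUMA nodes will each be mapped to a distinct host NUMA node; the optional hw:numa_cpus.N / hw:numa_mem.N extra specs control which vCPUs and how much memory land in each virtual node:

```
# 8 vCPU / 16 GB flavor split evenly across two virtual NUMA nodes
openstack flavor create numa2.example --vcpus 8 --ram 16384 --disk 40
openstack flavor set numa2.example \
  --property hw:numa_nodes=2 \
  --property hw:mem_page_size=small \
  --property hw:numa_cpus.0=0,1,2,3 \
  --property hw:numa_cpus.1=4,5,6,7 \
  --property hw:numa_mem.0=8192 \
  --property hw:numa_mem.1=8192
```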
We are looking for a way to achieve conditional NUMA placement — ideally:
- If a single NUMA node has enough capacity, schedule the VM entirely on that node.
That is what we do by default when you enable NUMA affinity with a single virtual NUMA node, i.e. hw:numa_nodes=1 or hw:mem_page_size=small or hw:cpu_policy=dedicated. As I said above, hw:mem_page_size=<any allowed value> is always required if you are ever using any NUMA feature.

Note that you can use https://docs.openstack.org/nova/latest/configuration/config.html#compute.pac... to customize how nova selects the host NUMA node. When

```
[compute]
packing_host_numa_cells_allocation_strategy = true
```

is set, nova packs the host NUMA nodes sequentially, selecting the first NUMA node that fits the workload and filling it before using the next; when it is set to false, nova spreads the VMs across the host NUMA nodes, balancing placement based on the available resources.

Again, this only works if you properly configure NUMA by specifying hw:mem_page_size. If you just have hw:numa_nodes=1, NUMA-aware CPU and memory accounting will be disabled and the VMs will all be pinned to the first NUMA node, even if that would violate the allocation ratio, since without NUMA-aware memory allocation the scheduler only checks the global host memory capacity.
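A minimal sketch of how that could be wired up in a Kolla Ansible deployment, assuming the default node_custom_config path (/etc/kolla/config) and an example flavor name:

```
# /etc/kolla/config/nova.conf -- merged into nova.conf on the next
# "kolla-ansible reconfigure --tags nova" run
[compute]
# true  = fill one host NUMA node before using the next (pack)
# false = balance guest NUMA cells across host NUMA nodes (spread)
packing_host_numa_cells_allocation_strategy = true
```

and the single-NUMA-node flavor discussed above would look like:

```
openstack flavor set m1.numa1 \
  --property hw:numa_nodes=1 \
  --property hw:mem_page_size=small
```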
- Otherwise, allow the scheduler to spread the VM across multiple NUMA nodes.
This is explicitly not allowed for NUMA-affined VMs; however, it is the default behavior when not using any NUMA feature. Note that nova does not support mixing NUMA-aware VMs and non-NUMA-aware VMs on the same host. If you have both NUMA and non-NUMA flavors in your cloud, you need to use host aggregates or other scheduling methods to partition the cloud so that NUMA and non-NUMA VMs are never allowed on the same host.
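One example of that partitioning with host aggregates (the aggregate names, hostnames, flavor names and metadata key are illustrative, and it assumes AggregateInstanceExtraSpecsFilter has been added to [filter_scheduler]enabled_filters):

```
# hosts reserved for NUMA-affined flavors
openstack aggregate create numa-hosts
openstack aggregate set --property numa=true numa-hosts
openstack aggregate add host numa-hosts compute01

# hosts reserved for non-NUMA flavors
openstack aggregate create shared-hosts
openstack aggregate set --property numa=false shared-hosts
openstack aggregate add host shared-hosts compute02

# tie each flavor to the matching aggregate
openstack flavor set m1.numa2 --property aggregate_instance_extra_specs:numa=true
openstack flavor set m1.small --property aggregate_instance_extra_specs:numa=false
```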
We would appreciate any suggestions or best practices to: improve NUMA-aware scheduling efficiency, better utilize both NUMA sockets without significant memory performance penalties, and avoid forcing cross-NUMA allocation unless necessary.
Has anyone in the community implemented or tuned Nova in a way that balances these requirements effectively?
Thanks in advance for your insights and recommendations.
Thanks
participants (2)
- Sean Mooney
- Vish Mudemela