On 13/03/2025 04:34, Jiatong Shen wrote:
Hello Experts,
I would like to ask if there is an available implementation to achieve filtering or weighing compute nodes by the real memory usage.
No, there is not. This is generally not required in an OpenStack cloud if you use the configuration options already available. You can use the RAM weigher today to prefer hosts with more free RAM, but it is based on the amount of RAM allocated to the VMs.
https://docs.openstack.org/nova/latest/configuration/config.html#filter_sche...

```
[filter_scheduler]
ram_weight_multiplier=100
```

If you set that to a large positive value, the scheduler will aggressively spread based on unallocated RAM, i.e. it will prefer the hosts with the most free (unallocated) RAM. That is the value I generally recommend tuning on most deployments to avoid RAM contention.
From my observation, nova filters compute nodes by summing the flavor RAM of all instances plus the reserved RAM.
That is a simplified view but mostly correct. We have some more advanced placement logic with regard to NUMA affinity as well, but conceptually we have architected the scheduler and the placement service around allocating resources from static inventories. One of the most common pitfalls people hit related to OOM events is enabling CPU pinning without turning on the NUMA-aware memory tracking in nova, i.e. setting hw:cpu_policy=dedicated without hw:mem_page_size. Unless you are using file-backed memory, it is always an error to turn on CPU pinning without specifying a mem_page_size in the flavor.
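As a concrete illustration of that advice, here is a minimal sketch of a pinned flavor that also sets a page size (the flavor name and sizes are hypothetical; the extra specs are the ones discussed above):

```
# hypothetical flavor used only for illustration
openstack flavor create pinned.medium --vcpus 4 --ram 8192 --disk 20
# enable cpu pinning and, alongside it, an explicit page size so memory is tracked per NUMA node
openstack flavor set pinned.medium \
  --property hw:cpu_policy=dedicated \
  --property hw:mem_page_size=large
```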
But since reserved RAM is only an empirical value, compute nodes could sometimes use more RAM than expected. Making things even worse, the RAM allocation ratio could be more than 1.
Setting it to more than one requires that you allocate enough swap space to fully account for the over-allocation. We recommend that all compute nodes always have a small amount of swap in general, even if memory over-allocation is disabled or set to < 1.0, because not having swap changes how python/malloc allocates memory and actually tends to increase the usage. When over-allocation is enabled we recommend allocating ram*allocation_ratio MB of swap. The reason for this is that if you heavily over-allocate, you need enough RAM to keep the memory allocated to the qemu process plus the active memory of the guest, the OS and the python agents in RAM. While nova does not enforce this, strictly speaking we do not support over-allocation without swap or file-backed memory, and I would not deploy nova in production without at least 4G of swap.

If you do not use memory over-allocation, by the way, my advice is to always enable hugepages (hw:mem_page_size=large). The only reason not to follow that advice is if you want to create flavors that do not fit in a host NUMA node or flavors that do not pack well. Generally the performance improvement from using 2MB hugepages outweighs that concern, but it is a valid reason not to follow this advice. I generally advise against using 1G hugepages, to keep live migration more reliable/efficient.
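As a rough sketch of that swap-sizing guidance (the host RAM size and allocation ratio below are hypothetical values, not recommendations):

```
# Sketch only: swap sizing when memory over-allocation is enabled.
# Hypothetical host with 256 GiB of RAM and ram_allocation_ratio=1.5.
TOTAL_RAM_MB=262144
RAM_ALLOCATION_RATIO=1.5
# Per the guidance above: swap ~= ram * allocation_ratio MB when over-allocating.
SWAP_MB=$(awk "BEGIN {print int($TOTAL_RAM_MB * $RAM_ALLOCATION_RATIO)}")
echo "Provision ${SWAP_MB} MB of swap on this compute node"
```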
So I would like to know if there is an available implementation such that when the real memory available (from procfs) is low, the compute node could be given a smaller weight or simply filtered out.
There is nothing to do this today. We intentionally removed the RAM filter, as we now do this in placement. While in theory the metrics weigher could be enhanced to do this, we rejected that in the past because of the performance problems with metrics collection when done by nova. It is not impossible, but we have tried to move to a static inventory model. For RAM today, we atomically reserve an allocation of RAM for the instance in the placement service from mostly static inventories that are reported by the compute nodes.

We have tried to move away from approaches that look at the actual memory free on the host, or dynamically changing values in general, i.e. approaches where the amount of a resource used depends on attributes of the host it lands on. The simple fact is that the amount of memory used by an instance depends on the version of qemu used, your kernel, whether you use OVS or a different network backend, and other factors that are hard to account for. We tend to lump all of that together and call it the qemu overhead, although it is not really just the additional qemu memory allocations.

That does not mean we cannot improve nova to help with this use case going forward, just not by looking at the live data. What I would like to do in the future is steal some not-so-terrible ideas from how k8s solves this: namely, I want to investigate using nova or watcher to annotate compute node resource providers with traits or something similar that model memory_pressure, disk_pressure, cpu_pressure etc., and adding a weigher that will avoid hosts with those traits but not forbid their use.

For the memory overhead use case (i.e. the unaccounted-for memory that is used by qemu), the solution today has generally been to use https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.res... to reserve enough memory based on the maximum number of instances you expect, measuring the overhead in your environment. Instead of doing it statically, another approach might be to add a flavor extra spec that allows multiplying flavor.ram by a constant value, so we reserve more RAM than is allocated to the guest. That is not a perfect solution, but it would give you more granularity to fine-tune your memory reservation based on what type of VMs are actually deployed, rather than having to try and plan for the worst case.
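To illustrate the static reservation approach, here is a sketch of how reserved_host_memory_mb might be sized; the per-instance overhead, instance count, and base reservation below are made-up figures that you would need to measure in your own environment:

```
[DEFAULT]
# reserved_host_memory_mb ~= (host OS + agents) + (max expected instances * measured per-instance overhead)
# e.g. 4096 MB + 40 instances * 100 MB = 8096 MB (illustrative figures only)
reserved_host_memory_mb = 8096
```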
Thank you. --
Best Regards,
Jiatong Shen