On 13/03/2025 04:34, Jiatong Shen wrote:
Hello Experts,
I would like to ask if there is an available implementation to achieve filtering or weighing compute nodes by the real memory usage.
No, there is not. This is generally not required in an OpenStack cloud if you use the configuration options already available. You can use the RAM weigher today to prefer hosts with more free RAM, but it is based on the amount of RAM allocated to the VMs.
https://docs.openstack.org/nova/latest/configuration/config.html#filter_sche...

```
[filter_scheduler]
ram_weight_multiplier=100
```

If you set that to a large positive value, the scheduler will aggressively spread based on unallocated RAM, i.e. it will prefer the hosts with the most free (unallocated) RAM. That is the value I generally recommend tuning on most deployments to avoid RAM contention.
From my observation, nova filters compute nodes by summing the flavor RAM of all instances plus the reserved RAM.
That is a simplified view but mostly correct. We have some more advanced placement logic with regard to NUMA affinity as well, but conceptually we have architected the scheduler and the placement service around allocating resources from static inventories. One of the most common pitfalls people hit related to OOM events is enabling CPU pinning without turning on the NUMA-aware memory tracking in nova, i.e. setting hw:cpu_policy=dedicated without hw:mem_page_size. Unless you are using file-backed memory, it is always an error to turn on CPU pinning without specifying a mem_page_size in the flavor.
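As a concrete illustration of that advice, here is a minimal sketch of a pinned flavor that also sets a page size (the flavor name and sizes are hypothetical; the extra specs are the ones discussed above):

```
# hypothetical flavor used only for illustration
openstack flavor create pinned.medium --vcpus 4 --ram 8192 --disk 20
# enable cpu pinning and, alongside it, an explicit page size so memory is tracked per NUMA node
openstack flavor set pinned.medium \
  --property hw:cpu_policy=dedicated \
  --property hw:mem_page_size=large
```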
But since reserved RAM is only an empirical value, compute nodes could sometimes use more RAM than expected. Making things even worse, the RAM allocation ratio could be more than 1.
Setting it to more than one requires that you allocate enough swap space to fully account for the over-allocation. We recommend that all compute nodes always have a small amount of swap in general, even if memory over-allocation is disabled or set to < 1.0, because not having swap changes how python/malloc allocates memory and actually tends to increase the usage. When over-allocation is enabled we recommend allocating ram*allocation_ratio MB of swap. The reason for this is that if you heavily over-allocate, you need enough RAM to keep the memory allocated to the qemu process plus the active memory of the guest, the OS and the python agents in RAM. While nova does not enforce this, strictly speaking we do not support over-allocation without swap or file-backed memory, and I would not deploy nova in production without at least 4G of swap.

If you do not use memory over-allocation, by the way, my advice is to always enable hugepages (hw:mem_page_size=large). The only reason not to follow that advice is if you want to create flavors that do not fit in a host NUMA node or flavors that do not pack well. Generally the performance improvement from using 2MB hugepages outweighs that concern, but it is a valid reason not to follow this advice. I generally advise against using 1G hugepages, to keep live migration more reliable/efficient.
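As a rough sketch of that swap-sizing guidance (the host RAM size and allocation ratio below are hypothetical values, not recommendations):

```
# Sketch only: swap sizing when memory over-allocation is enabled.
# Hypothetical host with 256 GiB of RAM and ram_allocation_ratio=1.5.
TOTAL_RAM_MB=262144
RAM_ALLOCATION_RATIO=1.5
# Per the guidance above: swap ~= ram * allocation_ratio MB when over-allocating.
SWAP_MB=$(awk "BEGIN {print int($TOTAL_RAM_MB * $RAM_ALLOCATION_RATIO)}")
echo "Provision ${SWAP_MB} MB of swap on this compute node"
```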
So I would like to know if there is an available implementation such that when the real memory available (from procfs) is low, the compute node could be given a smaller weight or simply filtered out.
There is nothing to do this today. We intentionally removed the RAM filter, as we now do this in placement. While in theory the metrics weigher could be enhanced to do this, we rejected that in the past because of the performance problems with metrics collection when done by nova. It is not impossible, but we have tried to move to a static inventory model. For RAM today, we atomically reserve an allocation of RAM for the instance in the placement service from mostly static inventories that are reported by the compute nodes.

We have tried to move away from approaches that look at the actual memory free on the host, or dynamically changing values in general, i.e. approaches where the amount of a resource used depends on attributes of the host it lands on. The simple fact is that the amount of memory used by an instance depends on the version of qemu used, your kernel, whether you use OVS or a different network backend, and other factors that are hard to account for. We tend to lump all of that together and call it the qemu overhead, although it is not really just the additional qemu memory allocations.

That does not mean we cannot improve nova to help with this use case going forward, just not by looking at the live data. What I would like to do in the future is steal some not-so-terrible ideas from how k8s solves this: namely, I want to investigate using nova or watcher to annotate compute node resource providers with traits or something similar that model memory_pressure, disk_pressure, cpu_pressure etc., and adding a weigher that will avoid hosts with those traits but not forbid their use.

For the memory overhead use case (i.e. the unaccounted-for memory that is used by qemu), the solution today has generally been to use https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.res... to reserve enough memory based on the maximum number of instances you expect, measuring the overhead in your environment. Instead of doing it statically, another approach might be to add a flavor extra spec that allows multiplying flavor.ram by a constant value, so we reserve more RAM than is allocated to the guest. That is not a perfect solution, but it would give you more granularity to fine-tune your memory reservation based on what type of VMs are actually deployed, rather than having to try and plan for the worst case.
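To illustrate the static reservation approach, here is a sketch of how reserved_host_memory_mb might be sized; the per-instance overhead, instance count, and base reservation below are made-up figures that you would need to measure in your own environment:

```
[DEFAULT]
# reserved_host_memory_mb ~= (host OS + agents) + (max expected instances * measured per-instance overhead)
# e.g. 4096 MB + 40 instances * 100 MB = 8096 MB (illustrative figures only)
reserved_host_memory_mb = 8096
```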
Thank you. --
Best Regards,
Jiatong Shen