On 13/03/2025 04:34, Jiatong Shen wrote:
> Hello Experts,
>
> I would like to ask if there is an available implementation to achieve
> filtering or weighing compute nodes by the real memory usage.
no there is not.
this is generally not required in an openstack cloud if you use the
configuration options already available.
you can use the ram weigher today to prefer hosts with more free ram, but
it's based on the amount of ram allocated to the vms.
https://docs.openstack.org/nova/latest/configuration/config.html#filter_scheduler.ram_weight_multiplier
```
[filter_scheduler]
ram_weight_multiplier=100
```
if you set that to a large positive value the scheduler will prefer to
aggressively spread based on unallocated ram,
i.e. preferring the hosts with the most free ram.
that is the value i generally recommend tuning on most deployments to
avoid ram contention.
> From my observation, nova filters the compute-nodes by computing the
> summation of all instances' flavor ram and reserved ram.
that's a simplified view but mostly correct. we have some more advanced
placement logic with regard to numa affinity as well, but
conceptually we have architected the scheduler and the placement service
to be based on allocating resources from static inventories.
one of the most common pitfalls people hit related to OOM events is that
they enable cpu pinning without turning on the numa-aware memory
tracking in nova,
i.e. hw:cpu_policy=dedicated without hw:mem_page_size.
unless you are using file-backed memory it is always an error to turn on
cpu pinning without specifying a mem_page_size in
the flavor.
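for example, a pinned flavor should always set both extra specs together
(the flavor name here is just an illustration):
```
openstack flavor set pinned.large \
  --property hw:cpu_policy=dedicated \
  --property hw:mem_page_size=large
```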
> But since reserved ram is only an empirical value, sometimes
> compute nodes could potentially use more ram. Making things even worse
> is the ram allocation ratio could be more than 1.
setting it to more than one requires that you allocate enough swap space to
fully account for the over-allocation.
we recommend that all compute nodes always have a small amount of swap in
general, even if memory over-allocation is disabled or set to <1.0, because
not having swap changes how python/malloc allocates memory and actually
tends to increase the usage.
when over-allocation is enabled we recommend allocating
ram*allocation_ratio MB of swap.
the reason for this is that if you heavily over-allocate you need
enough ram to keep the memory allocated to the qemu processes plus the
active memory of the guests, the host os and the python agents resident.
while nova does not enforce this, strictly speaking we do not support
over-allocation without swap or file-backed memory, and i would not
deploy nova in production without at least 4G of swap.
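as a concrete sketch (the numbers are made up for illustration), on a host
with 256G of physical ram and an over-allocation ratio of 1.5 that rule of
thumb works out to roughly 384G of swap:
```
[DEFAULT]
# over-allocate ram by 50%
ram_allocation_ratio = 1.5
# rule of thumb from above: swap ~= ram * allocation_ratio
# 256G of host ram * 1.5 -> provision ~384G of swap on this node
```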
if you do not use memory over-allocation, by the way, my advice is to
always enable hugepages with hw:mem_page_size=large.
the only reason not to follow that advice is if you want to create
flavors that do not fit in a host numa node or flavors that
do not pack well. generally the performance improvement from using 2MB
hugepages outweighs that concern, but it is a valid reason not
to follow this advice. i generally advise against using 1G hugepages, to
keep live migration more reliable/efficient.
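for example (the flavor name is illustrative):
```
openstack flavor set general.large --property hw:mem_page_size=large
```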
>
> So I would like to know if there is an available implementation such
> that when the real memory available (from procfs) is low, the
> compute-node could be given a smaller weight or simply filtered out.
there is nothing to do this today.
we intentionally removed the ram filter as we now do this in placement.
while in theory the metrics weigher could be enhanced to do this, we
rejected that in the past because of the performance problems
with metrics collection when done by nova. it's not impossible, but we
have tried to move to a static inventory model.
for ram today we atomically reserve an allocation of ram for the instance
in the placement service from mostly static inventories that are reported by
the compute nodes.
we have tried to move away from approaches that look at the actual memory
free on the host, or dynamically changing values in general,
i.e. approaches where the amount of a
resource used depends on attributes of the host it lands on.
the simple fact is that the amount of memory used by an instance depends on
the version of qemu used, your kernel, whether you use ovs or a different
network backend, and other factors that are hard to account for. we tend
to lump all of that together and call it the qemu overhead, although it is
not really just the additional qemu memory allocations.
that does not mean we can't improve nova to help with this use case going
forward, but not by looking at the live data.
what i would like to do in the future is steal some not-so-terrible
ideas from how k8s solves this:
namely i want to investigate using nova or watcher to annotate compute
node resource providers with traits or something similar
that model memory_pressure, disk_pressure, cpu_pressure etc., and adding
a weigher that will avoid hosts with those traits
but not forbid their use.
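to make that idea a bit more concrete, the annotation half could in
principle reuse the existing custom trait mechanism in placement (the trait
name below is made up, nothing in nova consumes it today, and the weigher
is the part that does not exist):
```
# create a made-up custom trait and tag a compute node resource provider with it
# note: "resource provider trait set" replaces the provider's existing trait list
openstack --os-placement-api-version 1.6 trait create CUSTOM_MEMORY_PRESSURE
openstack --os-placement-api-version 1.6 resource provider trait set \
  --trait CUSTOM_MEMORY_PRESSURE <compute-node-rp-uuid>
```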
for the memory overhead use case (i.e. the unaccounted-for memory that
is used by qemu) the solution today has generally been to use
https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.reserved_host_memory_mb
to reserve enough memory based on the
maximum number of instances you expect and the overhead you measure in your
environment.
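for example (the value is purely illustrative; it should come from your own
measurements):
```
[DEFAULT]
# reserve 8G on each compute node for the host os, agents and per-vm qemu overhead
reserved_host_memory_mb = 8192
```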
instead of doing it statically, another approach might be to add a flavor
extra spec that allows multiplying flavor.ram by a constant value,
so we reserve more ram than is allocated to the guest. that is not a
perfect solution, but it would give you more granularity
to fine-tune your memory reservation based on what type of vms are
actually deployed rather than having to try and plan for the worst case.
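to sketch what that might look like (this extra spec does not exist in nova
today; the name is invented purely for illustration):
```
# hypothetical extra spec: reserve 10% more ram in placement than the guest sees
openstack flavor set general.large --property hw:mem_overhead_factor=1.1
```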
>
> Thank you.
> --
>
> Best Regards,
>
> Jiatong Shen