[nova] hw:numa_nodes question
Is there any concern to enable 'hw:numa_nodes=1' on all flavors, as long as that flavor can fit into one numa node?
I don't think so. ~~~ The most common case will be that the admin only sets hw:numa_nodes and then the flavor vCPUs and memory will be divided equally across the NUMA nodes. When a NUMA policy is in effect, it is mandatory for the instance's memory allocations to come from the NUMA nodes to which it is bound except where overriden by hw:numa_mem.NN. ~~~ Here are the implementation documents since Juno release: https://opendev.org/openstack/nova-specs/src/branch/master/specs/juno/implem... https://opendev.org/openstack/nova-specs/commit/45252df4c54674d2ac71cd88154a... ? On Wed, May 10, 2023 at 11:31 AM hai wu <haiwu.us@gmail.com> wrote:
Is there any concern to enable 'hw:numa_nodes=1' on all flavors, as long as that flavor can fit into one numa node?
-- Alvaro Soto *Note: My work hours may not be your work hours. Please do not feel the need to respond during a time that is not convenient for you.* ---------------------------------------------------------- Great people talk about ideas, ordinary people talk about things, small people talk... about other people.
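For reference, the extra spec under discussion is an ordinary flavor property set with the OpenStack client; a minimal sketch, using an illustrative flavor name, looks like this:

    openstack flavor create --vcpus 4 --ram 8192 --disk 40 m1.numa
    openstack flavor set --property hw:numa_nodes=1 m1.numa

With hw:numa_nodes=1 the instance gets a single-NUMA-node guest topology and is pinned to one host NUMA node, subject to the memory-accounting caveats discussed later in this thread.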
Another good resource =)

https://that.guru/blog/cpu-resources/

On Wed, May 10, 2023 at 11:50 AM Alvaro Soto <alsotoes@gmail.com> wrote:
> [...]
If you set hw:numa_nodes there are two things you should keep in mind.

First, if hw:numa_nodes is set to any value, including hw:numa_nodes=1, then hw:mem_page_size should also be defined on the flavor. If you don't set hw:mem_page_size then the VM will be pinned to a host NUMA node, but the available memory on that host NUMA node will not be taken into account, only the total free memory on the host, so this almost always results in VMs being killed by the OOM reaper in the kernel.

I recommend setting hw:mem_page_size=small, hw:mem_page_size=large or hw:mem_page_size=any. small will use your kernel's default page size for guest memory, typically 4k pages; large will use any page size other than the smallest that is available (i.e. it will use hugepages); and any will use small pages but allow the guest to request hugepages via the hw_page_size image property. hw:mem_page_size=any is the most flexible as a result, but generally I recommend using hw:mem_page_size=small and having a separate flavor for hugepages. It's really up to you.

The second thing to keep in mind is that using explicit NUMA topologies, including hw:numa_nodes=1, disables memory oversubscription, so you will not be able to oversubscribe the memory on the host. In general it's better to avoid memory oversubscription anyway, but keep that in mind: you can't just allocate a bunch of swap space and run VMs at a 2:1 or higher memory oversubscription ratio if you are using NUMA affinity.

https://that.guru/blog/the-numa-scheduling-story-in-nova/ and https://that.guru/blog/cpu-resources-redux/ are also good to read.

I don't think Stephen has a dedicated blog post on the memory aspect, but https://bugs.launchpad.net/nova/+bug/1893121 covers some of the problems that only setting hw:numa_nodes=1 will cause.

If you have VMs with hw:numa_nodes=1 set and you do not have hw:mem_page_size set in the flavor or hw_mem_page_size set in the image, then that VM is not configured properly.

On Wed, 2023-05-10 at 11:52 -0600, Alvaro Soto wrote:
> [...]
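Following that recommendation, a sketch of the two flavor variants described above (flavor names are illustrative):

    # small pages: guest memory uses the kernel's default page size
    openstack flavor set --property hw:numa_nodes=1 --property hw:mem_page_size=small m1.numa
    # hugepage-backed guests kept on a separate flavor
    openstack flavor set --property hw:numa_nodes=1 --property hw:mem_page_size=large m1.numa.hugepages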
So there's no default value assumed/set for hw:mem_page_size for each flavor?

Yes, https://bugs.launchpad.net/nova/+bug/1893121 is critical when using hw:numa_nodes=1.

I did not hit an issue with 'hw:mem_page_size' not set; maybe I am missing some known test cases? It would be very helpful to have a test case where I could reproduce this issue with 'hw:numa_nodes=1' set but without 'hw:mem_page_size' set.

How do we ensure this for existing VMs already running with 'hw:numa_nodes=1' but without 'hw:mem_page_size' set?

On Wed, May 10, 2023 at 1:47 PM Sean Mooney <smooney@redhat.com> wrote:
> [...]
On Wed, 2023-05-10 at 14:22 -0500, hai wu wrote:
> So there's no default value assumed/set for hw:mem_page_size for each flavor?
Correct, this is a known edge case in the current design. hw:mem_page_size=any would be a reasonable default, but technically if you just set hw:numa_nodes=1 nova allows memory oversubscription; in practice, if you try to do that you will almost always end up with VMs being killed due to OOM events. So from an API point of view it would be a change of behaviour for us to default to hw:mem_page_size=any, but I think it would be the correct thing to do for operators in the long run. I could bring this up with the core team again, but in the past we decided to be conservative and just warn people to always set hw:mem_page_size if using NUMA affinity.
> Yes https://bugs.launchpad.net/nova/+bug/1893121 is critical when using hw:numa_nodes=1.
>
> I did not hit an issue with 'hw:mem_page_size' not set, maybe I am missing some known test cases? It would be very helpful to have a test case where I could reproduce this issue with 'hw:numa_nodes=1' being set, but without 'hw:mem_page_size' being set.
> How to ensure this one for existing vms already running with 'hw:numa_nodes=1', but without 'hw:mem_page_size' being set?

You unfortunately need to resize the instance. There are some image properties you can set on an instance via nova-manage, but you cannot use nova-manage to update the embedded flavor and set this, so you need to define a new flavour and resize.

This is the main reason we have not changed the default: it may require you to move instances around if their placement is now invalid, now that per-NUMA-node memory allocations are correctly being accounted for. If it were simple to change the default without any end-user or operator impact, we would.
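As an illustration of that workflow (flavor and server names are hypothetical):

    # define a corrected flavor that includes the page-size property
    openstack flavor create --vcpus 4 --ram 8192 --disk 40 m1.numa.fixed
    openstack flavor set --property hw:numa_nodes=1 --property hw:mem_page_size=small m1.numa.fixed
    # resize the affected instance onto it, then confirm once the guest checks out
    openstack server resize --flavor m1.numa.fixed my-instance
    openstack server resize confirm my-instance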
Is it possible to update something in the OpenStack database for the relevant VMs in order to do the same, and then hard reboot the VM so that the VM would have this attribute?

On Wed, May 10, 2023 at 2:47 PM Sean Mooney <smooney@redhat.com> wrote:
> [...]
On Wed, 2023-05-10 at 15:06 -0500, hai wu wrote:
> Is it possible to update something in the OpenStack database for the relevant VMs in order to do the same, and then hard reboot the VM so that the VM would have this attribute?

Not really. Adding the missing hw:mem_page_size requirement to the flavor changes the requirements for node placement and NUMA affinity, so you really can only change this via resizing the VM to a new flavor.
Ok. Then I don't understand why 'hw:mem_page_size' is not made the default when hw:numa_node is set. There is a huge disadvantage in not having this one set (all existing VMs with hw:numa_node set will have to be taken down for resizing in order to get this right).

I could not find this point mentioned in any existing OpenStack documentation: that we would have to set hw:mem_page_size explicitly if hw:numa_node is set. Also, this slide at https://www.linux-kvm.org/images/0/0b/03x03-Openstackpdf.pdf kind of indicates that hw:mem_page_size `Default to small pages`.

Another question: Let's say a VM runs on one host's NUMA node #0. If we live-migrate this VM to another host, and that host's NUMA node #1 has more free memory, is it possible for this VM to land on the other host's NUMA node #1?

On Thu, May 11, 2023 at 4:25 AM Sean Mooney <smooney@redhat.com> wrote:
> [...]
On Thu, 2023-05-11 at 08:40 -0500, hai wu wrote:
> Ok. Then I don't understand why 'hw:mem_page_size' is not made the default when hw:numa_node is set. There is a huge disadvantage in not having this one set (all existing VMs with hw:numa_node set will have to be taken down for resizing in order to get this right).

There is an upgrade impact to changing the default. It's not impossible to do, but it's complicated: if we don't want to break existing deployments, we would need to record a value for every current instance that was spawned before this default was changed and that had hw:numa_node without hw:mem_page_size, so that they keep the old behaviour, and make sure that value is cleared when the VM is next moved so it can get the new default after a live migration.
> I could not find this point mentioned in any existing OpenStack documentation: that we would have to set hw:mem_page_size explicitly if hw:numa_node is set. Also, this slide at https://www.linux-kvm.org/images/0/0b/03x03-Openstackpdf.pdf kind of indicates that hw:mem_page_size `Default to small pages`.
It defaults to unset. That results in small pages by default, but it's not the same as hw:mem_page_size=small or hw:mem_page_size=any.
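A quick way to see what a flavor or image actually carries (names are hypothetical):

    openstack flavor show m1.numa -c properties
    openstack image show my-image -c properties

If neither output contains hw:mem_page_size / hw_mem_page_size, the instance falls into the "unset" case described above.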
> Another question: Let's say a VM runs on one host's numa node #0. If we live-migrate this VM to another host, and that host's numa node #1 has more free memory, is it possible for this VM to land on the other host's numa node #1?
Yes, it is. On newer releases we will prefer to balance the load across NUMA nodes; on older releases nova would fill the first NUMA node and then move to the second.
> Another question: Let's say a VM runs on one host's numa node #0. If we live-migrate this VM to another host, and that host's numa node #1 has more free memory, is it possible for this VM to land on the other host's numa node #1?
>
> Yes, it is. On newer releases we will prefer to balance the load across NUMA nodes; on older releases nova would fill the first NUMA node and then move to the second.
About the above point: it seems that even with the NUMA patch backported and in place, the VM stays stuck in its existing NUMA node. Per my tests, after its live migration the VM will end up on the other host's NUMA node #0, even if NUMA node #1 has more free memory. This is not the case for newly built VMs.

Is this a design issue?

On Thu, May 11, 2023 at 2:42 PM Sean Mooney <smooney@redhat.com> wrote:
> [...]
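Relatedly, one rough way to spot-check which host NUMA node a guest actually landed on is to look at the libvirt domain XML on the compute node; the instance name below is hypothetical and the exact elements present depend on the flavor:

    virsh dumpxml instance-0000abcd | grep -E -A3 'numatune|vcpupin'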
On Mon, 2023-05-15 at 13:03 -0500, hai wu wrote:
> About the above point, it seems even with the numa patch back ported and in place, the VM would be stuck in its existing numa node. Per my tests, after its live migration, the VM will end up on the other host's numa node #0, even if numa node #1 has more free memory. This is not the case for newly built VMs.
>
> Is this a design issue?

If you are using a release that supports NUMA live migration (Train+), https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/num... then the NUMA affinity is recalculated on live migration; however, NUMA node 0 is preferred.

As of Xena, [compute]/packing_host_numa_cells_allocation_strategy has been added to control how VMs are balanced across NUMA nodes. In Zed the default was changed from packing VMs per host NUMA node to balancing VMs between host NUMA nodes: https://docs.openstack.org/releasenotes/nova/zed.html#relnotes-26-0-0-stable...

Even without the enhancements in Xena and Zed it was possible for the scheduler to select a NUMA node, but if you don't enable memory- or CPU-aware NUMA placement with hw:mem_page_size or hw:cpu_policy=dedicated then it will always select NUMA node 0. If you do not request CPU pinning or a specific page size, the scheduler can't properly select the host NUMA node and will always use NUMA node 0. That is one of the reasons I said that if hw:numa_nodes is set then hw:mem_page_size should be set.

From a nova point of view, using numa_nodes without mem_page_size is logically incorrect: you asked for a VM to be affinitised to n host NUMA nodes but did not enable NUMA-aware memory scheduling. We unfortunately can't prevent this in the nova API without breaking upgrades for everyone who has made this mistake; we would need to force them to resize all affected instances, which means guest downtime. The other issue is that multiple NUMA nodes are supported by Hyper-V, but it does not support mem_page_size.

We have tried to document this in the past but never agreed on how, because it is subtle and requires a lot of context. The tl;dr is that if the instance has a NUMA topology it should have mem_page_size set in the image or flavor, but we never found a good place to capture that.
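For reference, a sketch of how that behaviour is controlled in nova.conf on the compute nodes (the option is the one named above; shown with the balancing behaviour that became the default in Zed):

    [compute]
    # False = spread instances across host NUMA nodes (the Zed default)
    # True  = pack instances onto the lowest-numbered NUMA nodes first
    packing_host_numa_cells_allocation_strategy = False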
This patch was backported: https://review.opendev.org/c/openstack/nova/+/805649. Once this is in place, new VMs always get assigned correctly to the NUMA node with more free memory. But when existing VMs (created from a flavor with hw:numa_node=1 set) that were already running on NUMA node #0 get live migrated, they always end up stuck on NUMA node #0 after the live migration.

So it seems we would also need to set hw:mem_page_size=small on the VM flavor, so that new VMs created from that flavor would be able to land on a NUMA node other than node #0 after live migration?

On Mon, May 15, 2023 at 2:33 PM Sean Mooney <smooney@redhat.com> wrote:
On Mon, 2023-05-15 at 13:03 -0500, hai wu wrote:

Another question: Let's say a VM runs on one host's NUMA node #0. If we live-migrate this VM to another host, and that host's NUMA node #1 has more free memory, is it possible for this VM to land on the other host's NUMA node #1?

Yes it is. On newer releases we will prefer to balance the load across NUMA nodes; on older releases nova would fill the first NUMA node and then move to the second.

About the above point, it seems even with the numa patch backported and in place, the VM would be stuck on its existing NUMA node. Per my tests, after its live migration, the VM will end up on the other host's NUMA node #0, even if NUMA node #1 has more free memory. This is not the case for newly built VMs.

Is this a design issue?

If you are using a release that supports NUMA live migration (Train+), https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/num... , then the NUMA affinity is recalculated on live migration; however, NUMA node 0 is preferred.

As of Xena, [compute]/packing_host_numa_cells_allocation_strategy has been added to control how VMs are balanced across NUMA nodes. In Zed the default was changed from packing VMs per host NUMA node to balancing VMs between host NUMA nodes: https://docs.openstack.org/releasenotes/nova/zed.html#relnotes-26-0-0-stable...

Even without the enhancement in Xena and Zed it was possible for the scheduler to select a NUMA node, but if you don't enable memory- or CPU-aware NUMA placement with hw:mem_page_size or hw:cpu_policy=dedicated then it will always select NUMA node 0.

If you do not request CPU pinning or a specific page size, the scheduler cannot properly select the host NUMA node and will always use NUMA node 0. That is one of the reasons I said that if hw:numa_nodes is set then hw:mem_page_size should be set.

From a nova point of view, using numa_nodes without mem_page_size is logically incorrect: you asked for a VM to be affined to n host NUMA nodes but did not enable NUMA-aware memory scheduling.

We unfortunately can't prevent this in the nova API without breaking upgrades for everyone who has made this mistake; we would need to force them to resize all affected instances, which means guest downtime. The other issue is that multiple NUMA nodes are supported by Hyper-V, but it does not support mem_page_size.

We have tried to document this in the past but never agreed on how, because it is subtle and requires a lot of context. The tl;dr is: if the instance has a NUMA topology, it should have mem_page_size set in the image or flavor, but we never found a good place to capture that.
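As a concrete sketch of that tl;dr, assuming the standard openstack CLI (the flavor and image names below are only placeholders, not anything from this thread), the page-size requirement can be attached to either the flavor or the image:

~~~
# Flavor side: pair hw:numa_nodes with an explicit page-size policy.
openstack flavor set \
    --property hw:numa_nodes=1 \
    --property hw:mem_page_size=small \
    m1.large.numa

# Image side: the equivalent image property uses an underscore prefix.
openstack image set \
    --property hw_mem_page_size=small \
    my-guest-image
~~~

Note that, as discussed above, changing the flavor only affects instances created or resized after the change; it does not retrofit VMs that are already running.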
On Thu, May 11, 2023 at 2:42 PM Sean Mooney <smooney@redhat.com> wrote:
On Thu, 2023-05-11 at 08:40 -0500, hai wu wrote:

Ok. Then I don't understand why 'hw:mem_page_size' is not made the default in the case where hw:numa_node is set. There is a huge disadvantage to not having it set (all existing VMs with hw:numa_node set will have to be taken down for resizing in order to get this right).

There is an upgrade impact to changing the default. It's not impossible to do, but it's complicated. If we don't want to break existing deployments, we would need to record a value for every current instance that was spawned before the default was changed and that had hw:numa_node without hw:mem_page_size, so they keep the old behavior, and make sure that value is cleared when the VM is next moved so it can get the new default after a live migration.
I could not find this point mentioned in any existing OpenStack documentation: that we would have to set hw:mem_page_size explicitly if hw:numa_node is set. Also, this slide at https://www.linux-kvm.org/images/0/0b/03x03-Openstackpdf.pdf kind of indicates that hw:mem_page_size `Default[s] to small pages`.
It defaults to unset. That results in small pages by default, but it is not the same as hw:mem_page_size=small or hw:mem_page_size=any.
Another question: Let's say a VM runs on one host's NUMA node #0. If we live-migrate this VM to another host, and that host's NUMA node #1 has more free memory, is it possible for this VM to land on the other host's NUMA node #1?

Yes it is. On newer releases we will prefer to balance the load across NUMA nodes; on older releases nova would fill the first NUMA node and then move to the second.
On Thu, May 11, 2023 at 4:25 AM Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2023-05-10 at 15:06 -0500, hai wu wrote:
Is it possible to update something in the OpenStack database for the relevant VMs in order to do the same, and then hard reboot the VM so that the VM would have this attribute?

Not really. Adding the missing hw:mem_page_size requirement to the flavor changes the requirements for node placement and NUMA affinity, so you really can only change this by resizing the VM to a new flavor.
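A rough sketch of that resize path with the standard openstack CLI (the flavor name, sizes, and server name here are invented for illustration):

~~~
# Create a replacement flavor that carries the NUMA-aware memory request.
openstack flavor create \
    --vcpus 4 --ram 8192 --disk 40 \
    --property hw:numa_nodes=1 \
    --property hw:mem_page_size=small \
    m1.large.numa.small

# Resize the affected instance onto it, then confirm once it looks healthy.
openstack server resize --flavor m1.large.numa.small my-instance
openstack server resize confirm my-instance   # older clients: openstack server resize --confirm my-instance
~~~

The resize implies a guest reboot, which is exactly the downtime trade-off discussed above.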
This backport has been without review for six months, since Sean Mooney gave it +2. The next rebase, needed to resolve a merge conflict, cleared the +2 from the review.

On Mon, May 15, 2023 at 10:53 PM hai wu <haiwu.us@gmail.com> wrote:
-- Regards, Maksim Malchuk
On Mon, 2023-05-15 at 23:07 +0300, Maksim Malchuk wrote:
This backport has been without review for six months, since Sean Mooney gave it +2. The next rebase, needed to resolve a merge conflict, cleared the +2 from the review.
Yes, it was blocked on a question regarding whether this conforms to the stable backport policy. We do not backport features, and while this was considered a bug fix on master, it was also acknowledged that it is a little feature-ish. We discussed this last week in the nova team meeting and agreed it could proceed.

But as I noted in my last reply, this will have no effect if you just have hw:numa_nodes=1 without hw:cpu_policy=dedicated or hw:mem_page_size. Without enabling CPU pinning or an explicit page size we do not track per-NUMA-node CPU or memory usage in the host NUMA topology object for a given compute node. As such, without any usage information there is nothing to weigh the NUMA nodes with. So packing_host_numa_cells_allocation_strategy=false will not make VMs that request a NUMA topology without NUMA resources be balanced between the NUMA nodes. You still need to resize the instance to a flavor that actually requests memory page sizes or CPU pinning.
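For reference, this is roughly what the knob being discussed looks like in nova.conf on the compute nodes (a sketch only; check the release notes linked earlier for the exact default on your release):

~~~
[compute]
# True  = pack VMs onto already-used host NUMA cells first (the pre-Zed default)
# False = spread VMs between host NUMA cells (the default from Zed on)
# As noted above, it only influences placement for instances whose flavor or
# image requests CPU pinning or an explicit page size, since otherwise no
# per-NUMA-cell usage is recorded to weigh the cells with.
packing_host_numa_cells_allocation_strategy = False
~~~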
Good news, we have been waiting for it in Xena for almost a year.

On Mon, May 15, 2023 at 11:27 PM Sean Mooney <smooney@redhat.com> wrote:
-- Regards, Maksim Malchuk
On Mon, 2023-05-15 at 14:46 -0500, hai wu wrote:
This patch was backported: https://review.opendev.org/c/openstack/nova/+/805649. Once this is in place, new VMs always get assigned correctly to the NUMA node with more free memory. But when existing VMs (created from a flavor with hw:numa_node=1 set) already running on NUMA node #0 got live migrated, they would always be stuck on NUMA node #0 after live migration.

If the VM only has hw:numa_node=1, https://review.opendev.org/c/openstack/nova/+/805649 won't help, because we never claim any mempages or CPUs in the host NUMA topology blob. As such, the sorting based on usage to balance the nodes won't work, since there is never any usage recorded for VMs with just hw:numa_node=1 and nothing else set.
So it seems we would also need to set hw:mem_page_size=small on the VM flavor, so that new VMs created from that flavor would be able to land on a NUMA node other than node #0 after live migration?

Yes. Again, because without mem_page_size there is no usage recorded in the host NUMA topology blob, as far as the scheduler/resource tracker is concerned all NUMA nodes are equally used, so it will always select NUMA 0 by default since the scheduling algorithm is deterministic.
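If you want to confirm on the host which NUMA cell a given guest actually ended up bound to, something along these lines can help (a hedged sketch: the libvirt domain name and the exact XML elements will vary with your deployment and libvirt version):

~~~
# On the compute node, inspect the guest's NUMA memory binding in libvirt.
# "instance-0000abcd" is an illustrative domain name.
virsh dumpxml instance-0000abcd | grep -A2 '<numatune>'
# Expect something like <memory mode='strict' nodeset='0'/> for a guest
# that was placed on host NUMA node 0.
~~~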
Hmm, regarding this: `if the VM only has hw:numa_node=1, https://review.opendev.org/c/openstack/nova/+/805649 won't help`. Per my numerous recent tests, if the VM only has hw:numa_node=1, https://review.opendev.org/c/openstack/nova/+/805649 does actually help, but only for newly built VMs; for those it works pretty well. On Mon, May 15, 2023 at 3:21 PM Sean Mooney <smooney@redhat.com> wrote:
On Mon, 2023-05-15 at 14:46 -0500, hai wu wrote:
This patch was backported: https://review.opendev.org/c/openstack/nova/+/805649. Once this is in place, new VMs always get assigned correctly to the NUMA node with more free memory. But when existing VMs (created with a flavor with hw:numa_node=1 set) already running on NUMA node #0 get live migrated, they always end up stuck on NUMA node #0 after the live migration.

If the VM only has hw:numa_node=1, https://review.opendev.org/c/openstack/nova/+/805649 won't help, because we never claim any mempages or CPUs in the host NUMA topology blob. As such, the sorting based on usage to balance the nodes won't work, since there is never any usage recorded for VMs with just hw:numa_node=1 and nothing else set.
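(If you want to see what the resource tracker has actually recorded, the serialized NUMA topology/usage blob lives in the cell database. A read-only peek, assuming a MySQL-backed deployment and the default 'nova' database name; adjust credentials and database name for your install:)

~~~
# show the per-host NUMA topology/usage blob the scheduler relies on
mysql -e "SELECT hypervisor_hostname, numa_topology FROM nova.compute_nodes WHERE deleted = 0\G"
~~~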
So it seems we would also need to set hw:mem_page_size=small on the VM flavor, so that new VMs created from that flavor would be able to land on a NUMA node other than node #0 after a live migration?
Yes. Again, because without mem_page_size there is no usage in the host NUMA topology blob, as far as the scheduler/resource tracker is concerned all NUMA nodes are equally used, so it will always select NUMA node 0 by default, since the scheduling algorithm is deterministic.
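(For reference, a minimal sketch of a flavor that requests single-NUMA affinity together with NUMA-aware memory accounting; the flavor name and sizes are placeholders, only the two properties matter here:)

~~~
openstack flavor create numa1.medium \
  --vcpus 4 --ram 8192 --disk 40 \
  --property hw:numa_nodes=1 \
  --property hw:mem_page_size=small
~~~

With both properties set, per-NUMA-node memory usage is claimed and tracked, so the scheduler can actually compare host NUMA nodes instead of deterministically falling back to node 0.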
On Mon, May 15, 2023 at 2:33 PM Sean Mooney <smooney@redhat.com> wrote:
Another question: Let's say a VM runs on one host's numa node #0. If we live-migrate this VM to another host, and that host's numa node #1 has more free memory, is it possible for this VM to land on the other host's numa node #1?

Yes, it is. On newer releases we prefer to balance the load across NUMA nodes; on older releases Nova would fill the first NUMA node and then move on to the second.
About the above point: it seems even with the NUMA patch backported and in place, the VM stays stuck on its existing NUMA node. Per my tests, after its live migration the VM ends up on the other host's NUMA node #0, even if NUMA node #1 has more free memory. This is not the case for newly built VMs. Is this a design issue?

If you are using a release that supports NUMA live migration (Train+), https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/num... then the NUMA affinity is recalculated on live migration; however, NUMA node 0 is preferred.
As of Xena, [compute]/packing_host_numa_cells_allocation_strategy has been added to control how VMs are balanced across NUMA nodes. In Zed the default was changed from packing VMs per host NUMA node to balancing VMs between host NUMA nodes: https://docs.openstack.org/releasenotes/nova/zed.html#relnotes-26-0-0-stable...
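(For reference, that option lives in the [compute] section of nova.conf on the compute hosts. A minimal excerpt; whether you want packing or spreading depends on your workload:)

~~~
[compute]
# True: fill one host NUMA node before using the next (the pre-Zed default)
# False: spread instances across host NUMA nodes (the default from Zed on)
packing_host_numa_cells_allocation_strategy = False
~~~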
Even without the enhancements in Xena and Zed it was possible for the scheduler to select a NUMA node other than node 0, but if you don't enable memory- or CPU-aware NUMA placement with hw:mem_page_size or hw:cpu_policy=dedicated then it will always select NUMA node 0.

If you do not request CPU pinning or a specific page size, the scheduler can't properly select the host NUMA node and will always use NUMA node 0. That is one of the reasons I said that if hw:numa_nodes is set then hw:mem_page_size should be set.

From a Nova point of view, using numa_nodes without mem_page_size is logically incorrect: you asked for the VM to be affinitized to n host NUMA nodes but did not enable NUMA-aware memory scheduling.
We unfortunately can't prevent this in the Nova API without breaking upgrades for everyone who has made this mistake; we would need to force them to resize all affected instances, which means guest downtime. The other issue is that multiple NUMA nodes are supported by Hyper-V, but it does not support mem_page_size.

We have tried to document this in the past but never agreed on how, because it is subtle and requires a lot of context. The tl;dr is: if the instance has a NUMA topology, it should have mem_page_size set in the image or flavor, but we never found a good place to capture that.
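(The image-side equivalent, for cases where the page size is driven from Glance rather than the flavor; the image name is a placeholder:)

~~~
# request small pages via the image property instead of the flavor extra spec
openstack image set --property hw_mem_page_size=small my-guest-image
~~~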
On Thu, May 11, 2023 at 2:42 PM Sean Mooney <smooney@redhat.com> wrote:
On Thu, 2023-05-11 at 08:40 -0500, hai wu wrote:

Ok. Then I don't understand why 'hw:mem_page_size' is not made the default in case hw:numa_node is set. There is a huge disadvantage in not having this one set (all existing VMs with hw:numa_node set will have to be taken down for resizing in order to get this one right).

There is an upgrade impact to changing the default. It is not impossible to do, but it is complicated. If we don't want to break existing deployments, we would need to record a value for every current instance that was spawned before the default was changed and that had hw:numa_node without hw:mem_page_size, so they keep the old behavior, and make sure that is cleared when the VM is next moved so it can get the new default after a live migration.
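(For operators hitting this today, a rough sketch of the resize path described above, using a flavor like the one sketched earlier; the instance and flavor names are placeholders, and the exact confirm syntax varies with the client version:)

~~~
# move the instance onto a flavor that sets hw:mem_page_size, then
# confirm once it reaches VERIFY_RESIZE (older clients use --confirm)
openstack server resize --flavor numa1.medium my-instance
openstack server resize confirm my-instance
~~~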
I could not find this point mentioned in any existing OpenStack documentation: that we would have to set hw:mem_page_size explicitly if hw:numa_node is set. Also, this slide at https://www.linux-kvm.org/images/0/0b/03x03-Openstackpdf.pdf kind of indicates that hw:mem_page_size `Default to small pages`.
It defaults to unset. That results in small pages by default, but it is not the same as hw:mem_page_size=small or hw:mem_page_size=any: with it unset, no per-NUMA-node memory usage is claimed or tracked.
Another question: Let's say a VM runs on one host's numa node #0. If we live-migrate this VM to another host, and that host's numa node #1 has more free memory, is it possible for this VM to land on the other host's numa node #1?
Yes, it is. On newer releases we prefer to balance the load across NUMA nodes; on older releases Nova would fill the first NUMA node and then move on to the second.
Sean, I just tried setting one existing flavor (already with "hw:numa_nodes='1'" set as a property) with 'openstack flavor set --property hw:mem_page_size=small', and created a new VM from this flavor. Then I tried to live migrate this VM from its source hypervisor (which has many NUMA nodes, with this VM running on NUMA node 5) to another hypervisor (which has only 2 NUMA nodes), and the live migration failed. The log messages complain that there is no NUMA node 5 on the target hypervisor host. What else is needed for this NUMA live migration to work, so that the VM can end up on a different NUMA node on the target host? Is there any patch that needs to be backported for this to work?
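(For debugging cases like this, two generic checks that can help; these are standard Linux/libvirt tools rather than Nova commands, and the libvirt domain name is a placeholder:)

~~~
# on each hypervisor: how many NUMA nodes exist and how much memory is free on each
numactl --hardware

# for a running guest: which host NUMA node(s) its memory is bound to
virsh dumpxml instance-0000abcd | grep -A 3 '<numatune>'
~~~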
participants (4)
- Alvaro Soto
- hai wu
- Maksim Malchuk
- Sean Mooney