[nova] Live migration of memory-intensive workload in error
Hello stackers,

I'm reaching out to the community to understand how you manage migrations of memory-intensive instances. We are running RHOSP 17.1 (based on Wallaby) and have faced several issues when live-migrating instances off compute nodes for maintenance work. For several memory-intensive instances, the process never completes properly and ends in error after hours of waiting. During the migration, we can see in the nova-compute.log of the destination compute node that the migration never truly converges: whenever the reported percentage of memory remaining gets close to 0%, it goes back up to a high percentage again.

By looking at the migration UUID, we saw that the memory processed bytes are far higher than the memory total bytes [1]. We played with several `live_migration` options in nova.conf, but to no avail.

Has anyone faced the same issue? And most importantly, how do you handle the migration of your memory-intensive workloads?

Regards,
Timothé

[1]. $ openstack server migration show 328af6d9-9c9d-4671-a8f9-2c0d3df32b93 e2cda681-e25c-4aca-813c-3641bc6164c9
+------------------------+------------------------------------------------------------------+
| Field                  | Value                                                            |
+------------------------+------------------------------------------------------------------+
| ID                     | 13231                                                            |
| Server UUID            | 328af6d9-9c9d-4671-a8f9-2c0d3df32b93                             |
| Status                 | running                                                          |
| Source Compute         | compute02                                                        |
| Source Node            | compute02                                                        |
| Dest Compute           | compute01                                                        |
| Dest Host              | None                                                             |
| Dest Node              | compute01                                                        |
| Memory Total Bytes     | 137448202240                                                     |
| Memory Processed Bytes | 5300502117730                                                    |
| Memory Remaining Bytes | 52182556672                                                      |
| Disk Total Bytes       | 0                                                                |
| Disk Processed Bytes   | 0                                                                |
| Disk Remaining Bytes   | 0                                                                |
| Created At             | 2026-03-24T10:49:15.000000                                       |
| Updated At             | 2026-03-24T13:14:32.000000                                       |
| UUID                   | e2cda681-e25c-4aca-813c-3641bc6164c9                             |
| User ID                | cc4367e52cce828fa3e378f29ed6df553c2dd99e9a4b33f1835fee719d592c91 |
| Project ID             | 0382d25c311149fabd7bea0d6fa3ac37                                 |
+------------------------+------------------------------------------------------------------+
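As a quick sanity check on the figures in [1], a minimal Python sketch (constants copied from the table above) makes the non-convergence explicit:

```python
# Sanity check of the figures in [1]: when "Memory Processed Bytes" is many
# multiples of "Memory Total Bytes" while "Memory Remaining Bytes" stays high,
# pre-copy is re-sending dirtied pages faster than it can drain them.
total_bytes = 137_448_202_240        # Memory Total Bytes (~128 GiB)
processed_bytes = 5_300_502_117_730  # Memory Processed Bytes
remaining_bytes = 52_182_556_672     # Memory Remaining Bytes

copy_passes = processed_bytes / total_bytes      # how many times RAM was copied
remaining_frac = remaining_bytes / total_bytes   # fraction still to transfer
print(f"~{copy_passes:.1f}x total RAM copied, {remaining_frac:.0%} still remaining")
```

Roughly 38 full passes over the guest's RAM with about a third still outstanding is the signature of a dirty rate that pre-copy cannot outrun.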
On 24/03/2026 13:28, Timothé Baugé wrote:
> Hello stackers,
> I'm reaching out to the community to understand how you manage migrations of memory-intensive instances. We are running RHOSP 17.1 (based on Wallaby) and have faced several issues when live-migrating instances off compute nodes for maintenance work.

for memory-intensive workloads (with or without hugepages) we recommend enabling post-copy live migration: https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...

this is effectively mandatory if a guest is using hugepages, but it is highly recommended in general. i thought this was the default in rhosp 17.1, but perhaps it only became the default in 18. can you confirm you have not overridden the default and that you are using post-copy? https://github.com/openstack-archive/tripleo-heat-templates/blob/stable/wall... in 16.2 this was not enabled by default and some rhosp users never updated their config for the new defaults.

> For several memory-intensive instances, the process never completes properly and ends in error after hours of waiting. During the migration, we can see in the nova-compute.log of the destination compute node that the migration never truly converges: whenever the reported percentage of memory remaining gets close to 0%, it goes back up to a high percentage again.
> By looking at the migration UUID, we saw that the memory processed bytes are far higher than the memory total bytes [1].

ya, that is what happens if the guest is writing to memory and dirtying pages at a higher rate than your network bandwidth will allow. if a guest dirties 1 byte in a memory page, the entire page needs to be copied again. that is tolerable for 4k pages without post-copy, but it is not for 1G hugepages and often is not feasible for 2MB hugepages either. does the vm have hugepages enabled?

> We played with several `live_migration` options in nova.conf, but to no avail.

setting

```
[libvirt]
live_migration_timeout_action=force_complete
```

can help with this, but it has the side effect that when the timeout expires it can result in perceptible downtime in the guest, because it will force-pause the guest for the memory copy to happen.

https://github.com/openstack-k8s-operators/nova-operator/blob/main/templates... has our new defaults downstream:

```
live_migration_permit_post_copy=true
live_migration_permit_auto_converge=true
live_migration_timeout_action=force_complete
```

i think these should actually be the defaults upstream, and we should eventually consider removing the option to disable them, but we are even more conservative in our upstream defaults than our already conservative downstream defaults.

there are some more advanced tunables like live_migration_downtime, live_migration_downtime_steps and live_migration_downtime_delay that might be of use, but i generally do not recommend changing those unless you have already enabled post-copy and the force_complete timeout action.

> Has anyone faced the same issue? And most importantly, how do you handle the migration of your memory-intensive workloads?
>
> Regards,
> Timothé
>
> [1]. $ openstack server migration show 328af6d9-9c9d-4671-a8f9-2c0d3df32b93 e2cda681-e25c-4aca-813c-3641bc6164c9 [table snipped; see the original message above]
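The dirty-page point above can be made concrete with a back-of-the-envelope sketch. This is illustrative only: the 10,000 pages/s dirty rate is an assumed figure, not a measurement from this thread.

```python
# Illustrative only: the minimum link rate pre-copy needs just to keep up
# with a guest's dirty rate. Dirtying one byte marks the whole page dirty,
# so required bandwidth scales with page size, not with bytes written.
def required_gbps(pages_per_sec: int, page_size_bytes: int) -> float:
    """Link rate (Gbit/s) needed to re-copy the pages dirtied each second."""
    return pages_per_sec * page_size_bytes * 8 / 1e9

DIRTY_RATE = 10_000  # assumed: guest touches 10,000 distinct pages per second
for label, size in [("4 KiB", 4 * 1024), ("2 MiB", 2 * 1024**2), ("1 GiB", 1024**3)]:
    print(f"{label}: {required_gbps(DIRTY_RATE, size):,.1f} Gbit/s")
```

At that assumed rate, 4 KiB pages need only ~0.3 Gbit/s, 2 MiB hugepages already need ~168 Gbit/s, and 1 GiB hugepages would need ~86 Tbit/s, which is why post-copy is effectively mandatory for hugepage guests.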
Hello Sean,

Thank you for your input. We did not override the defaults and have live_migration_permit_post_copy and live_migration_permit_auto_converge set to true. I ran some tests with live_migration_timeout_action set to force_complete, but it did not improve things: we saw the downtime induced by the force-pause of the instance, but sadly the instance and migration still ended in the error state.

Regarding hugepages, I'm not sure about all the instances that went into the error state over time, but the one I'm testing with has the default page size of 2048 kB.

I will look at the advanced tunables you mention to see if they can help solve my issue.

Regards,
Timothé

________________________________
From: Sean Mooney <smooney@redhat.com>
Sent: Tuesday, 24 March 2026 16:51
To: Timothé Baugé <timothe.bauge@linkt.fr>; openstack-discuss@lists.openstack.org <openstack-discuss@lists.openstack.org>
Subject: Re: [nova] Live migration of memory-intensive workload in error
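For anyone following along, the advanced tunables discussed in this thread live in the `[libvirt]` section of nova.conf. The values below are, to the best of my knowledge, nova's documented defaults; treat them as a starting point for experimentation, not a recommendation from this thread:

```ini
[libvirt]
# maximum permitted downtime (ms) during the final cutover
live_migration_downtime = 500
# number of incremental steps used to reach that maximum
live_migration_downtime_steps = 10
# wait time (s), scaled by guest RAM size in GiB, between downtime increase steps
live_migration_downtime_delay = 75
```

Raising live_migration_downtime trades a longer guest pause for a better chance of convergence, which is why it is only worth touching after post-copy and the force_complete timeout action are in place.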