On 12/09/2025 09:59, Karl Kloppenborg wrote:
Hi Openstack Teams,
We’re attempting to live-migrate instances off of a node, but we continuously hit a timeout where the memory copy makes no progress:

2025-09-12 08:55:20.116 971204 INFO nova.virt.libvirt.driver [None req-250bc815-3323-4619-aef8-a11fdf92b27a 405874cabc5b4bf3912a5f89d54eb0d1 21eb701c2a1f48b38dab8f34c0a20902 - - default default] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] Migration running for 274 secs, memory 100% remaining (bytes processed=281505, remaining=8595255296, total=8604033024); disk 100% remaining (bytes processed=0, remaining=0, total=0).
2025-09-12 08:55:58.807 971204 INFO nova.virt.libvirt.driver [None req-250bc815-3323-4619-aef8-a11fdf92b27a 405874cabc5b4bf3912a5f89d54eb0d1 21eb701c2a1f48b38dab8f34c0a20902 - - default default] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] Migration running for 312 secs, memory 100% remaining (bytes processed=281505, remaining=8595255296, total=8604033024); disk 100% remaining (bytes processed=0, remaining=0, total=0).
2025-09-12 08:56:02.468 971204 INFO nova.compute.manager [None req-1f91a8d6-1fa7-47b2-8c3f-70925fb7a219 - - - - - -] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] During sync_power_state the instance has a pending task (migrating). Skip.
2025-09-12 08:56:37.985 971204 INFO nova.virt.libvirt.driver [None req-250bc815-3323-4619-aef8-a11fdf92b27a 405874cabc5b4bf3912a5f89d54eb0d1 21eb701c2a1f48b38dab8f34c0a20902 - - default default] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] Migration running for 352 secs, memory 100% remaining (bytes processed=281505, remaining=8595255296, total=8604033024); disk 100% remaining (bytes processed=0, remaining=0, total=0).
2025-09-12 08:57:17.377 971204 INFO nova.virt.libvirt.driver [None req-250bc815-3323-4619-aef8-a11fdf92b27a 405874cabc5b4bf3912a5f89d54eb0d1 21eb701c2a1f48b38dab8f34c0a20902 - - default default] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] Migration running for 391 secs, memory 100% remaining (bytes processed=281505, remaining=8595255296, total=8604033024); disk 100% remaining (bytes processed=0, remaining=0, total=0).
2025-09-12 08:57:56.877 971204 INFO nova.virt.libvirt.driver [None req-250bc815-3323-4619-aef8-a11fdf92b27a 405874cabc5b4bf3912a5f89d54eb0d1 21eb701c2a1f48b38dab8f34c0a20902 - - default default] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] Migration running for 431 secs, memory 100% remaining (bytes processed=281505, remaining=8595255296, total=8604033024); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Does anyone have any insight into this issue?
This most often happens when the guest you're trying to move is actively mutating state. The general mitigation is auto-converge, which adds pauses to guest CPU execution so that the migration can make forward progress, at the expense of degraded guest performance. Alternatively, you can use the more powerful/efficient post-copy mechanism:

https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...

Post-copy first tries to pre-copy the guest memory to the destination. If it gets into a state where the guest is modifying memory faster than it can be transferred, it swaps execution over to the destination VM. From that point, all writes stay local to the destination VM, and any reads of memory that has not yet been transferred are pulled on demand from the source VM.

https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...

Nova largely doesn't control how the memory is transferred; we just ask the hypervisor (libvirt/qemu in this case) to perform a migration and then monitor it for completion. We do not control how the transfer works beyond that.

What's a little odd in your case is that the stats are not changing at all as time passes. This implies that qemu is not able to transfer any memory whatsoever, which suggests you're hitting some kind of internal qemu bug or limitation. For example, if the guest memory is backed by 1G hugepages and the guest keeps dirtying those pages, the "remaining" value may stay the same or even increase. However, I would expect the "processed" value to increase as qemu tries to transfer the page over and over, restarting the transfer every time the page is modified. This is the problem that post-copy was designed to fix. It generally doesn't matter for small pages (i.e., the 4k pages we use by default): retransferring a 4k page is quick, so forward progress can be made.
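The pre-copy convergence argument above can be illustrated with a toy model (a sketch with hypothetical numbers, not Nova or QEMU code): each pre-copy pass resends the pages the guest dirtied during the previous pass, and the migration can only finish once the remaining dirty set is small enough to send within the allowed downtime window.

```python
# Toy pre-copy model (illustrative only; QEMU's real algorithm is more
# sophisticated). Converges only if the dirty set shrinks each pass.

def precopy_converges(total_bytes, bandwidth_bps, dirty_bps,
                      downtime_s, max_iters=30):
    """Return True if pre-copy can ever pause the guest within downtime_s."""
    remaining = total_bytes
    for _ in range(max_iters):
        # If what's left fits inside the allowed pause, we can converge.
        if remaining <= bandwidth_bps * downtime_s:
            return True
        # Time to send the current dirty set...
        t = remaining / bandwidth_bps
        # ...during which the guest dirties more memory to resend.
        remaining = dirty_bps * t
    return False

gb = 1024 ** 3
# ~8.6 GB guest, ~1.25 GB/s link, 500 ms allowed downtime:
print(precopy_converges(8.6 * gb, 1.25 * gb, 0.5 * gb, 0.5))  # dirty rate < bandwidth: converges
print(precopy_converges(8.6 * gb, 1.25 * gb, 2.0 * gb, 0.5))  # dirty rate > bandwidth: never converges
```

When the dirty rate exceeds the transfer bandwidth, the remaining set grows every pass, which is exactly the situation auto-converge (throttle the guest) and post-copy (flip execution to the destination) were built for.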
If you're using 2MB or 1GB hugepages, however, you need much, much higher network bandwidth to make any progress. If you have not tried post-copy, I would recommend enabling it and seeing if it helps.

The other possibility is that you are using vGPU (i.e., generic mdevs, or the new live migration of a PCI device that uses a vfio-variant driver with managed=no). The stats do not include the memory being transferred for those passthrough devices. To make live migration work in that case, you need to adjust the allowed downtime:

https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...

In our vGPU docs (https://docs.openstack.org/nova/latest/admin/virtual-gpu.html#caveats) we suggest:

live_migration_downtime = 500000
live_migration_downtime_steps = 3
live_migration_downtime_delay = 3

500000 is a ridiculously large value for that setting; it basically tells libvirt/qemu it can take as much time as it needs to transfer the memory and, in this case, pause the guest for a little over 8 minutes of total downtime. The number was chosen by adding a few zeros to our default of 500 ms of total downtime. Setting it somewhere in the 2000-10000 ms range might be more reasonable.

Putting this all together, with a tweak to https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv..., I think a reasonable config to run in production might be:

[libvirt]
live_migration_permit_post_copy = true
live_migration_downtime = 4000
live_migration_downtime_steps = 5
live_migration_downtime_delay = 15
live_migration_timeout_action = force_complete

However, I would advise reading the help text for each of those options to understand what it does and evaluate whether it fits your workload/SLA requirements.
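If post-copy is not an option in your deployment (it requires support on both source and destination hosts), a minimal nova.conf fragment to try auto-converge instead might look like the following. This is a sketch, not part of the recommendation above; whether the CPU throttling it introduces is acceptable depends on your workload:

```ini
[libvirt]
# Throttle the guest CPU when pre-copy is not making forward progress.
live_migration_permit_auto_converge = true
```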
Your help is greatly appreciated.
Thanks, Karl.
Karl Kloppenborg
Chief Technology Officer
m: +61 437 239 565 | resetdata.com <https://resetdata.com/>