On 12/09/2025 09:59, Karl Kloppenborg wrote:
Hi Openstack Teams,
We’re attempting to live-migrate instances off of a node, but we continuously hit a timeout where the memory copy makes no progress:

2025-09-12 08:55:20.116 971204 INFO nova.virt.libvirt.driver [None req-250bc815-3323-4619-aef8-a11fdf92b27a 405874cabc5b4bf3912a5f89d54eb0d1 21eb701c2a1f48b38dab8f34c0a20902 - - default default] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] Migration running for 274 secs, memory 100% remaining (bytes processed=281505, remaining=8595255296, total=8604033024); disk 100% remaining (bytes processed=0, remaining=0, total=0).
2025-09-12 08:55:58.807 971204 INFO nova.virt.libvirt.driver [None req-250bc815-3323-4619-aef8-a11fdf92b27a 405874cabc5b4bf3912a5f89d54eb0d1 21eb701c2a1f48b38dab8f34c0a20902 - - default default] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] Migration running for 312 secs, memory 100% remaining (bytes processed=281505, remaining=8595255296, total=8604033024); disk 100% remaining (bytes processed=0, remaining=0, total=0).
2025-09-12 08:56:02.468 971204 INFO nova.compute.manager [None req-1f91a8d6-1fa7-47b2-8c3f-70925fb7a219 - - - - - -] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] During sync_power_state the instance has a pending task (migrating). Skip.
2025-09-12 08:56:37.985 971204 INFO nova.virt.libvirt.driver [None req-250bc815-3323-4619-aef8-a11fdf92b27a 405874cabc5b4bf3912a5f89d54eb0d1 21eb701c2a1f48b38dab8f34c0a20902 - - default default] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] Migration running for 352 secs, memory 100% remaining (bytes processed=281505, remaining=8595255296, total=8604033024); disk 100% remaining (bytes processed=0, remaining=0, total=0).
2025-09-12 08:57:17.377 971204 INFO nova.virt.libvirt.driver [None req-250bc815-3323-4619-aef8-a11fdf92b27a 405874cabc5b4bf3912a5f89d54eb0d1 21eb701c2a1f48b38dab8f34c0a20902 - - default default] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] Migration running for 391 secs, memory 100% remaining (bytes processed=281505, remaining=8595255296, total=8604033024); disk 100% remaining (bytes processed=0, remaining=0, total=0).
2025-09-12 08:57:56.877 971204 INFO nova.virt.libvirt.driver [None req-250bc815-3323-4619-aef8-a11fdf92b27a 405874cabc5b4bf3912a5f89d54eb0d1 21eb701c2a1f48b38dab8f34c0a20902 - - default default] [instance: 2f496843-a2c4-48d4-bbdc-149a2ea76f1c] Migration running for 431 secs, memory 100% remaining (bytes processed=281505, remaining=8595255296, total=8604033024); disk 100% remaining (bytes processed=0, remaining=0, total=0).
Does anyone have any insight into this issue?
This most often happens when the guest you're trying to move is actively mutating state. The general mitigation is auto-converge, which adds pauses to guest CPU execution so that the migration can make forward progress, at the expense of degraded guest performance. Alternatively, you can use the more powerful/efficient post-copy mechanism:

https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...

Post-copy first tries to pre-copy the guest memory to the destination. If it gets into a state where the guest is modifying memory faster than it can be transferred, it swaps execution over to the destination VM. From that point, all writes stay local to the destination VM, and any reads of memory that has not yet been transferred are pulled on demand from the source VM.

https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...

Nova largely doesn't control how the memory is transferred; we just ask the hypervisor (libvirt/qemu in this case) to perform a migration and then monitor it for completion. We do not control how the transfer works beyond that.

What's a little odd in your case is that the stats are not changing at all as time passes. This implies that qemu is not able to transfer any memory whatsoever, which suggests you're hitting some kind of internal qemu bug or limitation. For example, if the guest memory is backed by 1G hugepages and the guest keeps dirtying those pages, the "remaining" value may stay the same or even increase. However, I would expect the "processed" value to increase as qemu tries to transfer the page over and over, restarting the transfer every time the page is modified. This is the problem that post-copy was designed to fix. It generally doesn't matter for small pages (i.e., the 4k pages we use by default): retransferring a 4k page is quick, so forward progress can be made.
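The pre-copy convergence argument above can be illustrated with a toy model (a sketch with hypothetical numbers, not Nova or QEMU code): each pre-copy pass resends the pages the guest dirtied during the previous pass, and the migration can only finish once the remaining dirty set is small enough to send within the allowed downtime window.

```python
# Toy pre-copy model (illustrative only; QEMU's real algorithm is more
# sophisticated). Converges only if the dirty set shrinks each pass.

def precopy_converges(total_bytes, bandwidth_bps, dirty_bps,
                      downtime_s, max_iters=30):
    """Return True if pre-copy can ever pause the guest within downtime_s."""
    remaining = total_bytes
    for _ in range(max_iters):
        # If what's left fits inside the allowed pause, we can converge.
        if remaining <= bandwidth_bps * downtime_s:
            return True
        # Time to send the current dirty set...
        t = remaining / bandwidth_bps
        # ...during which the guest dirties more memory to resend.
        remaining = dirty_bps * t
    return False

gb = 1024 ** 3
# ~8.6 GB guest, ~1.25 GB/s link, 500 ms allowed downtime:
print(precopy_converges(8.6 * gb, 1.25 * gb, 0.5 * gb, 0.5))  # dirty rate < bandwidth: converges
print(precopy_converges(8.6 * gb, 1.25 * gb, 2.0 * gb, 0.5))  # dirty rate > bandwidth: never converges
```

When the dirty rate exceeds the transfer bandwidth, the remaining set grows every pass, which is exactly the situation auto-converge (throttle the guest) and post-copy (flip execution to the destination) were built for.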
If you're using 2MB or 1GB hugepages, however, you need much, much higher network bandwidth to make any progress. If you have not tried post-copy, I would recommend enabling it and seeing if it helps.

The other possibility is that you are using vGPU (i.e., generic mdevs, or the new live migration of a PCI device that uses a vfio-variant driver with managed=no). The stats do not include the memory being transferred for those passthrough devices. To make live migration work in that case, you need to adjust the allowed downtime:

https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...

In our vGPU docs (https://docs.openstack.org/nova/latest/admin/virtual-gpu.html#caveats) we suggest:

live_migration_downtime = 500000
live_migration_downtime_steps = 3
live_migration_downtime_delay = 3

500000 is a ridiculously large value for that setting; it basically tells libvirt/qemu it can take as much time as it needs to transfer the memory and, in this case, pause the guest for a little over 8 minutes of total downtime. The number was chosen by adding a few zeros to our default of 500 ms of total downtime. Setting it somewhere in the 2000-10000 ms range might be more reasonable.

Putting this all together, with a tweak to https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv..., I think a reasonable config to run in production might be:

[libvirt]
live_migration_permit_post_copy = true
live_migration_downtime = 4000
live_migration_downtime_steps = 5
live_migration_downtime_delay = 15
live_migration_timeout_action = force_complete

However, I would advise reading the help text for each of those options to understand what it does and evaluate whether it fits your workload/SLA requirements.
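If post-copy is not an option in your deployment (it requires support on both source and destination hosts), a minimal nova.conf fragment to try auto-converge instead might look like the following. This is a sketch, not part of the recommendation above; whether the CPU throttling it introduces is acceptable depends on your workload:

```ini
[libvirt]
# Throttle the guest CPU when pre-copy is not making forward progress.
live_migration_permit_auto_converge = true
```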
Your help is greatly appreciated.
Thanks, Karl.
Karl Kloppenborg
Chief Technology Officer
m: +61 437 239 565 | resetdata.com <https://resetdata.com/>