[nova] Live migration of a RAM intensive instance failed
Hello, I have a volume-backed instance with 16 vCPU and 64GB RAM. The instance uses 60GB RAM out of 64GB (used: 22GB; buff/cache 38GB). When I do a live migration of this instance, it fails without hitting any timeout. According to the logs, it copies almost all the RAM (within 150 - 250 seconds) to the target compute host without any problems. Then the instance is paused to copy the rest of the RAM. Everything seems to be working correctly up to this point, but then the instance resumes and the following error message appears:
Live Migration failure: operation failed: migration out job: unexpectedly failed: libvirt.libvirtError: operation failed: migration out job: unexpectedly failed
Unfortunately, this error message does not say much. It doesn't look like it's due to any timeouts or short downtimes, but I still tested different (higher) values for the following configuration options, without success:
- live_migration_completion_timeout
- live_migration_timeout_action: abort / force_complete (pause)
- live_migration_downtime
- live_migration_downtime_steps
- live_migration_downtime_delay
- live_migration_permit_auto_converge: True / False
All other instances on the same source and destination hosts can be live migrated without any issues. This instance can also be live migrated successfully right after a restart, probably because it is not yet heavily loaded. After a few hours, however, the live migration no longer works.
Any ideas what the problem could be?
Logs:
- nova-compute.log from source compute host:
https://paste.openstack.org/show/bJMFxnPKQBEVaPakud61/
- I found this traceback using journalctl:
https://paste.openstack.org/show/bIS4GFAd2RJ5fHVN9I8d/
- there was also an error in /var/log/libvirt/qemu/:
https://paste.openstack.org/show/bImT89IelDcXXBPSgTCO/
Environment:
- Libvirt: 8.0.0
- QEMU: 4.2.1
- Nova: 25.1.1
- OpenStack: Yoga
- Compute operating system: Ubuntu 20.04
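When testing many values for the options listed in the post, it can help to confirm which ones are actually set on a given compute host. The following is only a rough sketch: it assumes the options live in the [libvirt] section of nova.conf and that the file is simple enough for Python's configparser to read (oslo.config has features configparser does not model). The SAMPLE text and the function name are made up for illustration.

```python
import configparser

# Option names exactly as listed in the post; assumed to be in [libvirt].
LIVE_MIGRATION_OPTIONS = [
    "live_migration_completion_timeout",
    "live_migration_timeout_action",
    "live_migration_downtime",
    "live_migration_downtime_steps",
    "live_migration_downtime_delay",
    "live_migration_permit_auto_converge",
]


def read_live_migration_options(conf_text):
    """Return {option: value or None} from the [libvirt] section of conf_text."""
    # interpolation=None keeps '%' characters literal; strict=False
    # tolerates duplicated sections or keys.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return {
        name: parser.get("libvirt", name, fallback=None)
        for name in LIVE_MIGRATION_OPTIONS
    }


# Hypothetical nova.conf excerpt, not taken from the reporter's hosts.
SAMPLE = """
[DEFAULT]
debug = true

[libvirt]
live_migration_downtime = 300000
live_migration_timeout_action = force_complete
live_migration_permit_auto_converge = true
"""

if __name__ == "__main__":
    for name, value in read_live_migration_options(SAMPLE).items():
        shown = value if value is not None else "(not set, Nova default applies)"
        print(f"{name} = {shown}")
```

To use it on a real host, the contents of the host's nova.conf would be passed in place of SAMPLE. Keep in mind that nova-compute has to be restarted before changed values take effect.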
Hi, could you share logs from the target compute node as well?
Yes, here: https://paste.openstack.org/show/blpJE8krA1N6PVaLTVF0/
If the VM is under heavy memory load, then it's advisable to use post-copy live migration.
In general, live migration is not intended to be used with a VM under load, as there is no guarantee that it will ever complete. Post-copy live migration can significantly increase the probability that a VM under load will live migrate in a reasonable amount of time.
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...
Auto-converge can also help, but it's less important.
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...
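The point about there being no guarantee of completion can be illustrated with a toy model of iterative pre-copy. This is a deliberately simplified sketch, not QEMU's actual algorithm: it assumes a constant transfer bandwidth and a constant guest dirty rate, and every name and number in it is hypothetical.

```python
def precopy_rounds(ram_mb, bandwidth_mb_s, dirty_mb_s, max_downtime_ms, max_rounds=30):
    """Toy pre-copy model.

    Returns the round in which the final stop-and-copy fits into the
    allowed downtime, or None if that never happens within max_rounds.
    """
    remaining_mb = float(ram_mb)
    for round_no in range(1, max_rounds + 1):
        transfer_s = remaining_mb / bandwidth_mb_s
        # If the outstanding data can be sent within the allowed pause,
        # the guest can be stopped and the migration completed.
        if transfer_s * 1000 <= max_downtime_ms:
            return round_no
        # Otherwise copy while the guest keeps running; whatever it
        # dirties in the meantime is the work for the next round.
        remaining_mb = min(float(ram_mb), dirty_mb_s * transfer_s)
    return None


if __name__ == "__main__":
    # 64GB guest, 1000 MB/s link, 500 ms allowed downtime.
    print(precopy_rounds(65536, 1000, 100, 500))   # -> 4
    print(precopy_rounds(65536, 1000, 1200, 500))  # -> None
```

In this model the migration converges quickly when the dirty rate is well below the link bandwidth and never converges once the guest dirties memory as fast as it can be copied. That is the situation post-copy and auto-converge are meant to address.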
As far as I can tell from the source compute logs, the memory is copied over successfully and there is no migration timeout. But something goes wrong after the instance is paused. I first thought it could be the short migration downtime (default=500ms), which is why I increased "live_migration_downtime" to higher values (max was 300000ms; just for testing :) ), but nothing changed.
And the error message doesn't say much either.
I don't really want to use post-copy as it can lead to data loss.
Auto Converge doesn't seem to help either.
Hi,
Post-copy looks to me like a very attractive approach for these heavily loaded VMs, but I didn't realize that there is an inherent risk of data loss (except if there is an implementation bug)... Are you sure?
Michel
A simplified view of how post-copy works: it initially does a copy of the VM memory. Then, if the VM is not loaded, it just completes the migration as normal. If it detects a substantial delta in the memory since the initial copy happened, it enters post-copy mode. In post-copy mode, the dirty pages are marked as dirty on the dest and the VM resumes from the dest. In the background, QEMU continues to copy the dirty pages from the source to the dest, and if the guest ever tries to read from a dirty page, it gets retrieved on demand from the source. All writes in post-copy mode are made to the dest.
The possibility of data loss comes from 2 sources:
1.) If the QEMU process on the source crashes or is terminated by, say, an OOM event on the source host, then any uncopied memory is lost.
2.) If one of your top-of-rack switches explodes and you have a network partition, or the connection over which the data is being copied is broken, then the migration will fail.
So in both cases an external event causes the source VM to be unreachable from the dest. That means the running VM on the dest can't access the required info. This can in some cases cause data loss.
Without post-copy, scenario 2 would not cause data loss and would have just caused the migration to be aborted. The VM would have continued to run on the source node. Scenario 1 would have caused the data loss regardless of doing a migration, so that is kind of irrelevant. I.e. if the VM gets killed as a result of an OOM event, then any uncommitted disk writes or any data in memory is going to be lost even if you are not live migrating it.
You just have to decide if you feel comfortable with the possibility of the VM crashing if there is a network partition. This should be a very, very rare event, or you have bigger problems in your datacenter than slow migrations, but it's why we don't enable post-copy by default upstream. We do enable it by default in our downstream product, for what it's worth, because we believe the risk is minimal, and when you are doing an emergency draining of a host due to hardware failures, defaulting to a config that is likely to succeed is generally more desirable.
So tl;dr: 2 is the reason for the data-loss comment.
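The post-copy behaviour described above, and why losing the source matters, can be sketched with a small toy model. This is purely illustrative and does not reflect how QEMU implements post-copy. The class and method names are invented, and pages are treated as whole values, so a write simply replaces a page.

```python
class SourceUnreachable(Exception):
    """The destination needs a page the source can no longer serve."""


class PostCopyDest:
    """Toy destination side of a migration that has entered post-copy mode."""

    def __init__(self, source_pages, already_copied):
        self.source = dict(source_pages)  # page number -> contents on the source
        self.local = {p: self.source[p] for p in already_copied}
        self.source_alive = True          # flips to False on crash or partition

    def _fetch(self, page):
        if not self.source_alive:
            raise SourceUnreachable(f"page {page} was never copied")
        self.local[page] = self.source[page]

    def read(self, page):
        # Reading a page that is still outstanding pulls it on demand.
        if page not in self.local:
            self._fetch(page)
        return self.local[page]

    def write(self, page, data):
        # All writes land on the destination only.
        self.local[page] = data

    def background_copy(self, max_pages):
        """Copy up to max_pages outstanding pages; return how many were copied."""
        missing = [p for p in self.source if p not in self.local][:max_pages]
        for page in missing:
            self._fetch(page)
        return len(missing)


if __name__ == "__main__":
    dest = PostCopyDest({0: "a", 1: "b", 2: "c"}, already_copied=[0])
    print(dest.read(1))        # fetched on demand from the source
    dest.source_alive = False  # scenario 1 or 2: the source becomes unreachable
    try:
        dest.read(2)
    except SourceUnreachable as exc:
        print("guest cannot continue:", exc)
```

In the model, everything that reached the destination before the source disappeared stays usable, but the first access to an uncopied page fails. That corresponds to the guest on the destination being unable to access the information it needs.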
I didn't mean to imply that post-copy always results in data loss, but it is possible if the network connection between the source and destination hosts is lost during the post-copy operation.
participants (4)
- Eugen Block
- Michel Jouvin
- Rafa
- smooney@redhat.com