On Mon, 2023-09-18 at 12:13 +0200, Michel Jouvin wrote:
Hi,
post-copy looks to me like a very attractive approach for these heavily loaded VMs, but I didn't understand that there is an inherent risk of data loss (unless there is an implementation bug)... Are you sure?
A simplified view of how post-copy works: it initially does a copy of the VM memory. If the VM is not loaded, it just syncs and completes the migration as normal. If it detects a substantial delta in the memory since the initial copy happened, it enters post-copy mode. In post-copy mode, the dirty pages are marked as dirty on the dest and the VM resumes from the dest. In the background, QEMU continues to copy the dirty pages from the source to the dest, and if the guest ever tries to read from a dirty page, it gets retrieved on demand from the source. All writes in post-copy mode are made to the dest.
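The flow described above can be sketched as a toy model. This is only an illustration under stated assumptions, not QEMU's actual algorithm: the function name, the page dictionaries, and the `threshold` cut-off for a "substantial" delta are all hypothetical, and the model assumes that a write to a stale page also pulls that page in first.

```python
def simulate_migration(snapshot, source_now, guest_ops, threshold=2):
    """Toy model of the pre-copy / post-copy decision; not QEMU's real algorithm.

    snapshot   -- page -> value at the time of the initial copy
    source_now -- page -> value on the source after the guest kept running
    guest_ops  -- ("read", page) or ("write", page, value), issued after resume
    threshold  -- hypothetical cut-off for a "substantial" delta
    """
    dest = dict(snapshot)  # initial bulk copy of the guest memory
    dirty = {p for p, v in source_now.items() if dest.get(p) != v}

    if len(dirty) <= threshold:
        # Small delta: sync the remaining pages and finish as a normal migration.
        for page in dirty:
            dest[page] = source_now[page]
        return "pre-copy", dest, []

    # Post-copy: the guest now runs on the destination, while the dirty pages
    # still exist only on the source until they are copied over.
    fetched_on_demand = []
    for op in guest_ops:
        page = op[1]
        if page in dirty:
            # First touch of a stale page: retrieve it from the source on demand.
            dest[page] = source_now[page]
            dirty.discard(page)
            fetched_on_demand.append(page)
        if op[0] == "write":
            dest[page] = op[2]  # writes land on the destination only

    # Background copy drains whatever the guest did not touch.
    for page in dirty:
        dest[page] = source_now[page]
    return "post-copy", dest, fetched_on_demand
```

If the source becomes unreachable while `dirty` is still non-empty, the pages left in it are exactly the memory that cannot be recovered.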
The possibility for data loss comes from 2 sources: 1.) if the QEMU process on the source crashes or is terminated by, say, an OOM event on the source host, then any uncopied memory is lost. 2.) if one of your top-of-rack switches explodes and you have a network partition, or the connection over which the data is being copied is broken, then the migration will fail.
So in both cases an external event causes the source VM to be unreachable from the dest. That means the running VM on the dest can't access the required info. This can in some cases cause data loss.
Without post-copy, scenario 2 would not cause data loss and would have just caused the migration to be aborted; the VM would have continued to run on the source node. Scenario 1 would have caused the data loss regardless of doing a migration, so that is kind of irrelevant. I.e. if the VM gets killed as a result of an OOM event, then any uncommitted disk writes or any data in memory are going to be lost even if you are not live migrating it.
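The two scenarios above can be summarised as a small lookup. This is just a restatement of the reasoning in this mail as code; the function name and the string labels are made up for illustration and do not correspond to any nova or libvirt API.

```python
def outcome(phase, event):
    """Return the consequence described above for a failure during migration.

    phase -- "pre-copy" (guest still runs on the source) or
             "post-copy" (guest has already resumed on the destination)
    event -- "source_qemu_dies" (scenario 1) or "network_partition" (scenario 2)
    """
    if event == "source_qemu_dies":
        # Guest memory on the source is gone in either phase; the same loss
        # would happen with no migration running at all.
        return "data loss"
    if event == "network_partition":
        if phase == "pre-copy":
            return "migration aborted, VM keeps running on source"
        return "possible data loss: dest VM cannot reach uncopied pages"
    raise ValueError(f"unknown event: {event!r}")
```

Only the ("post-copy", "network_partition") combination is a risk that post-copy itself introduces.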
You just have to decide if you feel comfortable with the possibility of the VM crashing if there is a network partition. This should be a very, very rare event, or you have bigger problems in your datacenter than slow migrations, but it's why we don't enable post-copy by default upstream. We do enable it by default in our downstream product, for what it's worth, because we believe the risk is minimal, and when you are doing an emergency draining of a host due to hardware failures, defaulting to a config that is likely to succeed is generally more desirable.
So tl;dr: 2 is the reason for the data-loss comment.
Michel
On 18/09/2023 at 11:29, Rafa wrote:
As far as I can read and understand from the source compute logs, the memory is copied over successfully and there is no migration timeout. But after the instance is paused, something goes wrong. I first thought it could be the short migration downtime (default=500ms), which is why I increased "live_migration_downtime" to higher values (max was 300000ms; just for testing :) ), and nothing changed.
And the error message doesn't say much either.
I don't really want to use post-copy as it can lead to data loss.
Auto Converge doesn't seem to help either.
On Fri, 15 Sept 2023 at 13:56, smooney@redhat.com wrote:
On Thu, 2023-09-14 at 19:07 +0200, Rafa wrote:
Hi, could you share logs from the target compute node as well?
yes, here: https://paste.openstack.org/show/blpJE8krA1N6PVaLTVF0/
If the VM is under heavy memory load then it's advisable to use post-copy live migration.
In general, live migration is not intended to be used with a VM under load, as there is no guarantee that it will ever complete. Post-copy live migration can significantly increase the probability that a VM under load will live migrate in a reasonable amount of time.
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...
Auto converge can also help, but it's less important.
https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.liv...
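As a rough sketch of what enabling these two features looks like, the snippet below parses a sample nova.conf fragment and reports which of them are switched on. It assumes the truncated links above refer to the `[libvirt]` options `live_migration_permit_post_copy` and `live_migration_permit_auto_converge`; the sample text and the helper name `migration_flags` are illustrative only, and the `False` fallback is this sketch's assumption for an unset option.

```python
import configparser

# Sample nova.conf fragment for a compute node (illustrative values only).
NOVA_CONF_SAMPLE = """
[libvirt]
live_migration_permit_post_copy = True
live_migration_permit_auto_converge = True
"""

OPTIONS = (
    "live_migration_permit_post_copy",
    "live_migration_permit_auto_converge",
)


def migration_flags(conf_text):
    """Return {option: bool} for the two [libvirt] live-migration switches."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # fallback=False is used when the section or the option is missing.
    return {
        name: parser.getboolean("libvirt", name, fallback=False)
        for name in OPTIONS
    }


if __name__ == "__main__":
    for name, enabled in migration_flags(NOVA_CONF_SAMPLE).items():
        print(f"{name}: {'enabled' if enabled else 'disabled'}")
```

The nova-compute service normally has to be restarted before a changed nova.conf is picked up.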