[nova] Live migration of a RAM intensive instance failed

smooney at redhat.com smooney at redhat.com
Mon Sep 18 10:42:41 UTC 2023


On Mon, 2023-09-18 at 12:13 +0200, Michel Jouvin wrote:
> Hi,
> 
> post-copy looks to me like a very attractive approach for these
> heavily loaded VMs, but I didn't understand that there is an inherent risk
> of data loss (except if there is an implementation bug)... Are you sure?

A simplified view of how post-copy works: it initially does a copy of the VM memory. If the VM is not
loaded, it just completes the migration as normal. If it detects a substantial delta in the memory since the initial copy
happened, it enters post-copy mode. In post-copy mode, the dirty pages are marked as dirty on the dest and the
VM resumes from the dest. In the background, QEMU continues to copy the dirty pages from the source to the dest,
and if the guest ever tries to read from a dirty page, it gets retrieved on demand from the source.
All writes in post-copy mode are made to the dest.
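
To make the flow above concrete, here is a toy Python model of the destination side in post-copy mode. It is only a sketch of the simplified description, not how QEMU implements it: the class and method names are made up for illustration, pages are plain dictionary entries, and a write replaces a whole page (a real partial write would first need the page fetched).

```python
# Toy model of post-copy page handling on the destination.
# All names here are hypothetical; this is not QEMU's implementation.

class PostCopyDest:
    def __init__(self, source_pages, dirty):
        # source_pages: page number -> contents still held by the source
        # dirty: pages whose latest contents have not reached the dest yet
        self.source = source_pages
        self.source_alive = True
        self.missing = set(dirty)
        # Pages copied in the initial pass and unchanged since then.
        self.local = {n: v for n, v in source_pages.items()
                      if n not in self.missing}

    def _fetch(self, page):
        # Pull one page from the source; fails if the source is unreachable.
        if not self.source_alive:
            raise RuntimeError(f"page {page} is unreachable: source is gone")
        self.local[page] = self.source[page]
        self.missing.discard(page)

    def read(self, page):
        # A read of a still-dirty page is served on demand from the source.
        if page in self.missing:
            self._fetch(page)
        return self.local[page]

    def write(self, page, value):
        # Writes always land on the dest (whole-page write in this toy model).
        self.local[page] = value
        self.missing.discard(page)

    def background_copy(self, max_pages):
        # Background transfer of pages that are still missing.
        for page in sorted(self.missing)[:max_pages]:
            self._fetch(page)


dest = PostCopyDest({0: "a", 1: "b", 2: "c"}, dirty={1, 2})
dest.read(1)               # fetched on demand from the source
dest.source_alive = False  # source crash or network partition
try:
    dest.read(2)           # page 2 never made it across
except RuntimeError as exc:
    print("guest is stuck:", exc)
```

The last few lines show the failure mode discussed below: once the source is unreachable, any page that was never transferred cannot be recovered by the dest.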

The possibility of data loss comes from 2 sources:
1.) if the QEMU process on the source crashes or is terminated by, say, an OOM event on the source host, then
any uncopied memory is lost.
2.) if one of your top-of-rack switches explodes and you have a network partition, or the connection
over which the data is being copied is broken, then the migration will fail.

So in both cases an external event causes the source VM to be unreachable from the dest.
That means the running VM on the dest can't access the required info. This can in some cases cause data loss.

Without post-copy, scenario 2 would not cause data loss and would have just caused the migration to be aborted;
the VM would have continued to run on the source node. Scenario 1 would have caused the data loss
regardless of doing a migration, so that is kind of irrelevant, i.e. if the VM gets killed as a result of an OOM event,
then any uncommitted disk writes or any data in memory is going to be lost even if you are not live migrating it.

You just have to decide if you feel comfortable with the possibility of the VM crashing if there is a network
partition. This should be a very, very rare event, or you have bigger problems in your datacenter than slow migrations,
but it's why we don't enable post-copy by default upstream. We do enable it by default in our downstream product, for what
it's worth, because we believe the risk is minimal, and when you are doing an emergency draining of a host due to hardware
failures, defaulting to a config that is likely to succeed is generally more desirable.
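
For reference, opting in would look roughly like the nova.conf fragment below on the compute nodes. The option names come from the configuration reference linked further down in this thread; the values are just an example of turning both on, not a recommendation for every deployment.

```ini
[libvirt]
# Allow nova to switch a live migration to post-copy mode.
# Not enabled by default upstream because of the network-partition risk above.
live_migration_permit_post_copy = true

# Optionally also allow auto converge.
live_migration_permit_auto_converge = true
```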

So tl;dr: 2 is the reason for the data-loss comment.

> 
> Michel
> 
> On 18/09/2023 at 11:29, Rafa wrote:
> > As I can read and understand from the source compute logs,
> > the memory is copied over successfully and there is no migration timeout.
> > But after the instance is paused there is something wrong happening.
> > I first thought it could be the short migration downtime (default=500ms),
> > that's why I increased the "live_migration_downtime" to higher values
> > (max was 300000ms; just for testing:) ) and nothing changed.
> > 
> > And the error message doesn't say much either.
> > 
> > I don't really want to use post-copy as it can lead to data loss.
> > 
> > Auto Converge doesn't seem to help either.
> > 
> > 
> > 
> > On Fri, 15 Sept 2023 at 13:56, <smooney at redhat.com> wrote:
> > > On Thu, 2023-09-14 at 19:07 +0200, Rafa wrote:
> > > > > Hi, could you share logs from the target compute node as well?
> > > > yes, here: https://paste.openstack.org/show/blpJE8krA1N6PVaLTVF0/
> > > > 
> > > If the VM is under heavy memory load, then it's advisable to use post-copy live migration.
> > >
> > > In general, live migration is not intended to be used with a VM under load,
> > > as there is no guarantee that it will ever complete. Post-copy live migration
> > > can significantly increase the probability that a VM under load will live migrate
> > > in a reasonable amount of time.
> > > 
> > > https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_permit_post_copy
> > > 
> > > Auto converge can also help, but it's less important.
> > > 
> > > https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.live_migration_permit_auto_converge
> > > 
> 



