On 2024-11-22 11:22, Tobias Urdin wrote:
Hello,
Hello Tobias,
1. You're probably stuck on copying memory because the instance is using memory faster than you can migrate or pagefault it over.
This can be observed with virsh domjobinfo <instance> on the compute node. We tune the live migration downtime stepping in nova.conf to work around that, and after a while we just force it over, but your use case might not allow that.
Yes, this is the case, but I thought live_migration_permit_auto_converge would solve that. The other problem is the libvirt stack trace when I force completion...
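For reference, this is roughly what I'm watching on the source compute node; the downtime stepping options Tobias mentions are, as far as I can tell, these [libvirt] ones (the domain name is just an example, and the values shown are the defaults):

    # watch memory copy progress and dirty rate during the migration
    virsh domjobinfo instance-0000abcd

    # nova.conf on the compute nodes
    [libvirt]
    live_migration_downtime = 500        # max allowed downtime, in ms, reached in steps
    live_migration_downtime_steps = 10   # number of increments
    live_migration_downtime_delay = 75   # seconds between increments (scaled by guest size)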
2. Not sure, since you're using the Linux bridge agent, which has been moved to experimental; you probably want to plan a migration away from it. Look into enable_qemu_monitor_announce_self in nova.conf, so that nova does a QEMU announce_self in post-migration, sending out RARP frames after the migration is complete. That helps if there is a race condition between the RARP frames being sent out and the port being bound, which is the case for us when using the OVS agent.
Mmm, that's interesting. I hadn't seen that parameter. I'm trying to build a reproducible test and will try enable_qemu_monitor_announce_self. I'll come back here once I have a clear answer.
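In case it helps anyone else reading the thread: as far as I can tell the option lives in the [workarounds] section of nova.conf, so what I plan to test is simply:

    # nova.conf on the compute nodes
    [workarounds]
    # make nova ask QEMU to announce_self (RARP) again after live migration
    enable_qemu_monitor_announce_self = true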
/Tobias
Thank you, Tobias. -- Gilles
On Thu, Nov 21, 2024 at 06:12:26PM UTC, Gilles Mocellin wrote:
Hi Stackers,
I have two recurring problems with live migration. If someone has already faced these and has recommendations...
My env:
- OpenStack 2023.1, deployed by OpenStack-Ansible 27.5.1
- Network: ML2/LinuxBridge
- Some shared networks on a VLAN provider
- Most private networks on VXLAN
- OS: Ubuntu 20.04
- libvirt-daemon-system: 8.0.0-1ubuntu7.10~cloud0
- qemu-system-x86: 1:4.2-3ubuntu6.30
Each time we have maintenance to do on compute nodes, we live-migrate the instances with `nova host-evacuate-live` (there is still no openstack CLI command for that).
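Concretely, we run something like this per host (the hostname is just an example):

    # live-migrate everything off a compute node before maintenance
    nova host-evacuate-live compute-01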
1. Never-ending / failed migrations
In general it works, but each time we have 1 or 2 (out of about 40 instances) that don't want to complete. They are memory-active instances (Kubernetes controller nodes with etcd...), even though I have set live_migration_permit_auto_converge to True.
Now, if we try to force completion (openstack server migration force complete), the migration fails and the instance returns to ACTIVE status on the source. And I can see libvirt errors: [...] libvirt.libvirtError: operation failed: migration out job: unexpectedly failed
Not really helpful.
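For completeness, the sequence we use to force it is roughly the following (from memory; the IDs are placeholders):

    # find the running migration of the stuck instance
    openstack server migration list --server <instance-uuid>
    # then try to push it over
    openstack server migration force complete <instance-uuid> <migration-id>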
I tried removing live_migration_permit_auto_converge, reducing live_migration_completion_timeout to 60 and setting live_migration_timeout_action to force_complete.
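That is, roughly this in nova.conf on the compute nodes:

    [libvirt]
    # live_migration_permit_auto_converge removed (so back to its default, False)
    live_migration_completion_timeout = 60
    live_migration_timeout_action = force_complete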
Same problems.
2. Instance unreachable
More problematic, as it's not easy to spot: some instances are not reachable after migration, and after a while their DHCP leases expire.
The only way to recover is to restart neutron-linuxbridge-agent on the instance's compute node. In fact, on all compute nodes, because the evacuation spreads the instances across all of them.
Then, if the DHCP lease has not expired, the instance is reachable again.
We made a script to find those instances: it lists all ports bound to a nova instance, then connects to each control/network node, enters the qdhcp network namespaces (found via the MAC address in the DHCP leases files), and tries to ping the instance's private IP.
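Roughly, per port, the script does something like this on the network node (paths and placeholders are illustrative; adjust to your deployment):

    # port MACs come from: openstack port list --server <instance-uuid>
    NETNS="qdhcp-<network-id>"                            # DHCP namespace for the port's network
    LEASES="/var/lib/neutron/dhcp/<network-id>/leases"    # dnsmasq leases file for that network
    MAC="fa:16:3e:xx:yy:zz"
    # dnsmasq lease lines are: <expiry> <mac> <ip> <hostname> <client-id>
    IP=$(awk -v mac="$MAC" '$2 == mac {print $3}' "$LEASES")
    ip netns exec "$NETNS" ping -c 3 -W 2 "$IP" || echo "unreachable: $IP ($MAC)"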
The only log I see, though I don't know whether it's related, is: Device tapxxx-yy has no active binding in host
But I have seen that message since then as well, without any unreachable instances...
That's it; if something rings a bell, whether an existing bug or a configuration error, please let us know! Thanks.
-- Gilles