Live Migration problems
Hi Stackers,

I have two recurring problems with live migration. If someone has already faced them and has recommendations...

My environment:
- OpenStack 2023.1, deployed by OpenStack-Ansible 27.5.1
- Network: ML2/Linux bridge
  - Some shared networks on a VLAN provider
  - Most private networks on VXLAN
- OS: Ubuntu 20.04
- libvirt-daemon-system: 8.0.0-1ubuntu7.10~cloud0
- qemu-system-x86: 1:4.2-3ubuntu6.30

Each time we have maintenance to do on compute nodes, we live-migrate the instances with `nova host-evacuate-live` (there is still no openstack CLI command for that).

1. Never-ending / failed migrations

In general it works, but each time there are 1 or 2 instances (out of 40) that don't want to complete. They are memory-active instances (Kubernetes controller nodes with etcd...), even though I have configured live_migration_permit_auto_converge to True.

Now, if we try to force completion (openstack server migration force complete), the migration fails, the instance returns to ACTIVE status on the source, and I can see libvirt errors:

[...] libvirt.libvirtError: operation failed: migration out job: unexpectedly failed

Not really helpful.

I tried removing live_migration_permit_auto_converge, reducing live_migration_completion_timeout to 60 and setting live_migration_timeout_action to force_complete. Same problems.

2. Instance unreachable

More problematic, because it's not easy to see: some instances are not reachable after migration, and after a while their DHCP leases expire.

The only way to recover is to restart neutron-linuxbridge-agent on the instance's compute node. In fact, on all compute nodes, because the evacuation spreads instances across all of them. Then, if the DHCP lease has not expired, the instance is reachable again.

We made a script to find those instances: it lists all ports bound to a Nova instance, then connects to each control/network node, enters the qdhcp network namespaces (found via the MAC address in the DHCP lease files), and tries to ping the instance's private IP.

The only log I see, but I don't know if it is related, is:

Device tapxxx-yy has no active binding in host

But I can see that message even now, without unreachable instances...

Here it is; if something rings a bell, an existing bug or a configuration error, please let us know!

Thanks.

--
Gilles
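For illustration only, the per-instance check that script performs could be sketched roughly like this; the network ID, IP and lease-file path are placeholders and assumptions, not the actual tooling:

```bash
#!/usr/bin/env bash
# Sketch of the reachability check described above, run on a control/network
# node that hosts the DHCP namespace of the instance's network.
NETWORK_ID="<neutron-network-uuid>"   # placeholder
INSTANCE_IP="<instance-private-ip>"   # placeholder

# The DHCP agent names its namespace qdhcp-<network-id>.
NS="qdhcp-${NETWORK_ID}"

# Optional: confirm the instance still has a lease (lease file path may differ).
grep -i "${INSTANCE_IP}" "/var/lib/neutron/dhcp/${NETWORK_ID}/leases" \
    || echo "no lease found for ${INSTANCE_IP}"

# Ping the instance's private IP from inside the qdhcp namespace.
ip netns exec "${NS}" ping -c 3 -W 2 "${INSTANCE_IP}"
```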
Hello,

1. You're probably stuck on copying memory because the instance is using memory faster than you can migrate or page-fault it over. This can be observed with virsh domjobinfo <instance> on the compute node. We optimize the live migration downtime stepping in nova.conf to work around that, and after a while we just force it over, but your use case might not allow that.

2. Not sure, since you're using the Linux bridge agent, which has been moved to experimental; you probably want to schedule migrating away from it. Look into enable_qemu_monitor_announce_self in nova.conf, so that Nova in post-migration does a QEMU announce_self that sends out RARP frames after the migration is complete, in case there is a race condition between the RARP frames being sent out and the port being bound, which is the case for us when using the OVS agent.

/Tobias
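For reference, the downtime stepping mentioned above is driven by options in the [libvirt] section of nova.conf; a minimal sketch with illustrative values (these appear to be the upstream defaults, but the right tuning depends on your workloads and release):

```ini
[libvirt]
# Maximum tolerated downtime, in milliseconds, during the final cut-over.
live_migration_downtime = 500
# Number of incremental steps used to ramp up to that maximum downtime.
live_migration_downtime_steps = 10
# Wait, in seconds per GiB of guest RAM, between each downtime increase.
live_migration_downtime_delay = 75
```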
On 2024-11-22 11:22, Tobias Urdin wrote:
Hello,
Hello Tobias,
1. You're probably stuck on copying memory because the instance is using memory faster than you can migrate or pagefault it over.
This can be observed with virsh domjobinfo <instance> on the compute node. We optimize the live migration downtime stepping in nova.conf to work around that and after a while we just force it over but your use-case might not allow that.
Yes, this is the case, but I thought live_migration_permit_auto_converge would solve that. And the other problem is the libvirt stacktrace when I force the completion...
2. Not sure, since you're using the Linux bridge agent, which has been moved to experimental; you probably want to schedule migrating away from it. Look into enable_qemu_monitor_announce_self in nova.conf, so that Nova in post-migration does a QEMU announce_self that sends out RARP frames after the migration is complete, in case there is a race condition between the RARP frames being sent out and the port being bound, which is the case for us when using the OVS agent.
Mmm, that's interesting. I didn't see that parameter. I'm trying to have a reproducible test and will try enable_qemu_monitor_announce_self. I'll come back here if I have a clear answer.
/Tobias
Thank you Tobias.

--
Gilles
On 22/11/2024 12:31, Gilles Mocellin wrote:
Yes, this is the case, but I thought live_migration_permit_auto_converge would solve that. And the other problem is the libvirt stacktrace when I force the completion...
In general, live_migration_permit_auto_converge, while it can help, is much less effective than live_migration_permit_post_copy.

I don't know if you are using 1G hugepages? Live migration is almost impossible with 1G hugepages without post copy, because writing a single bit of a 1G page requires the entire 1G page to be transferred again.

Post copy makes all writes go to the destination instance after an initial memory copy, and page-faults reads across the network as pages are needed, while copying them in the background.

Auto converge just adds micro-pauses that briefly stop the guest CPU cores to allow the migration to make progress. In other words, auto converge is hoping that if we slow the CPU execution enough the migration will eventually complete, but it can't actually guarantee it ever will.
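For reference, post copy is permitted via the [libvirt] section of nova.conf; a minimal sketch (as far as we understand, when both flags are enabled Nova prefers post copy over auto converge, but check the configuration reference for your release):

```ini
[libvirt]
# Allow Nova/libvirt to switch a non-converging migration to post-copy mode.
live_migration_permit_post_copy = True
# Auto converge can stay enabled; post copy takes precedence when both are permitted.
live_migration_permit_auto_converge = True
```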
enable_qemu_monitor_announce_self is a workaround option in Nova. It should work with Linux bridge too; however, I don't think it's required when using Linux bridge, because there are no OpenFlow rules etc. to configure. It won't hurt, though, so it's worth enabling if you are having downtime issues.
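For reference, that option lives in the [workarounds] section of nova.conf and is off by default; a minimal sketch of enabling it:

```ini
[workarounds]
# Ask QEMU to emit announce-self (RARP) frames again after live migration completes,
# to cover races between the guest's own announcements and the destination port wiring.
enable_qemu_monitor_announce_self = True
```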
By the way, you asked about `nova host-evacuate-live` and OSC. That command is a client-only implementation that we don't intend to ever port. We do not support it any more; it was deprecated when the novaclient CLI was deprecated, and it will be removed when the python-novaclient deliverable is removed. It's effectively just a for loop over the server list for a given host, with no error handling. Operators should not use it or build tooling around it. The same applies to the non-live version; it should not be used.
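To make the "for loop" concrete, a hedged sketch of an equivalent using the openstack CLI is below; flag spellings differ between OSC releases (older ones used `--live <host>` instead of `--live-migration`), so treat it as a starting point rather than a supported tool:

```bash
#!/usr/bin/env bash
# Rough, unsupported equivalent of `nova host-evacuate-live` with the openstack CLI.
# Assumptions: admin credentials are loaded and this OSC release supports
# `server migrate --live-migration`.
set -euo pipefail

SOURCE_HOST="$1"

# List all instances currently running on the source host (all projects).
openstack server list --all-projects --host "${SOURCE_HOST}" -f value -c ID |
while read -r server_id; do
    echo "Live-migrating ${server_id} off ${SOURCE_HOST}..."
    openstack server migrate --live-migration "${server_id}"

    # Poll until the instance leaves MIGRATING, so failures become visible
    # instead of being silently skipped.
    while true; do
        status=$(openstack server show "${server_id}" -f value -c status)
        if [ "${status}" != "MIGRATING" ]; then
            break
        fi
        sleep 10
    done
    echo "${server_id} is now ${status}"
done
```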
On Friday 22 November 2024 at 18:48:47 CET, Sean Mooney wrote:
On 22/11/2024 12:31, Gilles Mocellin wrote: [...]
In general, live_migration_permit_auto_converge, while it can help, is much less effective than live_migration_permit_post_copy.
I don't know if you are using 1G hugepages?
I don't think so, unless it's the default with OpenStack-Ansible on Ubuntu... I will check.
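For a quick check, something along these lines should show whether 1G hugepages are involved at all; the flavor name is a placeholder:

```bash
# On a compute node: are any hugepages configured or in use?
grep -i huge /proc/meminfo
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages 2>/dev/null

# On the API side: do the flavors used by the affected instances request huge pages?
# (hw:mem_page_size would show up in the flavor properties if they do.)
openstack flavor show <flavor-name> -f value -c properties
```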
Live migration is almost impossible with 1G hugepages without post copy, because writing a single bit of a 1G page requires the entire 1G page to be transferred again.
Post copy makes all writes go to the destination instance after an initial memory copy, and page-faults reads across the network as pages are needed, while copying them in the background.
I've read about post copy, but it seems too risky: we can really lose data if the source VM crashes and the destination hasn't caught up.
Auto converge just adds micro-pauses that briefly stop the guest CPU cores to allow the migration to make progress. In other words, auto converge is hoping that if we slow the CPU execution enough the migration will eventually complete, but it can't actually guarantee it ever will.
OK, no guarantee.
enable_qemu_monitor_announce_self is a workaround option in Nova.
It should work with Linux bridge too; however, I don't think it's required when using Linux bridge, because there are no OpenFlow rules etc. to configure. It won't hurt, though, so it's worth enabling if you are having downtime issues.
No OpenFlow rules, but there are iptables rules for security groups...
By the way, you asked about `nova host-evacuate-live` and OSC. That command is a client-only implementation that we don't intend to ever port. We do not support it any more; it was deprecated when the novaclient CLI was deprecated, and it will be removed when the python-novaclient deliverable is removed.
It's effectively just a for loop over the server list for a given host, with no error handling. Operators should not use it or build tooling around it. The same applies to the non-live version; it should not be used.
Oh, that's sad. It's really a must-have command for any operator, especially those who come from VMware vSphere. And it should even be in Horizon (in fact, it is! But probably also just a loop in Horizon...).

Ceph added a maintenance mode for nodes; that's really something needed for any hardware, firmware or OS update in any cluster. Kubernetes also has a way, with cordon / drain on nodes.

Thanks Sean,

PS: Sorry for the rant, but if we want OpenStack to be a serious alternative for VMware refugees, and also to survive the rise of KubeVirt, it should really have a dynamic scheduler like Kubernetes, to allow:
- Maintenance mode (live evacuation)
- HA (automatically restart instances when a hypervisor crashes)
- "DRS" (redistribute instances to spread the load evenly across all nodes; yes, Watcher has been resurrected and can do that, but it is really not dynamic)

For all of that, live migration should be rock solid. And I'm not even talking about integrated backup and monitoring.

End of rant, sorry again.

--
Gilles