Live Migration problems
Hi Stackers,

I have two recurring problems with live migration. If someone has already faced them and has recommendations...

My environment:
- OpenStack 2023.1, deployed by OpenStack-Ansible 27.5.1
- Network: ML2/Linux bridge
  - Some shared networks on a VLAN provider
  - Most private networks on VXLAN
- OS: Ubuntu 20.04
- libvirt-daemon-system: 8.0.0-1ubuntu7.10~cloud0
- qemu-system-x86: 1:4.2-3ubuntu6.30

Each time we have maintenance to do on compute nodes, we live-migrate the instances with `nova host-evacuate-live` (there is still no openstack CLI command for that).

1. Never-ending / failed migrations

In general it works, but each time there are 1 or 2 instances (out of 40) that don't want to complete. They are memory-active instances (Kubernetes controller nodes with etcd...), even though I have configured live_migration_permit_auto_converge to True.

Now, if we try to force completion (openstack server migration force complete), the migration fails, the instance returns to ACTIVE status on the source, and I can see libvirt errors:

[...] libvirt.libvirtError: operation failed: migration out job: unexpectedly failed

Not really helpful.

I tried removing live_migration_permit_auto_converge, reducing live_migration_completion_timeout to 60 and setting live_migration_timeout_action to force_complete. Same problems.

2. Instance unreachable

More problematic, because it's not easy to see: some instances are not reachable after migration, and after a while their DHCP leases expire.

The only way to recover is to restart neutron-linuxbridge-agent on the instance's compute node. In fact, on all compute nodes, because the evacuation spreads instances across all of them. Then, if the DHCP lease has not expired, the instance is reachable again.

We made a script to find those instances: it lists all ports bound to a Nova instance, then connects to each control/network node, enters the qdhcp network namespaces (found via the MAC address in the DHCP lease files), and tries to ping the instance's private IP.

The only log I see, but I don't know if it is related, is:

Device tapxxx-yy has no active binding in host

But I can see that message even now, without unreachable instances...

Here it is; if something rings a bell, an existing bug or a configuration error, please let us know!

Thanks.

--
Gilles
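For illustration only, the per-instance check that script performs could be sketched roughly like this; the network ID, IP and lease-file path are placeholders and assumptions, not the actual tooling:

```bash
#!/usr/bin/env bash
# Sketch of the reachability check described above, run on a control/network
# node that hosts the DHCP namespace of the instance's network.
NETWORK_ID="<neutron-network-uuid>"   # placeholder
INSTANCE_IP="<instance-private-ip>"   # placeholder

# The DHCP agent names its namespace qdhcp-<network-id>.
NS="qdhcp-${NETWORK_ID}"

# Optional: confirm the instance still has a lease (lease file path may differ).
grep -i "${INSTANCE_IP}" "/var/lib/neutron/dhcp/${NETWORK_ID}/leases" \
    || echo "no lease found for ${INSTANCE_IP}"

# Ping the instance's private IP from inside the qdhcp namespace.
ip netns exec "${NS}" ping -c 3 -W 2 "${INSTANCE_IP}"
```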
Hello,

1. You're probably stuck on copying memory because the instance is using memory faster than you can migrate or page-fault it over. This can be observed with virsh domjobinfo <instance> on the compute node. We optimize the live migration downtime stepping in nova.conf to work around that, and after a while we just force it over, but your use case might not allow that.

2. Not sure, since you're using the Linux bridge agent, which has been moved to experimental; you probably want to schedule migrating away from it. Look into enable_qemu_monitor_announce_self in nova.conf, so that Nova in post-migration does a QEMU announce_self that sends out RARP frames after the migration is complete, in case there is a race condition between the RARP frames being sent out and the port being bound, which is the case for us when using the OVS agent.

/Tobias
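For reference, the downtime stepping mentioned above is driven by options in the [libvirt] section of nova.conf; a minimal sketch with illustrative values (these appear to be the upstream defaults, but the right tuning depends on your workloads and release):

```ini
[libvirt]
# Maximum tolerated downtime, in milliseconds, during the final cut-over.
live_migration_downtime = 500
# Number of incremental steps used to ramp up to that maximum downtime.
live_migration_downtime_steps = 10
# Wait, in seconds per GiB of guest RAM, between each downtime increase.
live_migration_downtime_delay = 75
```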
On 2024-11-22 11:22, Tobias Urdin wrote:
Hello,
Hello Tobias,
1. You're probably stuck on copying memory because the instance is using memory faster than you can migrate or pagefault it over.
This can be observed with virsh domjobinfo <instance> on the compute node. We optimize the live migration downtime stepping in nova.conf to work around that and after a while we just force it over but your use-case might not allow that.
Yes, this is the case, but I thought live_migration_permit_auto_converge would solve that. And the other problem is the libvirt stacktrace when I force the completion...
2. Not sure, since you're using the Linux bridge agent, which has been moved to experimental; you probably want to schedule migrating away from it. Look into enable_qemu_monitor_announce_self in nova.conf, so that Nova in post-migration does a QEMU announce_self that sends out RARP frames after the migration is complete, in case there is a race condition between the RARP frames being sent out and the port being bound, which is the case for us when using the OVS agent.
Mmm, that's interesting. I didn't see that parameter. I'm trying to have a reproducible test and will try enable_qemu_monitor_announce_self. I'll come back here if I have a clear answer.
/Tobias
Thank you Tobias.

--
Gilles
On 22/11/2024 12:31, Gilles Mocellin wrote:
Yes, this is the case, but I thought live_migration_permit_auto_converge would solve that. And the other problem is the libvirt stacktrace when I force the completion...
In general, live_migration_permit_auto_converge, while it can help, is much less effective than live_migration_permit_post_copy.

I don't know if you are using 1G hugepages? Live migration is almost impossible with 1G hugepages without post copy, because writing a single bit of a 1G page requires the entire 1G page to be transferred again.

Post copy makes all writes go to the destination instance after an initial memory copy, and page-faults reads across the network as pages are needed, while copying them in the background.

Auto converge just adds micro-pauses that briefly stop the guest CPU cores to allow the migration to make progress. In other words, auto converge is hoping that if we slow the CPU execution enough the migration will eventually complete, but it can't actually guarantee it ever will.
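For reference, post copy is permitted via the [libvirt] section of nova.conf; a minimal sketch (as far as we understand, when both flags are enabled Nova prefers post copy over auto converge, but check the configuration reference for your release):

```ini
[libvirt]
# Allow Nova/libvirt to switch a non-converging migration to post-copy mode.
live_migration_permit_post_copy = True
# Auto converge can stay enabled; post copy takes precedence when both are permitted.
live_migration_permit_auto_converge = True
```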
enable_qemu_monitor_announce_self is a workaround option in Nova. It should work with Linux bridge too; however, I don't think it's required when using Linux bridge, because there are no OpenFlow rules etc. to configure. It won't hurt, though, so it's worth enabling if you are having downtime issues.
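For reference, that option lives in the [workarounds] section of nova.conf and is off by default; a minimal sketch of enabling it:

```ini
[workarounds]
# Ask QEMU to emit announce-self (RARP) frames again after live migration completes,
# to cover races between the guest's own announcements and the destination port wiring.
enable_qemu_monitor_announce_self = True
```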
By the way, you asked about `nova host-evacuate-live` and OSC. That command is a client-only implementation that we don't intend to ever port. We do not support it any more; it was deprecated when the novaclient CLI was deprecated, and it will be removed when the python-novaclient deliverable is removed. It's effectively just a for loop over the server list for a given host, with no error handling. Operators should not use it or build tooling around it. The same applies to the non-live version; it should not be used.
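To make the "for loop" concrete, a hedged sketch of an equivalent using the openstack CLI is below; flag spellings differ between OSC releases (older ones used `--live <host>` instead of `--live-migration`), so treat it as a starting point rather than a supported tool:

```bash
#!/usr/bin/env bash
# Rough, unsupported equivalent of `nova host-evacuate-live` with the openstack CLI.
# Assumptions: admin credentials are loaded and this OSC release supports
# `server migrate --live-migration`.
set -euo pipefail

SOURCE_HOST="$1"

# List all instances currently running on the source host (all projects).
openstack server list --all-projects --host "${SOURCE_HOST}" -f value -c ID |
while read -r server_id; do
    echo "Live-migrating ${server_id} off ${SOURCE_HOST}..."
    openstack server migrate --live-migration "${server_id}"

    # Poll until the instance leaves MIGRATING, so failures become visible
    # instead of being silently skipped.
    while true; do
        status=$(openstack server show "${server_id}" -f value -c status)
        if [ "${status}" != "MIGRATING" ]; then
            break
        fi
        sleep 10
    done
    echo "${server_id} is now ${status}"
done
```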
On Friday 22 November 2024 at 18:48:47 CET, Sean Mooney wrote:
On 22/11/2024 12:31, Gilles Mocellin wrote: [...]
In general, live_migration_permit_auto_converge, while it can help, is much less effective than live_migration_permit_post_copy.
I don't know if you are using 1G hugepages?
I don't think so, unless it's the default with OpenStack-Ansible on Ubuntu... I will check.
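For a quick check, something along these lines should show whether 1G hugepages are involved at all; the flavor name is a placeholder:

```bash
# On a compute node: are any hugepages configured or in use?
grep -i huge /proc/meminfo
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages 2>/dev/null

# On the API side: do the flavors used by the affected instances request huge pages?
# (hw:mem_page_size would show up in the flavor properties if they do.)
openstack flavor show <flavor-name> -f value -c properties
```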
Live migration is almost impossible with 1G hugepages without post copy, because writing a single bit of a 1G page requires the entire 1G page to be transferred again.
Post copy makes all writes go to the destination instance after an initial memory copy, and page-faults reads across the network as pages are needed, while copying them in the background.
I've read about post copy, but it seems too risky: we can really lose data if the source VM crashes and the destination hasn't caught up.
Auto converge just adds micro-pauses that briefly stop the guest CPU cores to allow the migration to make progress. In other words, auto converge is hoping that if we slow the CPU execution enough the migration will eventually complete, but it can't actually guarantee it ever will.
OK, no guarantee.
enable_qemu_monitor_announce_self is a workaround option in Nova.
It should work with Linux bridge too; however, I don't think it's required when using Linux bridge, because there are no OpenFlow rules etc. to configure. It won't hurt, though, so it's worth enabling if you are having downtime issues.
No OpenFlow rules, but there are iptables rules for security groups...
By the way, you asked about `nova host-evacuate-live` and OSC. That command is a client-only implementation that we don't intend to ever port. We do not support it any more; it was deprecated when the novaclient CLI was deprecated, and it will be removed when the python-novaclient deliverable is removed.
It's effectively just a for loop over the server list for a given host, with no error handling. Operators should not use it or build tooling around it. The same applies to the non-live version; it should not be used.
Oh, that's sad. It's really a must-have command for any operator, especially those who come from VMware vSphere. And it should even be in Horizon (in fact, it is! But probably also just a loop in Horizon...).

Ceph added a maintenance mode for nodes; that's really something needed for any hardware, firmware or OS update in any cluster. Kubernetes also has a way, with cordon / drain on nodes.

Thanks Sean,

PS: Sorry for the rant, but if we want OpenStack to be a serious alternative for VMware refugees, and also to survive the rise of KubeVirt, it should really have a dynamic scheduler like Kubernetes, to allow:
- Maintenance mode (live evacuation)
- HA (automatically restart instances when a hypervisor crashes)
- "DRS" (redistribute instances to spread the load evenly across all nodes; yes, Watcher has been resurrected and can do that, but it is really not dynamic)

For all of that, live migration should be rock solid. And I'm not even talking about integrated backup and monitoring.

End of rant, sorry again.

--
Gilles