[ops][nova][neutron] Proper way to migrate instances between nodes with different ML2 agents types
Hi there,
I'm trying to find a way to migrate instances between hypervisors of an OpenStack cluster whose nodes run different ML2 agents (OVS and Linux bridge; I'm actually migrating the whole cluster to the latter).
The cluster is running Rocky. I enabled both mechanism drivers in the neutron-server configuration, and some nodes are running neutron-openvswitch-agent while others run neutron-linuxbridge-agent. My network nodes (running the L3 agent) are currently running neutron-openvswitch-agent. I also noticed that when nova-compute starts up, the VIF plugins for both OVS and Linux bridge are loaded ("INFO os_vif [-] Loaded VIF plugins: ovs, linux_bridge").
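For reference, this is roughly what the ML2 part of my neutron-server configuration looks like (a sketch from memory; the file path is per the Ubuntu packages and may differ in other deployments):

    # /etc/neutron/plugins/ml2/ml2_conf.ini (excerpt)
    [ml2]
    mechanism_drivers = openvswitch,linuxbridge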
When I start a live migration of an instance from a hypervisor using the OVS agent to a hypervisor using the Linux bridge agent, it fails because the destination hypervisor tries to execute 'ovs-*' commands to bind the VM to its network. I also tried cold migration, and also simply restarting a hypervisor with the bridge agent in place of the OVS one, but both fail similarly when the instances start up.
After some research, I discovered that the mechanism used to bind an instance port to a network is stored in the port binding configuration in the database, and that the code that executes the 'ovs-*' commands is actually located in the os_vif library used by nova-compute.
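(The current binding for a port can be inspected without digging into the database, e.g. with the openstack client; the field names below are from memory, so treat this as a sketch:)

    # sketch: show how a port is currently bound and on which host
    openstack port show <port-uuid> -c binding_vif_type -c binding_host_id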
So, I tried to remove the OVS plugin from the os_vif library. Ubuntu ships both plugins in the same package, so I just deleted the plugin directory under /usr/lib/python2.7/dist-packages (don't judge me please, it's for science ;-)). And... it worked as expected (port bindings are converted to the bridge mechanism), at least for cold migration (live migration is cancelled without any error message, I need to investigate more).
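A less destructive way to see which VIF plugins os-vif can discover (instead of deleting files) would be to list its entry points; a rough sketch, assuming os-vif's usual 'os_vif' entry-point namespace:

    # sketch: list the VIF plugins os-vif discovers via stevedore entry points
    python2 -c "from stevedore import extension; print([e.name for e in extension.ExtensionManager('os_vif')])"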
How can I do those migrations the proper way?
Thank you for any help!
Antoine
I have a short and simple question which I couldn't find a clear answer for in the documentation.
I understand that when a task raises an exception in a graph flow it will revert all parents; however, I can't find any information on whether it will also prevent the execution of all children.
I imagine yes, as the dependencies for those tasks are now unmet, but I would like to know for sure.
TL;DR: Does an exception in a graph-flow task prevent the execution of its children?
Kind Regards, Corne Lukken (Dantali0n)
Short answer is yes. When a task fails and the revert path starts, it goes back up the graph executing the revert methods and will not execute any children beyond the failed task.
That said, there is an option in the engine to disable reverts (execution will simply halt at the failed task), there are ways to make decision paths, and there is a pretty robust set of retry tools that can be applied in a revert situation.
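A minimal sketch of that behavior (illustrative only; the task and flow names here are made up): task "b" fails, "a" is reverted, and "c" never runs:

    from taskflow import engines, task
    from taskflow.patterns import graph_flow


    class A(task.Task):
        def execute(self):
            print("a executed")

        def revert(self, *args, **kwargs):
            print("a reverted")


    class B(task.Task):
        def execute(self):
            raise RuntimeError("boom")  # failure triggers the revert path


    class C(task.Task):
        def execute(self):
            print("c executed")  # never reached: its parent b failed


    a, b, c = A("a"), B("b"), C("c")
    flow = graph_flow.Flow("demo")
    flow.add(a, b, c)
    flow.link(a, b)  # a -> b
    flow.link(b, c)  # b -> c

    # prints "a executed", then "a reverted", then re-raises RuntimeError;
    # "c executed" is never printed
    engines.run(flow)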
Michael
On Fri, Nov 8, 2019 at 2:49 AM info@dantalion.nl wrote:
On 11/8/2019 3:53 AM, Antoine Millet wrote:
How can I do those migrations the proper way?
[1] was implemented in Rocky to support live migration between different networking backends (vif types).
A couple of things to check:
1. Is Neutron fully upgraded to Rocky and exposing the "Port Bindings Extended" (binding-extended) extension? Nova uses that to determine if neutron is new enough to create an inactive port binding for the target host prior to starting the live migration.
2. Are your nova-compute services all upgraded to at least Rocky and reporting version >=35 in the services table in the cell1 DB? [2]
3. Do you have [upgrade_levels]/compute RPC pinned to anything below Rocky? Or is that configured to "auto"? [3]
These are things to check just to make sure the basic upgrade requirements are satisfied before the code will even attempt the new-style binding flow for live migration; the commands sketched below should confirm each one.
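Roughly (a sketch; adjust DB names and config paths to your deployment):

    # 1. is the binding-extended API extension exposed by neutron?
    openstack extension list --network | grep -i binding-extended

    # 2. nova-compute service versions (the cell DB name is deployment-specific)
    mysql nova_cell1 -e "SELECT host, version FROM services WHERE \`binary\`='nova-compute';"

    # 3. any RPC version pins in nova.conf? (the option lives under [upgrade_levels])
    grep -A 3 '\[upgrade_levels\]' /etc/nova/nova.conf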
If that's all working properly, you should see this DEBUG log message on the source host during live migration [4].
[1] https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/neu...
[2] https://github.com/openstack/nova/blob/a90fe1951200ebd27fe74788c0a96c01104ac...
[3] https://docs.openstack.org/nova/rocky/configuration/config.html#upgrade_leve...
[4] https://github.com/openstack/nova/blob/a90fe1951200ebd27fe74788c0a96c01104ac...
Matt,
Thank you for your answer!
- Is Neutron fully upgraded to Rocky and exposing the "Port Bindings Extended" (binding-extended) extension? Nova uses that to determine if neutron is new enough to create an inactive port binding for the target host prior to starting the live migration.
I'm not sure how to test that, but my neutron components are all upgraded to Rocky / 13.0.4.
- Are your nova-compute services all upgraded to at least Rocky and reporting version >=35 in the services table in the cell1 DB? [2]
I can confirm that.
- Do you have [upgrade_levels]/compute RPC pinned to anything below Rocky? Or is that configured to "auto"?
The compute upgrade_levels are set to "auto" everywhere (control plane and compute nodes).
If that's all working properly, you should see this DEBUG log message on the source host during live migration [4].
I can actually see this message on the nova-compute logs:
2019-11-08 16:34:18.599 3995 DEBUG nova.virt.libvirt.migration [-] [instance: 3dbf401b-19bf-4342-a04e-6ac9cff99efe] Updating guest XML with vif config: <interface type="bridge">
And here is the problem at the same time at the destination:
2019-11-08 15:34:23.720 4434 ERROR os_vif AgentError: Error during following call to agent: ['ovs-vsctl', '--timeout=120', '--', '--if-exists', 'del-port', u'br-int', u'qvoea312a58-2e']
Antoine
On Fri, 2019-11-08 at 10:53 +0100, Antoine Millet wrote:
After some research, I discovered that the mechanism used to bind an instance port to a network is stored in the port binding configuration in the database, and that the code that executes the 'ovs-*' commands is actually located in the os_vif library used by nova-compute.
So, I tried to remove the OVS plugin from the os_vif library. Ubuntu ships both plugins in the same package, so I just deleted the plugin directory under /usr/lib/python2.7/dist-packages (don't judge me please, it's for science ;-)). And... it worked as expected (port bindings are converted to the bridge mechanism), at least for cold migration (live migration is cancelled without any error message, I need to investigate more).
So while that is an inventive approach, os-vif is not actually involved in the port binding process; it handles port plugging later.
I did some testing around this use case back in 2018 and found a number of gaps that need to be addressed to support live migration between Linux bridge and OVS or vice versa. First, the bridge name is not set in binding:vif_details by ml2/linux-bridge: https://bugs.launchpad.net/neutron/+bug/1788012
So if we try to go from OVS to Linux bridge, we generate the wrong XML and try to add the port to a Linux bridge called br-int:
Updating guest XML with vif config: <interface type="bridge">
  <mac address="fa:16:3e:a9:cf:09"/>
  <model type="virtio"/>
  <source bridge="br-int"/>
  <mtu size="1450"/>
  <target dev="tapbf69476a-25"/>
</interface>
Using mixed Linux bridge and OVS hosts also has other problems if you are using VXLAN or GRE, because neutron does not form a mesh tunnel overlay between the different ML2 drivers: https://bugs.launchpad.net/neutron/+bug/1788023
The Linux bridge agent also uses a different UDP port for VXLAN for historical reasons (VXLAN was merged into the Linux kernel, defaulting to port 8472, before the IANA port number, 4789, was assigned).
So in effect there is no supported way to do this with a live migration in Rocky, but there are ways to force it to work. The simplest is to cold migrate followed by a hard reboot, but you need to install both the OVS and Linux bridge tools on each host while only having one agent running.
You can also live migrate twice to the same host and hard reboot. The first migration will fail; the second should succeed but result in the VM tap device being connected to the wrong bridge, and the hard reboot fixes it.
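For the cold-migrate variant, that is roughly (a sketch; client syntax varies a bit between releases, and the server name is a placeholder):

    # cold migrate, confirm the resize, then hard reboot on the destination
    openstack server migrate <server-uuid>
    # wait for the server to reach VERIFY_RESIZE status, then:
    openstack server resize --confirm <server-uuid>
    openstack server reboot --hard <server-uuid>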
participants (5)
- Antoine Millet
- info@dantalion.nl
- Matt Riedemann
- Michael Johnson
- Sean Mooney