OVN Routers don't seem to be highly available
I have OpenStack Dalmatian deployed with OVN. All networking works as expected, but OVN routers don't appear to be highly available. I followed this guide https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html and have enable-chassis-as-gw set on all the compute nodes.

I have a very simple network setup: one self-service network connected to a router, and that router connected to a provider network.

Looking in the OVN southbound database at the Port_Binding for the router, it has one chassis listed and an ha_chassis_group. When I look up that ID in the HA_Chassis_Group table, multiple compute nodes are listed. When I fail the server listed as the chassis for the port binding, the port is never assigned to a different chassis, and the instances on the self-service network lose connectivity to the provider network.

I've tried several different ways to fail the server. Unplugging the self-service NIC makes the Port_Binding chassis show as "[]", and it is never assigned to another; hard powering off the server leaves the chassis set to the old one, and it is never reassigned. Even waiting anywhere from several minutes to over an hour, it never gets reassigned. As soon as I bring the failed compute node back, the port gets assigned to it again and everything works.

The only way I can get a failover to another compute node is to stop the ovn-controller service on the server listed as the chassis for the port binding. In that case another chassis does get assigned and routing works as expected.

According to https://docs.openstack.org/neutron/latest/admin/ovn/routing.html, under "Failover (detected by BFD)", this should "just work". I did check the tunnels and BFD status, and BFD does show as down as expected, but nothing else happens.

Any ideas? I'm not sure what to look at to debug this; all my configs seem fine, since everything else works as expected.
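For anyone hitting something similar, the inspection steps above can be sketched with standard ovn-sbctl/ovs-vsctl commands. This is a minimal, hedged sketch: the logical port name is a placeholder (gateway router ports bind as chassisredirect ports named cr-lrp-<port-uuid>), and the exact output depends on your deployment.

```shell
# Find the router's gateway port binding in the OVN southbound DB.
# "cr-lrp-..." is a placeholder; list all bindings if you don't know the name.
ovn-sbctl list Port_Binding

# Inspect the HA chassis group and its members/priorities that the
# binding's ha_chassis_group column references.
ovn-sbctl list HA_Chassis_Group
ovn-sbctl list HA_Chassis

# From a surviving chassis, check BFD state on the tunnel interfaces;
# the failed peer's bfd_status should show forwarding as false.
ovs-vsctl list interface | grep -E '^name|bfd_status'
```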
It looks like a known issue: [ovn][l3-ha] BFD monitoring not working for network down: https://bugs.launchpad.net/neutron/+bug/1991791 On Thu, May 1, 2025 at 15:34, Ryan Belgrave (<rmb1993@gmail.com>) wrote:
Just a quick update: I figured it out. I needed three servers set as enable-chassis-as-gw. I guess that with only two, they aren't able to decide which one should bind the port without a tiebreaker. It is not obvious from the documentation that an odd number is needed, but I suppose it is rare to run an OpenStack environment with only two compute nodes. Thanks! On Thu, May 1, 2025 at 1:13 PM Ryan Belgrave <rmb1993@gmail.com> wrote:
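For reference, marking an additional node as a candidate gateway chassis is done with the external-ids setting from the manual-install guide linked above. A sketch, run on the node being added (the bridge mapping value here is an example and must match your provider network setup):

```shell
# Mark this chassis as eligible to host gateway (router) ports.
ovs-vsctl set open_vswitch . \
    external-ids:ovn-cms-options=enable-chassis-as-gw \
    external-ids:ovn-bridge-mappings=provider:br-provider

# Verify the chassis now appears in the router's HA chassis group.
ovn-sbctl list HA_Chassis_Group
```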
participants (2)
- Gabriel Talavera
- Ryan Belgrave