Open Stack

Fri Mar 20 14:37:49 UTC 2020

Hi,

We have bug [1] to solve. Basically, when a node which is the backup node for
some router is rebooted, connectivity to the external gateway may be broken for
some time. This happens because when the host comes up and the L3 agent is
configuring the qrouter namespace, it flushes all IPv6 addresses from the qg-
interface. As a result, some MLDv2 packets are sent from this interface, which
may cause the ToR switch to learn the qg port's MAC address on the wrong
(backup) node.

This doesn't happen every time or for every router, because it is a race
between the L3 agent and the OVS agent. When the L3 agent creates the qg
interface in br-int, it sets tag 4095 on it, and traffic sent with that VLAN
tag is always dropped in br-int. So if the L3 agent flushes the IPv6 addresses
before the OVS agent wires the port and sets the correct tag on it, all is
fine. But if the OVS agent is first, the MLDv2 packets are sent to the wire and
connectivity breaks.
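
To make the ordering concrete, here is a minimal Python sketch of the race. It
is a toy model, not Neutron code: the event names, the mld_leaks function and
the wired_tag value are hypothetical, and it assumes only what is described
above (traffic tagged 4095 is dropped in br-int, and the IPv6 flush triggers
MLDv2 packets).

```python
DEAD_VLAN_TAG = 4095  # tag the L3 agent sets on a new qg- port; br-int drops it


def mld_leaks(events, wired_tag=1):
    """Return True if the MLDv2 packets caused by the flush reach the wire.

    `events` is the order in which the two agents act (hypothetical names):
      "l3_flush_ipv6" - L3 agent flushes IPv6 addresses from the qg- interface
      "ovs_wire_port" - OVS agent sets the real VLAN tag on the port
    `wired_tag` is an arbitrary stand-in for the real local VLAN tag.
    """
    tag = DEAD_VLAN_TAG
    for event in events:
        if event == "ovs_wire_port":
            tag = wired_tag
        elif event == "l3_flush_ipv6":
            # The packets only escape if the port has already been re-tagged.
            return tag != DEAD_VLAN_TAG
        else:
            raise ValueError(f"unknown event: {event}")
    return False  # no flush happened, so nothing was sent


print(mld_leaks(["l3_flush_ipv6", "ovs_wire_port"]))  # False: dropped in br-int
print(mld_leaks(["ovs_wire_port", "l3_flush_ipv6"]))  # True: reaches the ToR
```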

Two ways of fixing this have been proposed:
 - [2] which proposes to add some kind of "communication" between the L3 agent
   and the OVS agent, telling the OVS agent that the tag can be changed only
   after the IPv6 configuration is finished by the L3 agent.
   The downside of this solution is that it works for the OVS agent only; the
   Linuxbridge agent may still hit the same issue. The plus is that after the
   initial configuration of the router, everything else related to failover is
   handled by keepalived only - in the same way as it is now.
 - [3] which sets the qg NIC to be always DOWN on backup nodes. So when
   keepalived fails the active router over to a new node, the L3 agent needs to
   come and switch the interfaces to UP before it will work.
   The plus of this solution is that it works for both the OVS and Linuxbridge
   L2 agents (and probably for others too), but the downside is that the
   failover process is a bit longer and there may potentially be another race
   condition between the L3 agent and keepalived. Keepalived tries to send gARP
   packets after switching the node to active, and the first attempt will
   always fail as the interface is still DOWN. But keepalived will retry those
   gARPs after some time, and this should be fine as long as the L3 agent has
   already brought the interface UP by then.
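
The possible second race in [3] can be sketched in the same toy style. This is
only an illustration: the function name and all the times below are
hypothetical and are not keepalived defaults. It assumes only that a gARP is
lost if it is sent while the interface is still DOWN.

```python
def first_successful_garp(garp_times, interface_up_at):
    """Return the time of the first gARP sent while the interface is UP.

    Both arguments are seconds after keepalived switches the node to active.
    Returns None if every gARP attempt happens before the L3 agent has
    brought the interface UP.
    """
    for t in sorted(garp_times):
        if t >= interface_up_at:
            return t
    return None


# Hypothetical timeline: one immediate gARP and one retry 5 seconds later.
garp_times = [0.0, 5.0]

# L3 agent brings the interface UP before the retry: the retry gets through.
print(first_successful_garp(garp_times, interface_up_at=1.5))  # 5.0

# L3 agent is too slow: both attempts are lost.
print(first_successful_garp(garp_times, interface_up_at=7.0))  # None
```

In this model, failover only completes cleanly if the L3 agent wins against
the last gARP retry.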

Both patches have been waiting in Gerrit for a pretty long time and I want to
bring more visibility to both of them. Please check them; maybe you will have
an opinion about which solution is better and which one we should go with.

[1] https://bugs.launchpad.net/neutron/+bug/1859832
[2] https://review.opendev.org/#/c/702856/
[3] https://review.opendev.org/#/c/707406/

-- 
Slawek Kaplonski
Senior software engineer
Red Hat

[neutron] How to fix break of connectivity in case of L3 HA after reboot of backup node