[neutron] How to fix break of connectivity in case of L3 HA after reboot of backup node
Nate Johnston
nate.johnston at redhat.com
Fri Mar 20 15:57:57 UTC 2020
On Fri, Mar 20, 2020 at 03:37:49PM +0100, Slawek Kaplonski wrote:
> Hi,
>
> We have bug [1] to solve. Basically, when a node which is a backup node for
> some router is rebooted, connectivity to the external gateway may be broken
> for some time. This happens because when the host comes up and the L3 agent is
> configuring the qrouter namespace, it flushes all IPv6 addresses from the qg-
> interface. As a result, some MLDv2 packets are sent from this interface, which
> may cause the ToR switch to learn the qg port's MAC address on the wrong
> (backup) node.
>
> This doesn't happen every time or for every router because it is a race
> between the L3 agent and the OVS agent. When the L3 agent creates the qg
> interface in br-int, it sets tag 4095 on it, and traffic sent with that VLAN
> tag is always dropped in br-int. So if the L3 agent flushes the IPv6 addresses
> before the OVS agent wires the port and sets the correct tag on it, all is
> fine. But if the OVS agent is first, the MLDv2 packets are sent to the wire
> and connectivity breaks.
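
To make the ordering explicit, here is a minimal Python sketch of the race
described above. It is a toy model, not Neutron code: the names
`DEAD_VLAN_TAG`, `Port` and `mld_leaks` are made up for illustration, and the
local VLAN tag 1 is arbitrary.

```python
DEAD_VLAN_TAG = 4095  # tag the L3 agent sets on a new qg- port; br-int drops it


class Port:
    """Toy model of a qg- port plugged into br-int (hypothetical, not Neutron code)."""

    def __init__(self):
        self.tag = DEAD_VLAN_TAG

    def can_reach_wire(self):
        # Traffic tagged 4095 is dropped in br-int.
        return self.tag != DEAD_VLAN_TAG


def mld_leaks(event_order):
    """Return True if MLDv2 packets triggered by the IPv6 flush reach the wire.

    event_order is a sequence of "ovs_wire" and "l3_flush" in the order
    in which the two agents act on the port.
    """
    port = Port()
    leaked = False
    for event in event_order:
        if event == "ovs_wire":
            # OVS agent wires the port and sets a real local VLAN tag
            # (the value 1 is an arbitrary placeholder).
            port.tag = 1
        elif event == "l3_flush":
            # Flushing IPv6 addresses causes MLDv2 packets to be sent.
            leaked = leaked or port.can_reach_wire()
        else:
            raise ValueError(f"unknown event: {event}")
    return leaked


print(mld_leaks(["l3_flush", "ovs_wire"]))  # False: packets dropped on tag 4095
print(mld_leaks(["ovs_wire", "l3_flush"]))  # True: packets reach the wire
```

Only the second ordering, where the OVS agent wins the race, lets the MLDv2
packets out of br-int.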
>
> Two ways of fixing this have been proposed:
> - [2], which proposes adding some kind of "communication" between the L3 agent
> and the OVS agent, telling the OVS agent that the tag can be changed only
> after the IPv6 config is finished by the L3 agent.
> The downside of this solution is that it works for the OVS agent only; the
> Linuxbridge agent may still hit the same issue. The plus is that after the
> initial configuration of the router, everything else related to failover is
> handled by keepalived only, the same way as it is now.
> - [3], which sets the qg NIC to always be DOWN on backup nodes. So when
> keepalived fails over the active router to a new node, the L3 agent needs to
> come and switch the interfaces to UP before it will work.
> The plus of this solution is that it works for both the OVS and Linuxbridge
> L2 agents (and probably for others too), but the downside is that the
> failover process is a bit longer and there may potentially be another race
> condition between the L3 agent and keepalived. Keepalived tries to send gARP
> packets after the node switches to active; the first attempt will always fail
> as the interface is still DOWN. But keepalived will retry those gARPs after
> some time, and this should be fine if the L3 agent has already brought the
> interface UP by then.
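
The timing concern with [3] can be sketched in the same way. This is a toy
model with made-up numbers, not keepalived or Neutron code:
`first_delivered_garp`, the gARP send times and the interface-up times are all
hypothetical.

```python
def first_delivered_garp(garp_times, interface_up_at):
    """Return the time of the first gARP sent once the interface is UP, or None.

    garp_times: times (seconds after failover) at which keepalived sends gARPs.
    interface_up_at: time at which the L3 agent brings the qg- interface UP.
    """
    for t in sorted(garp_times):
        if t >= interface_up_at:
            return t
    return None


# Hypothetical timings: one gARP right at failover, one retry 5 seconds later.
print(first_delivered_garp([0.0, 5.0], interface_up_at=1.5))  # 5.0
print(first_delivered_garp([0.0, 5.0], interface_up_at=7.0))  # None
```

In the first case the retry gets through, but only 5 seconds after failover.
In the second case every gARP is sent while the interface is still DOWN, which
is the additional race mentioned above.
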
Personally I find [2] more appealing. I think that if we find many Linuxbridge
users hitting this issue then we can replicate the solution for Linuxbridge at
that time, but until then let's not worry about it - the majority of users use
OVS. And the gARP time gap in solution [3] seems to me like a possibility for
problems or downtime.
Nate
> Both patches have been waiting in Gerrit for a pretty long time and I want to
> bring more visibility to both of them. Please check them; maybe you will have
> an opinion about which solution is better and which one we should go with.
>
> [1] https://bugs.launchpad.net/neutron/+bug/1859832
> [2] https://review.opendev.org/#/c/702856/
> [3] https://review.opendev.org/#/c/707406/
>
> --
> Slawek Kaplonski
> Senior software engineer
> Red Hat
>
>