[neutron] How to fix break of connectivity in case of L3 HA after reboot of backup node

Nate Johnston nate.johnston at redhat.com
Fri Mar 20 15:57:57 UTC 2020


On Fri, Mar 20, 2020 at 03:37:49PM +0100, Slawek Kaplonski wrote:
> Hi,
> 
> We have bug [1] to solve. Basically, when a node that is the backup node for
> some router is rebooted, connectivity to the external gateway may be broken for
> some time. This happens because, when the host comes up and the L3 agent
> configures the qrouter namespace, it flushes all IPv6 addresses from the qg-
> interface. As a result, some MLDv2 packets are sent from this interface, which
> may cause the ToR switch to learn the qg port's MAC address on the wrong
> (backup) node.
> 
> This doesn't happen every time or for every router, because it is a race
> between the L3 agent and the OVS agent. When the L3 agent creates the qg
> interface in br-int, it sets tag 4095 on it, and traffic sent with that VLAN
> tag is always dropped in br-int. So if the L3 agent flushes the IPv6 addresses
> before the OVS agent wires the port and sets the correct tag on it, all is
> fine. But if the OVS agent is first, the MLDv2 packets are sent to the wire and
> connectivity breaks.
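
To make the ordering dependence concrete, here is a minimal Python sketch of the race as described above. It is only a toy model, not Neutron code: the event names, the function name and the local VLAN value 100 are assumptions made up for illustration; only the dead tag 4095 comes from the description.

```python
DEAD_VLAN_TAG = 4095  # tag the L3 agent sets on a new qg- port (from the description)
WIRED_VLAN_TAG = 100  # hypothetical local VLAN the OVS agent assigns when wiring the port


def mld_reaches_wire(events):
    """Return True if the IPv6 flush happens while the port has a live VLAN tag.

    `events` is an ordered list of the two hypothetical event names
    "l3_flushes_ipv6" and "ovs_sets_tag".
    """
    tag = DEAD_VLAN_TAG
    leaked = False
    for event in events:
        if event == "ovs_sets_tag":
            tag = WIRED_VLAN_TAG
        elif event == "l3_flushes_ipv6":
            # The flush triggers MLDv2 packets; br-int drops them only
            # while the port still carries the dead VLAN tag.
            if tag != DEAD_VLAN_TAG:
                leaked = True
        else:
            raise ValueError(f"unknown event: {event}")
    return leaked
```

With the L3 agent first (`["l3_flushes_ipv6", "ovs_sets_tag"]`) the function returns False; with the OVS agent first it returns True, which is the case where the ToR switch can learn the MAC address on the backup node.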
> 
> Two ways of fixing this have been proposed:
>  - [2], which proposes to add some kind of "communication" between the L3 agent
>    and the OVS agent, telling the OVS agent that the tag can be changed only
>    after the L3 agent has finished the IPv6 configuration.
>    The downside of this solution is that it works for the OVS agent only; the
>    Linuxbridge agent may still hit the same issue. The plus is that after the
>    initial configuration of the router, everything else related to failover is
>    handled by keepalived only, in the same way as it is now.
>  - [3], which sets the qg NIC to always be DOWN on backup nodes. So when
>    keepalived fails the active router over to a new node, the L3 agent needs to
>    come and switch the interfaces to UP before it will work.
>    The plus of this solution is that it works for both the OVS and Linuxbridge
>    L2 agents (and probably for others too), but the downside is that the
>    failover process is a bit longer and there is potentially another race
>    condition between the L3 agent and keepalived. Keepalived tries to send gARP
>    packets after the node switches to active, and the first attempt will always
>    fail because the interface is still DOWN. But keepalived will retry those
>    gARPs after some time, and this should be fine if the L3 agent has already
>    brought the interface UP by then.
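
The timing concern with [3] can be sketched the same way. This is a toy model under stated assumptions, not keepalived or Neutron code: the function name and the example timings are hypothetical, and the only rule taken from the description is that a gARP sent while the interface is still DOWN is lost.

```python
from typing import Optional, Sequence


def first_successful_garp(
    garp_send_times: Sequence[float], iface_up_at: float
) -> Optional[float]:
    """Return the time of the first gARP sent once the qg interface is UP.

    Times are seconds since keepalived switched this node to active.
    Returns None if every gARP attempt happens before the interface is UP.
    """
    for send_time in sorted(garp_send_times):
        if send_time >= iface_up_at:
            return send_time
    return None
```

For example, with hypothetical gARP attempts at 0 s and 5 s and the L3 agent bringing the interface UP at 1.2 s, the first useful gARP goes out at 5 s, so the external network only relearns the MAC address then. If the interface came UP after the last retry, the function returns None, which is the extra race mentioned above.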

Personally I find [2] more appealing.  I think that if we find many Linuxbridge
users hitting this issue then we can replicate the solution for Linuxbridge at
that time, but until then let's not worry about it - the majority of users use
OVS.  And the gARP time gap in solution [3] seems to me like a possibility for
problems or downtime.

Nate

> Both patches have been waiting in Gerrit for a pretty long time, and I want to
> bring more visibility to both of them. Please check them; maybe you will have
> some opinions about which solution would be better and which we should go with.
> 
> [1] https://bugs.launchpad.net/neutron/+bug/1859832
> [2] https://review.opendev.org/#/c/702856/
> [3] https://review.opendev.org/#/c/707406/
> 
> -- 
> Slawek Kaplonski
> Senior software engineer
> Red Hat
> 
> 