[neutron] How to fix break of connectivity in case of L3 HA after reboot of backup node

LIU Yulong i at liuyulong.me
Fri Mar 20 21:30:31 UTC 2020

> Hello:
> As commented by Nate and Brian, and myself in [2] and [3], I prefer [2]. I understand this is a fix
> only for OVS, but:
> - It limits the solution to the external GW port plugging process, where the problem appears.
> - The second solution, as you commented, can introduce a race condition between the L3 agent and
> keepalived process, and a possible delay in the HA switch process.
> Regards.

We have been running the code from [3] for a long time and have not seen any delay in state changes.
Are there any test results that show a delay?

On Fri, 2020-03-20 at 14:40 -0400, Brian Haley wrote:
> On 3/20/20 11:57 AM, Nate Johnston wrote:
> > On Fri, Mar 20, 2020 at 03:37:49PM +0100, Slawek Kaplonski wrote:
> > > Hi,
> > > 
> > > We have bug [1] to solve. Basically, when node which is backup node for some
> > > router, connectivity to external gateway may be broken for some time. It's like
> > > that because when host is up and L3 agent is configuring qrouter namespace, it
> > > flush all IPv6 addresses from the qg- interface. And due to that some MLDv2
> > > packets are sent from this interface which may cause that ToR switch will learn
> > > qg port's mac address on wrong (backup) node.
> > > 
> > > This don't happens every time and for every router because it is a race between
> > > L3 agent and OVS agent. When L3 agent creates qg interface in br-int, it sets
> > > tag 4095 for it and traffic sent with such vlan tag is always dropped in br-int.
> > > So if L3 agent will flush IPv6 addresses before OVS agent wires the port and
> > > sets correct tag for it, then all is fine. But if OVS agent is first, then MLDv2
> > > packets are sent to the wire and there is this connectivity break.
> > > 
> > > There are proposed 2 ways of fixing this:
> > >   - [2] which proposes to add some kind of "communication" between L3 agent and
> > >     OVS agent and tell OVS agent that tag can be changed only after IPv6 config
> > >     is finished by L3 agent.

What if the ovs-agent has already finished processing the port, and only then the L3 agent sets the
port to INTERNAL_STATUS_ACTIVE = "active"? I don't think the port will be processed again, so will
it keep tag 4095 forever? Isn't that a race condition?
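
To make the ordering concern concrete, here is a toy model in Python. It is not the code of [2]:
the event names, the function and the rescan_on_ready switch are made up for illustration, and
whether the real ovs-agent processes the port again is exactly the open question.

```python
from typing import Iterable

DEAD_VLAN = 4095  # tag set on a new qg- port; br-int drops traffic carrying it


def final_tag(events: Iterable[str], local_vlan: int = 1,
              rescan_on_ready: bool = False) -> int:
    """Return the tag the port ends up with for a given event order.

    "l3_ready": the L3 agent marks the port active (IPv6 config finished).
    "ovs_scan": the OVS agent processes the port once.
    rescan_on_ready is a hypothetical switch: if True, marking the port
    active makes the OVS agent process it again.
    """
    tag = DEAD_VLAN
    ready = False
    for event in events:
        if event == "l3_ready":
            ready = True
            if rescan_on_ready:
                tag = local_vlan
        elif event == "ovs_scan":
            if ready:
                tag = local_vlan
        else:
            raise ValueError(f"unknown event: {event}")
    return tag


print(final_tag(["l3_ready", "ovs_scan"]))  # L3 agent finishes first
print(final_tag(["ovs_scan", "l3_ready"]))  # OVS agent scans first, no re-scan
```

In this model the second order leaves the port on 4095 unless something triggers a second pass.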

> > >     Downside of this solution is that it works for OVS agent only, Linuxbridge
> > >     agent may still hit the same issue. But plus is that after initial
> > >     configuration of the router, everything else regarding to failover is handled
> > >     by keepalived only - in same way like it is now.

HA router failover is only one case. Scheduling an HA router instance to a new L3 agent
(neutron l3-agent-router-add) hits the same root cause: the IPv6-related packets.

> > >   - [3] which sets qg NIC to be DOWN always on backup nodes. So when keepalived
> > >     failover active router to new node, L3 agent needs to come and switch
> > >     interfaces to be UP before it will work.
> > >     The plus of this solution is that it works for all OVS and
> > >     Linuxbridge L2 agents (and probably for others too) but downside is that
> > >     failover process is a bit longer and there may be potentially another race
> > >     condition between L3 agent and keepalived. Keepalived tries to send gARP
> > >     packets after switch node to be active, first attempt will always fail as
> > >     interface is still DOWN. But keepalived will retry those gARPs after some
> > >     time and this should be fine if L3 agent will already bring interface to be
> > >     UP.

This is what the patch https://review.opendev.org/#/c/712474/ is doing now.
After the transition to MASTER, keepalived sends gratuitous ARP 5 times (the default value of
vrrp_garp_master_repeat). There is also a delay (vrrp_garp_interval) between gratuitous ARP messages
sent on an interface (https://www.keepalived.org/manpage.html). Its default value is zero, which means
that if one attempt fails, the next one is made immediately. In some extreme situations keepalived may
fail to send the gARP packets out because the device or the underlay dataplane is not ready yet.
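
A rough way to see why the defaults matter is the small Python model below. It is only a sketch:
it assumes the gARP burst is `repeat` attempts spaced `interval` apart, counted from the transition
to MASTER, and it deliberately ignores every other keepalived timer. The function names and the use
of milliseconds are my own choices, not keepalived's.

```python
from typing import List, Optional


def garp_attempt_times_ms(repeat: int = 5, interval_ms: int = 0) -> List[int]:
    """Times (ms after becoming MASTER) at which the modelled gARPs are sent."""
    return [k * interval_ms for k in range(repeat)]


def first_successful_garp_ms(iface_up_at_ms: int, repeat: int = 5,
                             interval_ms: int = 0) -> Optional[int]:
    """Time of the first gARP sent once the interface is UP, or None.

    iface_up_at_ms is when the L3 agent has brought the qg- device UP.
    """
    for t in garp_attempt_times_ms(repeat, interval_ms):
        if t >= iface_up_at_ms:
            return t
    return None


# Interface comes UP 200 ms after the transition.
print(first_successful_garp_ms(200))                   # zero interval
print(first_successful_garp_ms(200, interval_ms=100))  # spaced attempts
```

With a zero interval all five attempts land at the same instant, so in this model a device that
comes UP even slightly later misses the whole burst, while a non-zero interval gives the later
attempts a chance.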

Actually, with the help of fix [3] and the related testing, we found potential shortcomings in the
keepalived config options, so tuning them should be a good change.

As for the race condition, we have not seen it locally. If there are any test results,
they would be very useful for identifying the problem.

> > 
> > Personally I find [2] more appealing.  I think that if we find many linuxbridge
> > users hitting this issue then we can replicate the solution for linuxbridge at
> > that time, but until then let's not worry about it - the majority of users use
> > OVS.  And the gARP timegap for solution #3 to me seems like a possibility for
> > problems or downtime.

The Linux bridge driver hits the same issue. The driver is still alive, and the fix is worth doing for stable branches as well.
The issue is very simple to simulate: just bring up a veth pair device and dump the packets on the interface, and you will see them.
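
One possible way to do that, sketched in Python. The script only builds and prints the commands, so
it is safe to run anywhere; the interface names veth-a/veth-b are placeholders, and the commands
themselves need root and should be tried on a scratch host. The capture filter is plain `ip6` on
purpose, so that packets carrying extension headers are not filtered out.

```python
import shlex
from typing import List


def veth_repro_commands(left: str = "veth-a",
                        right: str = "veth-b") -> List[List[str]]:
    """Commands to watch the IPv6 packets sent when a veth pair comes up."""
    return [
        # Create the pair; both ends start administratively DOWN.
        ["ip", "link", "add", left, "type", "veth", "peer", "name", right],
        # Bring up the capture side first so tcpdump can attach to it.
        ["ip", "link", "set", right, "up"],
        # Run this one in a second terminal; it blocks while capturing.
        ["tcpdump", "-n", "-i", right, "ip6"],
        # Bringing up the other end gives both ends carrier.
        ["ip", "link", "set", left, "up"],
    ]


if __name__ == "__main__":
    for argv in veth_repro_commands():
        print(shlex.join(argv))
```

Cleaning up afterwards is a single `ip link del veth-a`, which removes both ends of the pair.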

> I would agree, it seemed easier to understand to me as well.
> -Brian

For a running cloud you need to restart both the ovs-agent and the l3-agent to apply fix [2], and a centralized
network node usually has tons of ports, which take significant time for the ovs-agent to re-add. The restart
time of the L3 agent has to be counted as well.
And my opinion from the very beginning: fix [2] tries to extend the HA logic
from L3 to L2, and introduces a potential failure point in the ovs-agent for HA routers. That could have side effects such as
unexpected code regressions in the ovs-agent. Someday someone may say: "I just changed the ovs-agent code, so why does the HA router not work?"

> > > Both patches are waiting for pretty long time in gerrit and I want to bring more
> > > visibility for both of them. Please check them and maybe You will have some
> > > opinions about which solution would be better and which we should go with.
> > > 
> > > [1] https://bugs.launchpad.net/neutron/+bug/1859832
> > > [2] https://review.opendev.org/#/c/702856/
> > > [3] https://review.opendev.org/#/c/707406/
> > > 
> > > -- 
> > > Slawek Kaplonski
> > > Senior software engineer
> > > Red Hat
> > > 
> > > 

It seems we are repeating the discussion here. Why not go back to gerrit, since all the code links are pasted here anyway?

LIU Yulong