[neutron] How to fix break of connectivity in case of L3 HA after reboot of backup node

Brian Haley haleyb.dev at gmail.com
Fri Mar 20 18:40:09 UTC 2020


On 3/20/20 11:57 AM, Nate Johnston wrote:
> On Fri, Mar 20, 2020 at 03:37:49PM +0100, Slawek Kaplonski wrote:
>> Hi,
>>
>> We have bug [1] to solve. Basically, when a node which is the backup node for
>> some router is rebooted, connectivity to the external gateway may be broken for
>> some time. This happens because, when the host comes up and the L3 agent
>> configures the qrouter namespace, it flushes all IPv6 addresses from the qg-
>> interface. As a result, some MLDv2 packets are sent from this interface, which
>> may cause the ToR switch to learn the qg port's MAC address on the wrong
>> (backup) node.
>>
>> This doesn't happen every time or for every router, because it is a race between
>> the L3 agent and the OVS agent. When the L3 agent creates the qg interface in
>> br-int, it sets tag 4095 on it, and traffic sent with that VLAN tag is always
>> dropped in br-int. So if the L3 agent flushes the IPv6 addresses before the OVS
>> agent wires the port and sets the correct tag for it, all is fine. But if the
>> OVS agent is first, the MLDv2 packets are sent to the wire and connectivity
>> breaks.
>>
>> Two ways of fixing this have been proposed:
>>   - [2] proposes adding some kind of "communication" between the L3 agent and
>>     the OVS agent, telling the OVS agent that the tag can be changed only after
>>     the L3 agent has finished the IPv6 configuration.
>>     The downside of this solution is that it works for the OVS agent only; the
>>     Linuxbridge agent may still hit the same issue. The plus is that after the
>>     initial configuration of the router, everything else related to failover is
>>     handled by keepalived only, the same way as it is now.
>>   - [3] sets the qg NIC to be always DOWN on backup nodes. So when keepalived
>>     fails the active router over to a new node, the L3 agent needs to come and
>>     switch the interfaces to UP before it will work.
>>     The plus of this solution is that it works for both the OVS and Linuxbridge
>>     L2 agents (and probably for others too), but the downside is that the
>>     failover process is a bit longer and there may be another race condition
>>     between the L3 agent and keepalived. Keepalived tries to send gARP packets
>>     after the node switches to active; the first attempt will always fail
>>     because the interface is still DOWN. But keepalived will retry those gARPs
>>     after some time, and this should be fine if the L3 agent has already
>>     brought the interface UP by then.
> 
> Personally I find [2] more appealing.  I think that if we find many linuxbridge
> users hitting this issue then we can replicate the solution for linuxbridge at
> that time, but until then let's not worry about it - the majority of users use
> OVS.  And the gARP time gap in solution [3] seems to me like a possibility for
> problems or downtime.

I would agree, it seemed easier to understand to me as well.
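
For anyone skimming, the race described in the quoted message can be modelled
with a small toy sketch. This is not neutron code: QgPort, simulate() and the
gate_on_l3 flag are hypothetical names made up for illustration, and the
"gating" only approximates the idea behind [2] as summarized above, not the
actual patch. The only values taken from the thread are the tag 4095 and the
MLDv2 packets sent when IPv6 addresses are flushed.

```python
# Toy model of the L3 agent / OVS agent race; all names are hypothetical.
DEAD_VLAN_TAG = 4095  # tag the L3 agent sets on a new qg port, per the thread


class QgPort:
    def __init__(self):
        self.tag = DEAD_VLAN_TAG
        self.leaked = []  # packets that made it onto the wire

    def send(self, packet):
        # br-int drops traffic sent with the dead VLAN tag
        if self.tag != DEAD_VLAN_TAG:
            self.leaked.append(packet)


def l3_flush_ipv6(port):
    # flushing IPv6 addresses causes MLDv2 packets to be sent
    port.send("MLDv2 report")


def ovs_wire_port(port, tag=1):
    # the OVS agent wires the port and sets the real local VLAN tag
    port.tag = tag


def simulate(order, gate_on_l3=False):
    """Run the 'l3' and 'ovs' steps in the given order.

    With gate_on_l3=True the OVS step is deferred until the L3 step is done,
    which is the assumed effect of approach [2].
    """
    port = QgPort()
    l3_done = False
    ovs_deferred = False
    for step in order:
        if step == "l3":
            l3_flush_ipv6(port)
            l3_done = True
            if ovs_deferred:
                ovs_wire_port(port)
        elif step == "ovs":
            if gate_on_l3 and not l3_done:
                ovs_deferred = True
            else:
                ovs_wire_port(port)
    return port
```

Calling simulate(["ovs", "l3"]) leaves one leaked packet in port.leaked, while
simulate(["l3", "ovs"]) and simulate(["ovs", "l3"], gate_on_l3=True) leave
none, which is the ordering that [2] tries to enforce.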

-Brian

>> Both patches have been waiting in gerrit for a pretty long time, and I want to
>> bring more visibility to both of them. Please check them; maybe you will have
>> some opinions about which solution is better and which we should go with.
>>
>> [1] https://bugs.launchpad.net/neutron/+bug/1859832
>> [2] https://review.opendev.org/#/c/702856/
>> [3] https://review.opendev.org/#/c/707406/
>>
>> -- 
>> Slawek Kaplonski
>> Senior software engineer
>> Red Hat
>>
>>
> 
> 
