[neutron] How to fix break of connectivity in case of L3 HA after reboot of backup node
Brian Haley
haleyb.dev at gmail.com
Fri Mar 20 18:40:09 UTC 2020
On 3/20/20 11:57 AM, Nate Johnston wrote:
> On Fri, Mar 20, 2020 at 03:37:49PM +0100, Slawek Kaplonski wrote:
>> Hi,
>>
>> We have bug [1] to solve. Basically, when a node that is the backup node for some
>> router is rebooted, connectivity to the external gateway may be broken for some
>> time. This happens because when the host comes up and the L3 agent configures the
>> qrouter namespace, it flushes all IPv6 addresses from the qg- interface. As a
>> result, some MLDv2 packets are sent from this interface, which may cause the ToR
>> switch to learn the qg port's MAC address on the wrong (backup) node.
>>
>> This doesn't happen every time or for every router because it is a race between
>> the L3 agent and the OVS agent. When the L3 agent creates the qg interface in
>> br-int, it sets tag 4095 on it, and traffic sent with that VLAN tag is always
>> dropped in br-int. So if the L3 agent flushes the IPv6 addresses before the OVS
>> agent wires the port and sets the correct tag on it, all is fine. But if the OVS
>> agent is first, the MLDv2 packets are sent to the wire and connectivity breaks.
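
The ordering dependency described above can be sketched as a toy model. This is only an illustration of the race, not Neutron code: the event names and the local VLAN id are hypothetical, and the only value taken from the thread is tag 4095.

```python
DEAD_VLAN_TAG = 4095  # tag the thread says the L3 agent sets on a new qg- port


def mld_leaks(events):
    """Return True if the IPv6 flush happens while the port has a real tag.

    `events` is an ordered list of hypothetical event names:
      "ovs_wires_port"   - OVS agent replaces tag 4095 with a real local VLAN
      "l3_flushes_ipv6"  - L3 agent flushes IPv6 addresses (triggers MLDv2)
    """
    tag = DEAD_VLAN_TAG
    for event in events:
        if event == "ovs_wires_port":
            tag = 1  # hypothetical local VLAN id chosen by the OVS agent
        elif event == "l3_flushes_ipv6":
            # With tag 4095 the MLDv2 packets are dropped in br-int;
            # with any other tag they reach the wire.
            if tag != DEAD_VLAN_TAG:
                return True
    return False


# L3 agent wins the race: packets are dropped, no connectivity break.
print(mld_leaks(["l3_flushes_ipv6", "ovs_wires_port"]))
# OVS agent wins the race: packets reach the ToR switch.
print(mld_leaks(["ovs_wires_port", "l3_flushes_ipv6"]))
```
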
>>
>> Two ways of fixing this have been proposed:
>> - [2], which proposes adding some kind of "communication" between the L3 agent
>> and the OVS agent, telling the OVS agent that the tag can be changed only after
>> the IPv6 config is finished by the L3 agent.
>> The downside of this solution is that it works for the OVS agent only; the
>> Linuxbridge agent may still hit the same issue. The plus is that after the
>> initial configuration of the router, everything else related to failover is
>> handled by keepalived only, the same way as it is now.
>> - [3], which sets the qg NIC to always be DOWN on backup nodes. So when
>> keepalived fails over the active router to a new node, the L3 agent needs to
>> come and switch the interfaces to UP before it will work.
>> The plus of this solution is that it works for both the OVS and
>> Linuxbridge L2 agents (and probably for others too), but the downside is that
>> the failover process is a bit longer and there may potentially be another race
>> condition between the L3 agent and keepalived. Keepalived tries to send gARP
>> packets after the node switches to active; the first attempt will always fail
>> as the interface is still DOWN. But keepalived will retry those gARPs after
>> some time, and this should be fine if the L3 agent has already brought the
>> interface UP by then.
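
The gARP timing concern in [3] can be illustrated with a small sketch. It assumes gARP attempts happen at fixed offsets after failover and that an attempt only succeeds if the interface is already UP. The function name and all timings are hypothetical and are not taken from keepalived or from the patches.

```python
def first_successful_garp(garp_times, iface_up_at):
    """Return the time of the first gARP sent while the interface is UP.

    garp_times  - hypothetical offsets (seconds after failover) at which
                  keepalived attempts to send gARPs
    iface_up_at - hypothetical offset at which the L3 agent brings the
                  qg interface UP
    Returns None if every attempt happens while the interface is DOWN.
    """
    for t in sorted(garp_times):
        if t >= iface_up_at:
            return t
    return None


# The first attempt (t=0) fails, the retry succeeds once the interface is UP.
print(first_successful_garp([0, 5], iface_up_at=2))
# If the L3 agent is slower than the last retry, no gARP gets out at all.
print(first_successful_garp([0, 5], iface_up_at=6))
```

In this model the effective failover delay is the gap between failover and the first retry that lands after the interface comes UP, which is the window the replies in this thread are concerned about.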
>
> Personally I find [2] more appealing. I think that if we find many linuxbridge
> users hitting this issue then we can replicate the solution for linuxbridge at
> that time, but until then let's not worry about it - the majority of users use
> OVS. And the gARP time gap in solution #3 seems to me like a possibility for
> problems or downtime.
I would agree, it seemed easier to understand to me as well.
-Brian
>> Both patches have been waiting in gerrit for a pretty long time, and I want to
>> bring more visibility to both of them. Please check them; maybe you will have
>> some opinions about which solution would be better and which we should go with.
>>
>> [1] https://bugs.launchpad.net/neutron/+bug/1859832
>> [2] https://review.opendev.org/#/c/702856/
>> [3] https://review.opendev.org/#/c/707406/
>>
>> --
>> Slawek Kaplonski
>> Senior software engineer
>> Red Hat
>>
>>
>
>
More information about the openstack-discuss mailing list