[openstack-dev] [Neutron][L2Pop][HA Routers] Request for comments for a possible solution
mkolesni at redhat.com
Thu Dec 18 12:06:08 UTC 2014
Hi Neutron community members.
I would like the community's comments on a proposal for fixing HA routers not
working with L2Population (bug 1365476).
This bug is important to fix, especially if we want to have HA routers and DVR
routers working together.
What's happening now?
* HA routers use distributed ports, i.e. the port with the same IP & MAC
details is applied on all nodes where an L3 agent is hosting this router.
* Currently, the port details have a binding pointing to an arbitrary node
and this is not updated.
* L2pop takes this "potentially stale" information and uses it to create:
1. A tunnel to the node.
2. An FDB entry that directs traffic for that port to that node.
3. An ARP responder entry, if ARP responder is on, so that ARP requests will
not traverse the network.
* Problem is, the master router wouldn't necessarily be running on the node
the binding points to.
This means that traffic might not reach the master node, but instead some
arbitrary node where the router is hosted but may be in another state
(standby, fail).
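To make the failure mode concrete, here is a minimal Python sketch. It is not
Neutron code: the Port class, the l2pop_fdb_entry helper, the node names and
the addresses are all hypothetical, and it only models the idea that the FDB
entry is derived from a binding that is never updated.

```python
from dataclasses import dataclass


@dataclass
class Port:
    mac: str
    ip: str
    bound_host: str  # binding set once to an arbitrary node, never updated


def l2pop_fdb_entry(port):
    """Hypothetical stand-in for what L2pop advertises for a port."""
    return {"mac": port.mac, "ip": port.ip, "remote_host": port.bound_host}


# The HA router port exists on both nodes, but the binding names only node-a.
ha_port = Port(mac="fa:16:3e:00:00:01", ip="10.0.0.1", bound_host="node-a")

# Actual HA state per node (assumed for the example): the master is on node-b.
ha_states = {"node-a": "standby", "node-b": "master"}

entry = l2pop_fdb_entry(ha_port)
delivered_to = entry["remote_host"]
reaches_master = ha_states[delivered_to] == "master"
print(delivered_to, reaches_master)
```

In this model the traffic is sent to node-a, which is in standby, so it never
reaches the master on node-b.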
What is proposed?
Basically the idea is not to do L2Pop for HA router ports that reside on the
Instead, we would create a tunnel to each node hosting the HA router so that
the normal learning switch functionality would take care of switching the
traffic to the master router.
This way no matter where the master router is currently running, the data
plane would know how to forward traffic to it.
This solution requires changes on the controller only.
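The forwarding behaviour this relies on can be sketched as follows. This is a
toy learning switch in Python, not OVS or Neutron code; the LearningSwitch
class, the node names and the MAC address are assumptions made for the example.
It shows that with a tunnel to every node hosting the HA router and no
pre-populated unicast entry, traffic is flooded until the switch sees a frame
from the router's MAC, and is re-learned after a failover.

```python
class LearningSwitch:
    """Toy model of normal learning switch behaviour over a set of tunnels."""

    def __init__(self, tunnels):
        self.tunnels = set(tunnels)
        self.mac_table = {}  # MAC address -> tunnel it was last seen on

    def receive(self, src_mac, from_tunnel):
        # Learn (or re-learn) where a source MAC currently lives.
        self.mac_table[src_mac] = from_tunnel

    def forward(self, dst_mac):
        # Known destination: send on one tunnel. Unknown: flood to all tunnels.
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]
        return sorted(self.tunnels)


ROUTER_MAC = "fa:16:3e:00:00:01"

# Tunnels exist to every node hosting the HA router; no FDB entry is injected.
sw = LearningSwitch(["node-a", "node-b"])

before = sw.forward(ROUTER_MAC)          # nothing learned yet, so flood
sw.receive(ROUTER_MAC, "node-b")         # master on node-b sends a frame
after = sw.forward(ROUTER_MAC)           # now unicast to node-b
sw.receive(ROUTER_MAC, "node-a")         # failover: new master on node-a talks
after_failover = sw.forward(ROUTER_MAC)  # switch follows the master
print(before, after, after_failover)
```

Nothing in this model consults the control plane after the tunnels are set up,
which is the point of the proposal.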
What's to gain?
* Data plane only solution, independent of the control plane.
* Lowest failover time (same as HA routers today).
* High backport potential:
  * No APIs changed/added.
  * No configuration changes.
  * No DB changes.
  * Changes localized to a single file and limited in scope.
What's the alternative?
An alternative solution would be to have the controller update the port binding
on the single port so that the plain old L2Pop happens and notifies about the
location of the master router.
This basically gives up all the benefits of the proposed solution, but applies
more widely.
This solution depends on the report-ha-router-master spec which is currently in
the implementation phase.
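For comparison, the controller-based alternative could look roughly like the
Python sketch below. It is only an illustration under assumptions: the
on_master_changed function, the port dictionary layout and the notification
tuples are hypothetical and do not reflect the actual Neutron or L2pop
interfaces.

```python
def on_master_changed(port, new_host):
    """Move the single port binding to the new master and describe the
    FDB notifications a controller would need to send (hypothetical format).
    """
    old_host = port["host"]
    if old_host == new_host:
        return []  # binding already points at the master, nothing to do
    port["host"] = new_host
    entry = (port["mac"], port["ip"])
    return [("remove", old_host, entry), ("add", new_host, entry)]


ha_port = {"mac": "fa:16:3e:00:00:01", "ip": "10.0.0.1", "host": "node-a"}

# The controller learns that node-b became master and rebinds the port.
notifications = on_master_changed(ha_port, "node-b")
print(ha_port["host"], notifications)
```

Here every failover goes through the controller before the data plane is
corrected, which is why this path is expected to be slower than the proposed
one.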
It's important to note that these two solutions don't collide and could be done
independently. The one I'm proposing just makes more sense from an HA viewpoint
because of its benefits, which fit the HA methodology of being fast & having as
little outside dependency as possible.
It could be done as an initial solution which solves the bug for mechanism
drivers that support normal learning switch (OVS), and later kept as an
optimization to the more general, controller based, solution which will solve
the issue for any mechanism driver working with L2Pop (Linux Bridge, possibly
Would love to hear your thoughts on the subject.