[openstack-dev] [tripleo] Is it time to reconsider how we configure OVS bridges in the overcloud?

Dan Sneddon dsneddon at redhat.com
Thu Nov 10 16:55:47 UTC 2016

On 11/10/2016 07:22 AM, Brent Eagles wrote:
> Hi all,
> A recent critical issue has compelled me to propose reconsidering
> our default and OVS-based network configuration examples:
> https://bugs.launchpad.net/tripleo/+bug/1640812 - Network connectivity
> lost on node reboot
> I've been thinking about it for a while, but you could say this bug was
> the "last straw".
> While the precise root cause of this issue is still in question, part
> of the problem is that the overcloud nodes communicate with the
> undercloud and each other through an OVS bridge which is also used by
> the overcloud neutron service for external network traffic. For several
> valid reasons, neutron sets the OVS bridge fail_mode to secure (details
> in respective man pages, etc, etc). This mode is stored persistently so
> when the system is rebooted, the bridge is recreated with the secure
> fail_mode in place, blocking network traffic - including DHCP - until
> something comes along and starts setting up flow rules to allow traffic
> to flow.  Without an IP address, the node is effectively "unplugged".
> For some reason this isn't happening 100% of the time on the current
> version of CentOS (7.2), but seems to be pretty much 100% on RHEL 7.3. 
> It raises the question of whether it is valid for neutron to modify an OVS
> bridge that it *did not create* in a fundamental way like this. If so,
> it implies a contract between the deployer and neutron that the
> deployer can make "no assumptions" about what will happen with the
> bridge once neutron has been configured to access it. If this implied
> contract is valid, required and acceptable, then bridges used for
> neutron should not be used for anything else. The implication for
> tripleo is that we should reconsider how we use OVS bridges
> for network configuration in the overcloud. For example, in single NIC
> situations, instead of having:
> (tripleo configured)
> - eth0
>   - br-ex - used for control plane access, internal api, management,
> external, etc. Neutron is also configured to use this for external
> traffic (e.g. the dataplane in our defaults), which is why the fail_mode
> gets altered
> (neutron configured)
> - br-int
> - br-tun
> we would move to something like:
> (tripleo configured)
> - eth0
>   - br-ctl - used as br-ex is currently used, except neutron knows
> nothing about it.
> - br-ex - patched to br-ctl - ostensibly for external traffic and this
> is what neutron in the overcloud is configured to use
> (neutron configured)
> - br-int
> - br-tun
> (In all cases, neutron configures patches, etc. between bridges *it
> knows about* as needed. That is, in the second case, tripleo would
> configure the patch between br-ctl and br-ex)
> At the cost of an extra bridge (ovs bridge to ovs bridge with patch
> ports is allegedly cheap btw) we get:
>  1. an independently configured bridge for overcloud traffic insulates
> non-tenant node traffic against changes to neutron, including upgrades,
> neutron bugs, etc.
>  2. insulates neutron from changes to the underlying network that it
> doesn't "care" about.
>  3. In OVS-only environments, the difference between a single nic
> environment and one where there is a dedicated nic for external traffic
> is that, instead of a patch port from br-ctl to br-ex, br-ex is directly
> connected to the nic for the external traffic.
> Even without the issue that instigated this message, I think that this
> is a change worth considering. 
> Cheers,
> Brent


Thanks for taking the time to analyze this situation. I see a couple of
potential issues with the topology you are suggesting.
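
For reference, the proposed single-NIC layout could be sketched roughly as
below. This is only an illustration, not taken from any tripleo template:
the patch port names (ctl-to-ex, ex-to-ctl) are made up, and the Python
only builds ovs-vsctl command strings without running anything.

```python
# Sketch only: builds ovs-vsctl command lines for the proposed layout.
# Nothing is executed. The patch port names are hypothetical.

def patch_pair(bridge_a, port_a, bridge_b, port_b):
    """Return the two commands that join two bridges with a patch port pair."""
    template = (
        "ovs-vsctl add-port {bridge} {port} "
        "-- set Interface {port} type=patch options:peer={peer}"
    )
    return [
        template.format(bridge=bridge_a, port=port_a, peer=port_b),
        template.format(bridge=bridge_b, port=port_b, peer=port_a),
    ]


def single_nic_layout(nic="eth0", ctl_bridge="br-ctl", ext_bridge="br-ex"):
    """Commands for: nic on br-ctl, with br-ex patched to br-ctl."""
    commands = [
        # br-ctl carries control plane traffic; neutron is not told about it.
        "ovs-vsctl add-br {0}".format(ctl_bridge),
        "ovs-vsctl add-port {0} {1}".format(ctl_bridge, nic),
        # br-ex is the only bridge neutron would be configured to use.
        "ovs-vsctl add-br {0}".format(ext_bridge),
    ]
    commands.extend(patch_pair(ctl_bridge, "ctl-to-ex", ext_bridge, "ex-to-ctl"))
    return commands


if __name__ == "__main__":
    for line in single_nic_layout():
        print(line)
```

In this sketch the physical NIC is attached only to br-ctl, so a fail_mode
change on br-ex would apply to br-ex alone.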

First of all, what about the scenario where a system has only 2x10Gb
NICs, and the operator wishes to bond these together on a single
bridge? If we require one bridge for Neutron and a separate bridge for
the control plane, then it would be impossible to configure a system
with only 2 NICs in a fault-tolerant way.
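
As a rough sketch of the two-NIC case I have in mind, both NICs are bonded
onto one shared bridge. The bridge, bond and NIC names below are assumptions
for illustration, and the Python only builds ovs-vsctl command strings
without running them.

```python
# Sketch only: command lines for one bridge with a two-NIC OVS bond.
# The names br-ex, bond1, eth0 and eth1 are illustrative assumptions.

def bonded_bridge_commands(bridge, bond, nics):
    """Return ovs-vsctl commands that create a bridge with a bonded uplink."""
    if len(nics) < 2:
        # A bond with a single member gives no fault tolerance.
        raise ValueError("an OVS bond needs at least two interfaces")
    return [
        "ovs-vsctl add-br {0}".format(bridge),
        "ovs-vsctl add-bond {0} {1} {2}".format(bridge, bond, " ".join(nics)),
    ]


if __name__ == "__main__":
    for line in bonded_bridge_commands("br-ex", "bond1", ["eth0", "eth1"]):
        print(line)
```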

Second, a large percentage of users already have a shared br-ex and
will want to upgrade. Do we tell them that, due to an architectural
change, they must now redeploy a new cloud with a new topology to use
the latest version?

So while I would be on-board with changing our default for new
installations, I don't think that relieves us of the responsibility to
figure out how to handle the edge cases where a separate bridge is not

Dan Sneddon         |  Senior Principal OpenStack Engineer
dsneddon at redhat.com |  redhat.com/openstack
dsneddon:irc        |  @dxs:twitter
