[openstack-dev] [tripleo] Is it time to reconsider how we configure OVS bridges in the overcloud?
Dan Sneddon
dsneddon at redhat.com
Thu Nov 10 16:55:47 UTC 2016
On 11/10/2016 07:22 AM, Brent Eagles wrote:
> Hi all,
>
>
> A recent critical issue has compelled me to propose reconsidering our
> default and OVS-based network configuration examples:
>
> https://bugs.launchpad.net/tripleo/+bug/1640812 - Network connectivity
> lost on node reboot
>
> I've been thinking about it for a while, but you could say this bug was
> the "last straw".
>
> While the precise root cause of this issue is still in question, part
> of the problem is that the overcloud nodes communicate with the
> undercloud and each other through an OVS bridge which is also used by
> the overcloud neutron service for external network traffic. For several
> valid reasons, neutron sets the OVS bridge fail_mode to secure (details
> in respective man pages, etc, etc). This mode is stored persistently so
> when the system is rebooted, the bridge is recreated with the secure
> fail_mode in place, blocking network traffic - including DHCP - until
> something comes along and starts setting up flow rules to allow traffic
> to flow. Without an IP address, the node is effectively "unplugged".
> For some reason this isn't happening 100% of the time on the current
> version of CentOS (7.2), but seems to be pretty much 100% on RHEL 7.3.
>
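For anyone who wants to check the fail_mode on an affected node, here is
a minimal Python sketch that builds the relevant ovs-vsctl invocations.
The get-fail-mode and set-fail-mode subcommands are standard ovs-vsctl
commands. The bridge name br-ex comes from the description above, while
the helper names are hypothetical. Switching the mode to standalone is
shown only to illustrate the mechanism, not as an agreed fix for the bug.

```python
import subprocess
from typing import List

VALID_FAIL_MODES = ("standalone", "secure")


def get_fail_mode_cmd(bridge: str) -> List[str]:
    """Build the ovs-vsctl command that prints a bridge's fail_mode."""
    return ["ovs-vsctl", "get-fail-mode", bridge]


def set_fail_mode_cmd(bridge: str, mode: str) -> List[str]:
    """Build the ovs-vsctl command that sets a bridge's fail_mode."""
    if mode not in VALID_FAIL_MODES:
        raise ValueError(f"unknown fail_mode: {mode!r}")
    return ["ovs-vsctl", "set-fail-mode", bridge, mode]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout (needs OVS installed and root)."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Only print the commands; call run() on a real node to execute them.
    print(" ".join(get_fail_mode_cmd("br-ex")))
    print(" ".join(set_fail_mode_cmd("br-ex", "standalone")))
```

Note that neutron may set the mode back to secure when its agent next
configures the bridge, so this is a diagnostic aid at best.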
> It raises the question of whether it is valid for neutron to modify an
> OVS bridge that it *did not create* in a fundamental way like this. If
> so, it implies a contract between the deployer and neutron that the
> deployer can make "no assumptions" about what will happen with the
> bridge once neutron has been configured to access it. If this implied
> contract is valid, required and acceptable, then bridges used for
> neutron should not be used for anything else. The implication for
> tripleo is that we should reconsider how we use OVS bridges for network
> configuration in the overcloud. For example, in single NIC situations,
> instead of having:
>
> (tripleo configured)
> - eth0
> - br-ex - used for control plane access, internal api, management,
> external, etc. In our defaults, neutron is also configured to use this
> bridge for external traffic (e.g. dataplane), which is why the
> fail_mode gets altered
>
> (neutron configured)
>
> - br-int
> - br-tun
>
> we would have something like:
> (tripleo configured)
> - eth0
> - br-ctl - used as br-ex is currently used, except that neutron knows
> nothing about it.
> - br-ex - patched to br-ctl - ostensibly for external traffic, and this
> is what neutron in the overcloud is configured to use
> (neutron configured)
> - br-int
> - br-tun
>
> (In all cases, neutron configures patches, etc. between bridges *it
> knows about* as needed. That is, in the second case, tripleo would
> configure the patch between br-ctl and br-ex)
>
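To make the br-ctl/br-ex patch concrete, here is a rough Python sketch
that builds the two ovs-vsctl commands needed to join the bridges with a
pair of patch ports. The add-port syntax with type=patch and
options:peer is standard OVS. The port names patch-ctl-to-ex and
patch-ex-to-ctl and the helper function are hypothetical, and in
practice tripleo would presumably generate this through its own network
configuration tooling rather than raw commands.

```python
from typing import List


def patch_pair_cmds(bridge_a: str, bridge_b: str,
                    port_a: str, port_b: str) -> List[List[str]]:
    """Build the ovs-vsctl commands joining two bridges with patch ports.

    Each side gets one port of type=patch whose peer option names the
    port on the other bridge.
    """
    def one_side(bridge: str, port: str, peer: str) -> List[str]:
        return [
            "ovs-vsctl", "add-port", bridge, port,
            "--", "set", "interface", port,
            "type=patch", f"options:peer={peer}",
        ]

    return [
        one_side(bridge_a, port_a, port_b),
        one_side(bridge_b, port_b, port_a),
    ]


if __name__ == "__main__":
    # Hypothetical port names; the bridge names come from the proposal.
    cmds = patch_pair_cmds("br-ctl", "br-ex",
                           "patch-ctl-to-ex", "patch-ex-to-ctl")
    for cmd in cmds:
        print(" ".join(cmd))
```

The sketch only prints the commands, so it can be read or adapted
without touching a live bridge.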
> At the cost of an extra bridge (ovs bridge to ovs bridge with patch
> ports is allegedly cheap btw) we get:
> 1. an independently configured bridge for overcloud traffic insulates
> non-tenant node traffic against changes to neutron, including upgrades,
> neutron bugs, etc.
> 2. insulates neutron from changes to the underlying network that it
> doesn't "care" about.
> 3. In OVS-only environments, the only difference between a single NIC
> environment and one with a dedicated NIC for external traffic is that
> br-ex is connected directly to the external NIC instead of through a
> patch port from br-ctl.
>
> Even without the issue that instigated this message, I think that this
> is a change worth considering.
>
>
> Cheers,
>
>
> Brent
Brent,
Thanks for taking the time to analyze this situation. I see a couple of
potential issues with the topology you are suggesting.
First of all, what about the scenario where a system has only 2x10Gb
NICs, and the operator wishes to bond these together on a single
bridge? If we require a separate bridge for Neutron from the one used
for the control plane, then it would be impossible to configure a
system with only 2 NICs in a fault-tolerant way.
Second, there will be a large percentage of users who already have a
shared br-ex and wish to upgrade. Do we tell them that due to an
architectural change, they now must redeploy a new cloud with a new
topology to use the latest version?
So while I would be on-board with changing our default for new
installations, I don't think that relieves us of the responsibility to
figure out how to handle the edge cases where a separate bridge is not
feasible.
--
Dan Sneddon | Senior Principal OpenStack Engineer
dsneddon at redhat.com | redhat.com/openstack
dsneddon:irc | @dxs:twitter