[openstack-dev] [tripleo] Is it time to reconsider how we configure OVS bridges in the overcloud?

Brent Eagles beagles at redhat.com
Thu Nov 10 15:22:58 UTC 2016


Hi all,


A recent critical issue has compelled me to propose reconsidering our
default and OVS-based network configuration examples:

https://bugs.launchpad.net/tripleo/+bug/1640812 - Network connectivity lost
on node reboot

I've been thinking about it for a while, but you could say this bug was the
"last straw".

While the precise root cause of this issue is still in question, part of
the problem is that the overcloud nodes communicate with the undercloud and
each other through an OVS bridge which is also used by the overcloud
neutron service for external network traffic. For several valid reasons,
neutron sets the OVS bridge fail_mode to secure (details in respective man
pages, etc, etc). This mode is stored persistently so when the system is
rebooted, the bridge is recreated with the secure fail_mode in place,
blocking network traffic - including DHCP - until something comes along and
starts setting up flow rules to allow traffic to flow.  Without an IP
address, the node is effectively "unplugged". For some reason this isn't
happening 100% of the time on the current version of CentOS (7.2), but
seems to be pretty much 100% on RHEL 7.3.
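
To check whether a node is in this state, the persisted fail_mode can be
read back with "ovs-vsctl get-fail-mode". A minimal sketch follows; the
helper names are made up for illustration, and bridge_is_secure() assumes
it runs on a node with Open vSwitch installed and sufficient privileges:

```python
import subprocess


def get_fail_mode_cmd(bridge):
    """Build the ovs-vsctl command that prints a bridge's fail_mode."""
    return ["ovs-vsctl", "get-fail-mode", bridge]


def bridge_is_secure(bridge):
    """Return True if the bridge's persisted fail_mode is 'secure'.

    ovs-vsctl prints 'secure', 'standalone', or an empty line when the
    fail_mode is unset.
    """
    result = subprocess.run(
        get_fail_mode_cmd(bridge),
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout.strip() == "secure"
```

For example, bridge_is_secure("br-ex") returning True on a freshly
rebooted node would be consistent with the behaviour described above.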

It raises the question of whether it is valid for neutron to modify an OVS
bridge that it *did not create* in a fundamental way like this. If so, it
implies a contract between the deployer and neutron that the deployer can
make "no assumptions" about what will happen with the bridge once neutron
has been configured to access it. If this implied contract is valid,
required and acceptable, then bridges used for neutron should not be used
for anything else. The implication for tripleo is that we should reconsider
how we use OVS bridges for network configuration in the overcloud. For
example, in single NIC situations, instead of having:

(tripleo configured)
- eth0
  - br-ex - used for control plane access, internal api, management,
    external, etc. Neutron is also configured to use this for the
    external traffic (e.g. dataplane) in our defaults, which is why the
    fail_mode gets altered.

(neutron configured)

- br-int
- br-tun

we would have something like:

(tripleo configured)
- eth0
  - br-ctl - used as br-ex is currently used, except neutron knows nothing
    about it.
- br-ex - patched to br-ctl - ostensibly for external traffic, and this is
  what neutron in the overcloud is configured to use

(neutron configured)
- br-int
- br-tun

(In all cases, neutron configures patches, etc. between bridges *it knows
about* as needed. That is, in the second case, tripleo would configure the
patch between br-ctl and br-ex)
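
To make the second layout concrete, here is a rough sketch of the
ovs-vsctl commands it boils down to. This is illustrative only: tripleo
would express this through its network configuration templates rather
than raw commands, the function names and the "<bridge>-patch" port
naming are my own invention, and moving the IP configuration from eth0 to
br-ctl is left out entirely.

```python
def patch_pair_cmds(bridge_a, bridge_b):
    """Commands creating a patch port on each bridge, peered together."""
    port_a = bridge_a + "-patch"  # hypothetical naming scheme
    port_b = bridge_b + "-patch"
    return [
        ["ovs-vsctl", "--may-exist", "add-port", bridge_a, port_a,
         "--", "set", "Interface", port_a,
         "type=patch", "options:peer=" + port_b],
        ["ovs-vsctl", "--may-exist", "add-port", bridge_b, port_b,
         "--", "set", "Interface", port_b,
         "type=patch", "options:peer=" + port_a],
    ]


def single_nic_layout_cmds(nic="eth0", ctl="br-ctl", ext="br-ex"):
    """Commands for the proposed single NIC layout.

    The NIC is attached to the control bridge only; the external bridge
    reaches the wire through the patch pair. The fail_mode is never
    touched here, so whatever neutron later sets on the external bridge
    stays confined to that bridge.
    """
    return [
        ["ovs-vsctl", "--may-exist", "add-br", ctl],
        ["ovs-vsctl", "--may-exist", "add-port", ctl, nic],
        ["ovs-vsctl", "--may-exist", "add-br", ext],
    ] + patch_pair_cmds(ctl, ext)
```

Each returned list could be handed to subprocess.run() on a node; building
the commands separately from running them keeps the sketch easy to
inspect without touching a real switch.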

At the cost of an extra bridge (ovs bridge to ovs bridge with patch ports
is allegedly cheap btw) we get:
 1. An independently configured bridge for overcloud traffic, which
insulates non-tenant node traffic against changes to neutron, including
upgrades, neutron bugs, etc.
 2. Insulation of neutron from changes to the underlying network that it
doesn't "care" about.
 3. In OVS-only environments, the only difference between a single NIC
environment and one with a dedicated NIC for external traffic is that
br-ex is connected directly to the NIC for the external traffic instead
of being patched to br-ctl.

Even without the issue that instigated this message, I think that this is a
change worth considering.


Cheers,


Brent