[openstack-dev] [Neutron] MTU configuration pain

Matt Kassawara mkassawara at gmail.com
Mon Jan 18 03:30:25 UTC 2016


Prior attempts to solve the MTU problem in neutron simply band-aid it or
become too complex from feature creep or edge cases that mask the primary
goal of a simple implementation that works for most deployments. So, I ran
some experiments to empirically determine the root cause of MTU problems in
common neutron deployments using the Linux bridge agent. I plan to perform
these experiments again using the Open vSwitch agent... after sufficient
mental recovery.

I highly recommend reading further, but here's the TL;DR:

Observations...

1) During creation of a VXLAN interface, Linux automatically subtracts the
VXLAN protocol overhead from the MTU of the parent interface.
2) A veth pair or tap with a different MTU on each end drops packets larger
than the smaller MTU.
3) Linux automatically adjusts the MTU of a bridge to the lowest MTU of all
the ports. Therefore, Linux reduces the typical bridge MTU from 1500 to
1450 when neutron adds a VXLAN interface to it.
4) A bridge with different MTUs on each port drops packets larger than the
MTU of the bridge.
5) A bridge or veth pair with an IP address can participate in path MTU
discovery (PMTUD). However, these devices do not appear to understand
namespaces and originate the ICMP message from the host instead of a
namespace. Therefore, the message never reaches the destination...
typically a host outside of the deployment.
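Observation 1 is easy to sanity-check with arithmetic. For a VXLAN tunnel
over an IPv4 underlay, the 50 bytes of overhead break down as outer IPv4
(20) + UDP (8) + VXLAN header (8) + inner Ethernet (14). A quick sketch
(an IPv6 underlay would add another 20 bytes, which I don't use here):

```shell
# VXLAN overhead over an IPv4 underlay:
# outer IPv4 (20) + UDP (8) + VXLAN header (8) + inner Ethernet (14)
parent_mtu=1500
overhead=$((20 + 8 + 8 + 14))
echo $((parent_mtu - overhead))   # prints 1450, the MTU Linux assigns
                                  # to a new vxlan device automatically
```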

Conclusion...

The MTU disparity between native and overlay networks must reside in a
device capable of layer-3 operations that can participate in PMTUD, such as
the neutron router between a private/project overlay network and a
public/external native network.

Some background...

In a typical datacenter network, MTU must remain consistent within a
layer-2 network because fragmentation and the mechanism indicating the need
for it occur at layer-3. In other words, all host interfaces and switch
ports on the same layer-2 network must use the same MTU. If the layer-2
network connects to a router, the router port must also use the same MTU. A
router can contain ports on multiple layer-2 networks with different MTUs
because it operates on those networks at layer-3. If the MTU changes
between ports on a router and devices on those layer-2 networks attempt to
communicate at layer-3, the router can perform a couple of actions. For
IPv4, the router can fragment the packet. However, if the packet contains
the "don't fragment" (DF) flag, the router can either silently drop the
packet or return an ICMP "fragmentation needed" message to the sender. This
ICMP message contains the MTU of the next layer-2 network in the route
between the sender and receiver. Each router in the path can return these
ICMP messages to the sender until it learns the maximum MTU for the entire
path, also known as path MTU discovery (PMTUD). IPv6 routers never
fragment packets, so PMTUD is mandatory for IPv6.
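Put another way, the path MTU is simply the smallest MTU among the layer-2
segments along the route, which the sender learns one ICMP message at a
time. A toy sketch of that convergence (the hop values are hypothetical,
for illustration only):

```shell
# PMTUD converges on the minimum MTU along the route; each ICMP
# "fragmentation needed" message lowers the sender's cached value.
hop_mtus="1500 1450 1400 1500"
path_mtu=65535
for mtu in $hop_mtus; do
    if [ "$mtu" -lt "$path_mtu" ]; then path_mtu=$mtu; fi
done
echo "$path_mtu"   # prints 1400
```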

The cloud provides a virtual extension of a physical network. In the
simplest sense, patch cables become veth pairs, switches become bridges,
and routers become namespaces. Therefore, MTU implementation for virtual
networks should mimic physical networks where MTU changes must occur within
a router at layer-3.

For these experiments, my deployment contains one controller and one
compute node. Neutron uses the ML2 plug-in and Linux bridge agent. The
configuration does not contain any MTU options (e.g., path_mtu). One VM with
a floating IP address attaches to a VXLAN private network that routes to a
flat public network. The DHCP agent does not advertise MTU to the VM. My
lab resides on public cloud infrastructure with networks that filter
unknown MAC addresses such as those that neutron generates for virtual
network components. Let's talk about the implications and workarounds.

The VXLAN protocol contains 50 bytes of overhead. Linux automatically
calculates the MTU of VXLAN devices by subtracting 50 bytes from the parent
device, in this case a standard Ethernet interface with a 1500 MTU.
However, due to the limitations of public cloud networks, I must create a
VXLAN tunnel between the controller node and a host outside of the
deployment to simulate traffic from a datacenter network. This tunnel
effectively reduces the "native" MTU from 1500 to 1450. Therefore, I need
to subtract an additional 50 bytes from neutron VXLAN network components,
essentially emulating the 50-byte difference between conventional neutron
VXLAN networks and native networks. The host outside of the deployment
assumes it can send packets using a 1450 MTU. The VM also assumes it can
send packets using a 1450 MTU because the DHCP agent does not advertise a
1400 MTU to it.
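To keep the layers straight, here is the arithmetic for this lab. The
outer tunnel to the host outside the deployment is my workaround for the
public cloud, not part of a typical deployment:

```shell
physical_mtu=1500                   # instance NIC on the public cloud
native_mtu=$((physical_mtu - 50))   # outer VXLAN tunnel: simulated
                                    # "native" datacenter network
overlay_mtu=$((native_mtu - 50))    # neutron VXLAN private network
echo "native=$native_mtu overlay=$overlay_mtu"
# prints native=1450 overlay=1400
```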

Let's get to it!

Note: The commands in these experiments often generate lengthy output, so
please refer to the gists when necessary.

First, review the OpenStack bits and resulting network components in the
environment [1]. Also, see that a regular 'ping' works between the host
outside of the deployment and the VM [2].

[1] https://gist.github.com/ionosphere80/b78bedfc5e8300b8113e
[2] https://gist.github.com/ionosphere80/b44358b43af13de74f1f

Note: The tcpdump output in each case references up to six points: neutron
router gateway on the public network (qg), namespace end of the veth pair
for the neutron router interface on the private network (qr), bridge end of
the veth pair for the router interface on the private network (tap),
controller
node end of the VXLAN network (vxlan-434), compute node end of the VXLAN
network (vxlan-434), and the bridge end of the tap for the VM (tap).

Access to a VM typically requires SSH. An MTU mismatch usually manifests
itself as a "stuck" SSH connection. Without further investigation, the
symptoms mistakenly lead people toward security groups.
However, increasing the SSH client verbosity shows it connecting to the
server and hanging somewhere during key exchange [3].

[3] https://gist.github.com/ionosphere80/8ccd736bf3dda05a01a0

Does the key exchange contain a packet that exceeds the MTU between the
client and server? Yes! Looking at [4], the veth pair between the router
namespace and private network bridge drops the packet. The MTU changes over
a layer-2 connection without a router, similar to connecting two switches
with different MTUs. Even if it could participate in PMTUD, the veth pair
lacks an IP address and therefore cannot originate ICMP messages.

[4] https://gist.github.com/ionosphere80/9eb0e2c0b3e780de9afc

Note: If I try "conventional" MTUs (instead of assuming a maximum of 1450
due to limitations of cloud networks) and use SSH from the controller node
to access the VM, the VM tap interface drops key exchange packets due to an
MTU mismatch... 1500 on the VM end and 1450 on the bridge end. Exact
results probably vary among environments.

Now we know why SSH doesn't work. The following experiments use the 'ping'
utility because it generates much less traffic and provides a way to
control the DF flag (-M).

Note: Although the namespace end of the veth pair for the neutron router
interface on the private network contains a 1450 MTU, it does not actually
pass VXLAN traffic, which means it should support a slightly larger ICMP
payload... 4 extra bytes.

Let's ping with a payload size of 1372, the maximum for a VXLAN segment
with 1400 MTU, and look at the tcpdump output [5]. Ping operates normally.

# ping -c 1 -s 1372 -M do 10.4.31.102

[5] https://gist.github.com/ionosphere80/89cc8e21060e8988e46c
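The 1372-byte figure comes from subtracting the IPv4 header (20 bytes) and
the ICMP echo header (8 bytes) from the 1400-byte overlay MTU:

```shell
# Maximum ping payload that fits in one packet on the overlay network.
overlay_mtu=1400
max_payload=$((overlay_mtu - 20 - 8))   # IPv4 header + ICMP echo header
echo "$max_payload"                     # prints 1372
```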

Let's ping with a payload size of 1373, one byte larger than the maximum
for a VXLAN segment with 1400 MTU, and look at the tcpdump output [6]. The
VM does not receive the packet. The private network bridge on the
controller, only operating at layer-2, drops the packet because it exceeds
the MTU of the vxlan-434 interface on it. Even if it could participate in
PMTUD, the bridge lacks an IP address and therefore cannot originate ICMP
messages.

# ping -c 1 -s 1373 -M do 10.4.31.102

[6] https://gist.github.com/ionosphere80/8a7aa01db29679fbad22

Let's ping with a payload size of 1377, one byte larger than the maximum
for a "bare" segment with 1400 MTU, and look at the tcpdump output [7]. The
veth pair for the router interface on the private network, only operating
at layer-2, drops the packet because it exceeds the MTU of the bridge end
of it. Even if it could participate in PMTUD, the veth pair lacks an IP
address and therefore cannot originate ICMP messages.

# ping -c 1 -s 1377 -M do 10.4.31.102

[7] https://gist.github.com/ionosphere80/dd2e3e24f3e94c4801a8

What if we allow fragmentation?

Let's ping again with a payload size of 1373, one byte larger than the
maximum for a VXLAN segment with 1400 MTU, and look at the tcpdump output
[8]. The vxlan-434 interface on the controller node, operating at
layer-3, fragments the request. The vxlan-434 interface on the compute
node, operating at layer-3, fragments the reply. Ping operates normally.

# ping -c 1 -s 1373 -M dont 10.4.31.102

[8] https://gist.github.com/ionosphere80/13ebcf1b67c1286012f7

Let's ping again with a payload size of 1377, one byte larger than the
maximum for a "bare" segment with 1400 MTU, and look at the tcpdump output
[9]. The veth pair for the router interface on the private network, only
operating at layer-2, drops the packet because it exceeds the MTU of the
bridge end of it. Even if it could participate in PMTUD, the veth pair
lacks an IP address and therefore cannot originate ICMP messages.

# ping -c 1 -s 1377 -M dont 10.4.31.102

[9] https://gist.github.com/ionosphere80/53b14343cd23a620b0ef

In all of these cases, the first MTU disparity appears on a veth pair that
only operates at layer-2 and therefore cannot participate in PMTUD,
effectively breaking communication. What happens if we move the first MTU
disparity to the namespace end of the veth pair for the neutron router
interface on the private network, effectively making both ends equal with a
1400 MTU?

# ip link set dev qr-d9e6ec95-f5 mtu 1400

Let's ping again with a payload size of 1372, the maximum for a VXLAN
segment with 1400 MTU, and look at the tcpdump output [10]. Ping operates
normally.

# ping -c 1 -s 1372 -M do 10.4.31.102

[10] https://gist.github.com/ionosphere80/fd5e29d387d009611704

Let's ping again with a payload size of 1373, one byte larger than the
maximum for a VXLAN segment with 1400 MTU, and look at the tcpdump output
[11]. The router namespace, operating at layer-3, sees the MTU discrepancy
between the two interfaces in the namespace and returns an ICMP
"fragmentation needed" message to the sender. The sender uses the MTU value
in the ICMP packet to recalculate the length of the first packet and caches
it for future packets.

# ping -c 1 -s 1373 -M do 10.4.31.102

[11] https://gist.github.com/ionosphere80/43ff558e077acfa92cfc

The 'ip' command reveals the cached MTU value:

# ip route get to 10.4.31.102
10.4.31.102 dev vxlan1040  src 10.4.31.1
    cache  expires 590sec mtu 1400

At least for the Linux bridge agent, I think we can address ingress MTU
disparity (to the VM) by moving it to the first device in the chain capable
of layer-3 operations, particularly the neutron router namespace. We can
address the egress MTU disparity (from the VM) by advertising the MTU of
the overlay network to the VM via DHCP/RA or using manual interface
configuration. From a policy standpoint, only the operator should configure
the native MTU for neutron using a global option in a configuration file,
leaving a combination of Linux and neutron to automatically calculate the
MTU for virtual network components and VMs. For VMs using manual interface
configuration, the user should have read-only access to the MTU for a
particular network that accounts for overlay protocol overhead. For
example, the user would see 1450 for a VXLAN network.
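For completeness, the usual interim workaround for the egress side today is
advertising the MTU through the dnsmasq instance that the DHCP agent
spawns. A sketch of that configuration, assuming these file paths and this
lab's 1400-byte overlay MTU (DHCP option 26 is the standard interface-MTU
option):

```
# /etc/neutron/dhcp_agent.ini (assumed path)
dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf

# /etc/neutron/dnsmasq-neutron.conf
dhcp-option-force=26,1400
```

With automatic MTU calculation in neutron, operators should not need to
maintain fragments like this by hand.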

Matt

On Fri, Jan 15, 2016 at 10:41 AM, Sean M. Collins <sean at coreitpro.com>
wrote:

> MTU has been an ongoing issue in Neutron for _years_.
>
> It's such a hassle, that most people just throw up their hands and set
> their physical infrastructure to jumbo frames. We even document it.
>
>
> http://docs.openstack.org/juno/install-guide/install/apt-debian/content/neutron-network-node.html
>
> > Ideally, you can prevent these problems by enabling jumbo frames on
> > the physical network that contains your tenant virtual networks. Jumbo
> > frames support MTUs up to approximately 9000 bytes which negates the
> > impact of GRE overhead on virtual networks.
>
> We've pushed this onto operators and deployers. There's a lot of
> code in provisioning projects to handle MTUs.
>
> http://codesearch.openstack.org/?q=MTU&i=nope&files=&repos=
>
> We have mentions to it in our architecture design guide
>
>
> http://git.openstack.org/cgit/openstack/openstack-manuals/tree/doc/arch-design/source/network-focus-architecture.rst#n150
>
> I want to get Neutron to the point where it starts discovering this
> information and automatically configuring, in the optimistic cases. I
> understand that it can be complex and have corner cases, but the issue
> we have today is that it is broken in some multinode jobs; not even
> Neutron developers are configuring it correctly.
>
> I also had this discussion on the DevStack side in
> https://review.openstack.org/#/c/112523/
> where basically, sure we can fix it in DevStack and at the gate, but it
> doesn't fix the problem for anyone who isn't using DevStack to deploy
> their cloud.
>
> Today we have a ton of MTU configuration options sprinkled throughout the
> L3 agent, dhcp agent, l2 agents, and at least one API extension to the
> REST API for handling MTUs.
>
> So yeah, a lot of knobs and not a lot of documentation on how to make
> this thing work correctly. I'd like to try and simplify.
>
>
> Further reading:
>
>
> http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html
>
> http://lists.openstack.org/pipermail/openstack/2013-October/001778.html
>
>
> https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/
>
>
> https://ask.openstack.org/en/question/12499/forcing-mtu-to-1400-via-etcneutrondnsmasq-neutronconf-per-daniels/
>
> http://blog.systemathic.ch/2015/03/05/openstack-mtu-pitfalls-with-tunnels/
>
> https://twitter.com/search?q=openstack%20neutron%20MTU
>
> --
> Sean M. Collins
>