[openstack-dev] [Neutron] MTU configuration pain

John Griffith john.griffith8 at gmail.com
Mon Jan 18 21:06:11 UTC 2016


On Sun, Jan 17, 2016 at 8:30 PM, Matt Kassawara <mkassawara at gmail.com>
wrote:

> Prior attempts to solve the MTU problem in neutron simply band-aid it or
> become too complex from feature creep or edge cases that mask the primary
> goal of a simple implementation that works for most deployments. So, I ran
> some experiments to empirically determine the root cause of MTU problems in
> common neutron deployments using the Linux bridge agent. I plan to perform
> these experiments again using the Open vSwitch agent... after sufficient
> mental recovery.
>
> I highly recommend reading further, but here's the TL;DR:
>
> Observations...
>
> 1) During creation of a VXLAN interface, Linux automatically subtracts the
> VXLAN protocol overhead from the MTU of the parent interface (see the quick
> reproduction after this list).
> 2) A veth pair or tap with a different MTU on each end drops packets
> larger than the smaller MTU.
> 3) Linux automatically adjusts the MTU of a bridge to the lowest MTU of
> all the ports. Therefore, Linux reduces the typical bridge MTU from 1500 to
> 1450 when neutron adds a VXLAN interface to it.
> 4) A bridge with different MTUs on each port drops packets larger than the
> MTU of the bridge.
> 5) A bridge or veth pair with an IP address can participate in path MTU
> discovery (PMTUD). However, these devices do not appear to understand
> namespaces and originate the ICMP message from the host instead of a
> namespace. Therefore, the message never reaches the destination...
> typically a host outside of the deployment.
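>
> Here's a quick way to reproduce observations 1 and 3 on any Linux host (the
> device names are made up, eth0 is assumed to have a 1500 MTU, and the exact
> bridge behavior can vary by kernel version):
>
> # ip link add vxlan-demo type vxlan id 42 dev eth0 dstport 4789
> # ip link show vxlan-demo
> # ip link add br-demo type bridge
> # ip link set dev vxlan-demo master br-demo
> # ip link show br-demo
>
> The first 'ip link show' reports a 1450 MTU (1500 minus the 50 bytes of VXLAN
> overhead), and the second shows the bridge MTU dropping to 1450 once the VXLAN
> interface becomes one of its ports.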
>
> Conclusion...
>
> The MTU disparity between native and overlay networks must reside in a
> device capable of layer-3 operations that can participate in PMTUD, such as
> the neutron router between a private/project overlay network and a
> public/external native network.
>
> Some background...
>
> In a typical datacenter network, MTU must remain consistent within a
> layer-2 network because fragmentation and the mechanism indicating the need
> for it occur at layer-3. In other words, all host interfaces and switch
> ports on the same layer-2 network must use the same MTU. If the layer-2
> network connects to a router, the router port must also use the same MTU. A
> router can contain ports on multiple layer-2 networks with different MTUs
> because it operates on those networks at layer-3. If the MTU changes
> between ports on a router and devices on those layer-2 networks attempt to
> communicate at layer-3, the router can perform a couple of actions. For
> IPv4, the router can fragment the packet. However, if the packet contains
> the "don't fragment" (DF) flag, the router can either silently drop the
> packet or return an ICMP "fragmentation needed" message to the sender. This
> ICMP message contains the MTU of the next layer-2 network in the route
> between the sender and receiver. Each router in the path can return these
> ICMP messages to the sender until it learns the largest packet size usable
> along the entire path, a process known as path MTU discovery (PMTUD). IPv6
> routers never fragment packets; only the sender can, which makes PMTUD
> essential.
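>
> For example, PMTUD is easy to watch from any Linux host by pinging with the DF
> flag set or by running 'tracepath' (the address here is only a placeholder):
>
> # ping -c 1 -s 1472 -M do 198.51.100.10
> # tracepath 198.51.100.10
>
> If a smaller-MTU network sits along the path and its router behaves, ping
> reports "Frag needed and DF set" along with the next-hop MTU, and tracepath
> shows the discovered pmtu at each hop.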
>
> The cloud provides a virtual extension of a physical network. In the
> simplest sense, patch cables become veth pairs, switches become bridges,
> and routers become namespaces. Therefore, MTU implementation for virtual
> networks should mimic physical networks where MTU changes must occur within
> a router at layer-3.
>
> For these experiments, my deployment contains one controller and one
> compute node. Neutron uses the ML2 plug-in and Linux bridge agent. The
> configuration does not contain any MTU options (e.g., path_mtu). One VM with
> a floating IP address attaches to a VXLAN private network that routes to a
> flat public network. The DHCP agent does not advertise MTU to the VM. My
> lab resides on public cloud infrastructure with networks that filter
> unknown MAC addresses such as those that neutron generates for virtual
> network components. Let's talk about the implications and workarounds.
>
> The VXLAN protocol contains 50 bytes of overhead. Linux automatically
> calculates the MTU of VXLAN devices by subtracting 50 bytes from the parent
> device, in this case a standard Ethernet interface with a 1500 MTU.
> However, due to the limitations of public cloud networks, I must create a
> VXLAN tunnel between the controller node and a host outside of the
> deployment to simulate traffic from a datacenter network. This tunnel
> effectively reduces the "native" MTU from 1500 to 1450. Therefore, I need
> to subtract an additional 50 bytes from neutron VXLAN network components,
> essentially emulating the 50-byte difference between conventional neutron
> VXLAN networks and native networks. The host outside of the deployment
> assumes it can send packets using a 1450 MTU. The VM also assumes it can
> send packets using a 1450 MTU because the DHCP agent does not advertise a
> 1400 MTU to it.
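>
> Note: the tunnel to the outside host amounts to something like the following
> on the controller, where the VNI, peer address, and tunnel addressing are only
> placeholders for my actual values:
>
> # ip link add vxlan-ext type vxlan id 100 dev eth0 remote 203.0.113.10 dstport 4789
> # ip link set dev vxlan-ext up
> # ip addr add 192.0.2.1/24 dev vxlan-ext
>
> Linux derives a 1450 MTU for vxlan-ext from the 1500 MTU of eth0, which is why
> the "native" MTU between my lab and the outside host becomes 1450.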
>
> Let's get to it!
>
> Note: The commands in these experiments often generate lengthy output, so
> please refer to the gists when necessary.
>
> First, review the OpenStack bits and resulting network components in the
> environment [1]. Also, see that a regular 'ping' works between the host
> outside of the deployment and the VM [2].
>
> [1] https://gist.github.com/ionosphere80/b78bedfc5e8300b8113e
> [2] https://gist.github.com/ionosphere80/b44358b43af13de74f1f
>
> Note: The tcpdump output in each case references up to six points: neutron
> router gateway on the public network (qg), namespace end of the veth pair
> for the neutron router interface on the private network (qr), bridge end of
> the veth pair for router interface on the private network (tap), controller
> node end of the VXLAN network (vxlan-434), compute node end of the VXLAN
> network (vxlan-434), and the bridge end of the tap for the VM (tap).
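>
> The captures come from running tcpdump against each of those interfaces, along
> the lines of the following (qg and qr live inside the router namespace, so they
> require 'ip netns exec'; the router UUID is a placeholder):
>
> # tcpdump -n -e -i vxlan-434
> # ip netns exec qrouter-<router UUID> tcpdump -n -e -i qr-d9e6ec95-f5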
>
> A VM typically requires SSH for access. An MTU mismatch usually
> manifests itself as a "stuck" SSH connection. Without further
> investigation, the symptoms mistakenly lead people toward security groups.
> However, increasing the SSH client verbosity shows it connecting to the
> server and hanging somewhere during key exchange [3].
>
> [3] https://gist.github.com/ionosphere80/8ccd736bf3dda05a01a0
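>
> (For reference, the verbose output in [3] comes from something along the lines
> of 'ssh -vvv <user>@<floating IP of the VM>'.)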
>
> Does the key exchange contain a packet that exceeds the MTU between the
> client and server? Yes! Looking at [4], the veth pair between the router
> namespace and private network bridge drops the packet. The MTU changes over
> a layer-2 connection without a router, similar to connecting two switches
> with different MTUs. Even if it could participate in PMTUD, the veth pair
> lacks an IP address and therefore cannot originate ICMP messages.
>
> [4] https://gist.github.com/ionosphere80/9eb0e2c0b3e780de9afc
>
> Note: If I try "conventional" MTUs (instead of assuming a maximum of 1450
> due to limitations of cloud networks) and use SSH from the controller node
> to access the VM, the VM tap interface drops key exchange packets due to an
> MTU mismatch... 1500 on the VM end and 1450 on the bridge end. Exact results
> probably vary among environments.
>
> Now we know why SSH doesn't work. The following experiments use the 'ping'
> utility because it generates much less traffic and provides a way to
> control the DF flag (-M).
>
> Note: Although the namespace end of the veth pair for the neutron router
> interface on the private network has a 1450 MTU, it doesn't actually pass
> VXLAN traffic, which means it should support a slightly larger ICMP
> payload... 4 extra bytes.
>
> Let's ping with a payload size of 1372, the maximum for a VXLAN segment
> with 1400 MTU, and look at the tcpdump output [5]. Ping operates normally.
>
> # ping -c 1 -s 1372 -M do 10.4.31.102
>
> [5] https://gist.github.com/ionosphere80/89cc8e21060e8988e46c
>
> Let's ping with a payload size of 1373, one byte larger than the maximum
> for a VXLAN segment with 1400 MTU, and look at the tcpdump output [6]. The
> VM does not receive the packet. The private network bridge on the
> controller, only operating at layer-2, drops the packet because it exceeds
> the MTU of the vxlan-434 interface on it. Even if it could participate in
> PMTUD, the bridge lacks an IP address and therefore cannot originate ICMP
> messages.
>
> # ping -c 1 -s 1373 -M do 10.4.31.102
>
> [6] https://gist.github.com/ionosphere80/8a7aa01db29679fbad22
>
> Let's ping with a payload size of 1377, one byte larger than the maximum
> for a "bare" segment with 1400 MTU, and look at the tcpdump output [7]. The
> veth pair for the router interface on the private network, only operating
> at layer-2, drops the packet because it exceeds the MTU of the bridge end
> of it. Even if it could participate in PMTUD, the veth pair lacks an IP
> address and therefore cannot originate ICMP messages.
>
> # ping -c 1 -s 1377 -M do 10.4.31.102
>
> [7] https://gist.github.com/ionosphere80/dd2e3e24f3e94c4801a8
>
> What if we allow fragmentation?
>
> Let's ping again with a payload size of 1373, one byte larger than the
> maximum for a VXLAN segment with 1400 MTU, and look at the tcpdump output
> [8]. The vxlan-434 interface on the controller node, operating at
> layer-3, fragments the request. The vxlan-434 interface on the compute
> node, operating at layer-3, fragments the reply. Ping operates normally.
>
> # ping -c 1 -s 1373 -M dont 10.4.31.102
>
> [8] https://gist.github.com/ionosphere80/13ebcf1b67c1286012f7
>
> Let's ping again with a payload size of 1377, one byte larger than the
> maximum for a "bare" segment with 1400 MTU, and look at the tcpdump output
> [9]. The veth pair for the router interface on the private network, only
> operating at layer-2, drops the packet because it exceeds the MTU of the
> bridge end of it. Even if it could participate in PMTUD, the veth pair
> lacks an IP address and therefore cannot originate ICMP messages.
>
> # ping -c 1 -s 1377 -M dont 10.4.31.102
>
> [9] https://gist.github.com/ionosphere80/53b14343cd23a620b0ef
>
> In all of these cases, the first MTU disparity appears on a veth pair that
> only operates at layer-2 and therefore cannot participate in PMTUD,
> effectively breaking communication. What happens if we move the first MTU
> disparity to the namespace end of the veth pair for the neutron router
> interface on the private network, effectively making both ends equal with a
> 1400 MTU?
>
> # ip link set dev qr-d9e6ec95-f5 mtu 1400
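> (Because the qr end only exists inside the router namespace, this runs via
> something like 'ip netns exec qrouter-<router UUID> ip link set dev
> qr-d9e6ec95-f5 mtu 1400'.)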
>
> Let's ping again with a payload size of 1372, the maximum for a VXLAN
> segment with 1400 MTU, and look at the tcpdump output [10]. Ping operates
> normally.
>
> # ping -c 1 -s 1372 -M do 10.4.31.102
>
> [10] https://gist.github.com/ionosphere80/fd5e29d387d009611704
>
> Let's ping again with a payload size of 1373, one byte larger than the
> maximum for a VXLAN segment with 1400 MTU, and look at the tcpdump output
> [11]. The router namespace, operating at layer-3, sees the MTU discrepancy
> between the two interfaces in the namespace and returns an ICMP
> "fragmentation needed" message to the sender. The sender uses the MTU value
> in the ICMP packet to recalculate the length of the first packet and caches
> it for future packets.
>
> # ping -c 1 -s 1373 -M do 10.4.31.102
>
> [11] https://gist.github.com/ionosphere80/43ff558e077acfa92cfc
>
> The 'ip' command reveals the cached MTU value:
>
> # ip route get to 10.4.31.102
> 10.4.31.102 dev vxlan1040  src 10.4.31.1
>     cache  expires 590sec mtu 1400
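>
> (When repeating a test from a clean state, the sender can drop the cached
> entry with 'ip route flush cache'.)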
>
> At least for the Linux bridge agent, I think we can address ingress MTU
> disparity (to the VM) by moving it to the first device in the chain capable
> of layer-3 operations, particularly the neutron router namespace. We can
> address the egress MTU disparity (from the VM) by advertising the MTU of
> the overlay network to the VM via DHCP/RA or using manual interface
> configuration. From a policy standpoint, only the operator should configure
> the native MTU for neutron using a global option in a configuration file,
> leaving a combination of Linux and neutron to automatically calculate the
> MTU for virtual network components and VMs. For VMs using manual interface
> configuration, the user should have read-only access to the MTU for a
> particular network that accounts for overlay protocol overhead. For
> example, the user would see 1450 for a VXLAN network.
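>
> Note: until neutron advertises MTU automatically, the usual workaround on the
> egress side is pointing the DHCP agent at a dnsmasq configuration file that
> forces DHCP option 26 (interface MTU). The file name below is only an example
> and the value assumes VXLAN networks:
>
> In dhcp_agent.ini:
>   dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf
> In dnsmasq-neutron.conf:
>   dhcp-option-force=26,1450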
>
> Matt
>
> On Fri, Jan 15, 2016 at 10:41 AM, Sean M. Collins <sean at coreitpro.com>
> wrote:
>
>> MTU has been an ongoing issue in Neutron for _years_.
>>
>> It's such a hassle that most people just throw up their hands and set
>> their physical infrastructure to jumbo frames. We even document it.
>>
>>
>> http://docs.openstack.org/juno/install-guide/install/apt-debian/content/neutron-network-node.html
>>
>> > Ideally, you can prevent these problems by enabling jumbo frames on
>> > the physical network that contains your tenant virtual networks. Jumbo
>> > frames support MTUs up to approximately 9000 bytes which negates the
>> > impact of GRE overhead on virtual networks.
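>>
>> (In practice that means something like "ip link set dev eth1 mtu 9000" on
>> every physical interface that carries tenant tunnel traffic, plus matching
>> switch configuration; the interface name here is only an example.)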
>>
>> We've pushed this onto operators and deployers. There's a lot of
>> code in provisioning projects to handle MTUs.
>>
>> http://codesearch.openstack.org/?q=MTU&i=nope&files=&repos=
>>
>> We have mentions of it in our architecture design guide
>>
>>
>> http://git.openstack.org/cgit/openstack/openstack-manuals/tree/doc/arch-design/source/network-focus-architecture.rst#n150
>>
>> I want to get Neutron to the point where it starts discovering this
>> information and configuring itself automatically, at least in the optimistic
>> cases. I understand that it can be complex and have corner cases, but the
>> issue we have today is that it is broken in some multinode jobs, and even
>> Neutron developers have trouble configuring it correctly.
>>
>> I also had this discussion on the DevStack side in
>> https://review.openstack.org/#/c/112523/
>> where basically, sure we can fix it in DevStack and at the gate, but it
>> doesn't fix the problem for anyone who isn't using DevStack to deploy
>> their cloud.
>>
>> Today we have a ton of MTU configuration options sprinkled throughout the
>> L3 agent, dhcp agent, l2 agents, and at least one API extension to the
>> REST API for handling MTUs.
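>>
>> To make that concrete, depending on the release you end up juggling knobs
>> roughly like these (section and option names shift between releases, so
>> treat this purely as an illustration):
>>
>>   # neutron.conf
>>   [DEFAULT]
>>   network_device_mtu = 1450
>>   advertise_mtu = True
>>
>>   # ml2_conf.ini
>>   [ml2]
>>   path_mtu = 1500
>>   segment_mtu = 1500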
>>
>> So yeah, a lot of knobs and not a lot of documentation on how to make
>> this thing work correctly. I'd like to try and simplify.
>>
>>
>> Further reading:
>>
>>
>> http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html
>>
>> http://lists.openstack.org/pipermail/openstack/2013-October/001778.html
>>
>>
>> https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/
>>
>>
>> https://ask.openstack.org/en/question/12499/forcing-mtu-to-1400-via-etcneutrondnsmasq-neutronconf-per-daniels/
>>
>> http://blog.systemathic.ch/2015/03/05/openstack-mtu-pitfalls-with-tunnels/
>>
>> https://twitter.com/search?q=openstack%20neutron%20MTU
>>
>> --
>> Sean M. Collins
>>
>
So, as a non-networking expert who tried to convert/upgrade their cloud
deployments from nova-net to Neutron (AGAIN) at Liberty, I worked through
the DNS issues only to hit these MTU issues, which I eventually solved after
much frustration by manually adjusting MTU values through trial and error...
I'm trying to figure out from the above posts:
1. Does anybody actually have a solution that's being worked on/merged?
2. Are we just saying "networking is hard, deal with it"?

I've tried making the jump every release, and I have to say that things
seem to have come a long way in Liberty (at least from my perspective), but
it still seems super finicky and even a bit "magic".  Even after getting
past the MTU issues, I was so discouraged by the floating IP assignment process
that I aborted once again and went back to nova-network. :(