[openstack-dev] [Neutron] MTU configuration pain

Kevin Benton blak111 at gmail.com
Mon Jan 18 22:14:14 UTC 2016


Thanks for the awesome writeup.

>5) A bridge or veth pair with an IP address can participate in path MTU
discovery (PMTUD). However, these devices do not appear to understand
namespaces and originate the ICMP message from the host instead of a
namespace. Therefore, the message never reaches the destination...
typically a host outside of the deployment.

I suspect this is because we don't put the bridges into namespaces. Even if
we did do this, we would need to allocate IP addresses for every compute
node to use to chat on the network...


>At least for the Linux bridge agent, I think we can address ingress MTU
disparity (to the VM) by moving it to the first device in the chain capable
of layer-3 operations, particularly the neutron router namespace. We can
address the egress MTU disparity (from the VM) by advertising the MTU of
the overlay network to the VM via DHCP/RA or using manual interface
configuration.

So when setting up DHCP for the subnet, would telling the DHCP agent to use
an MTU we calculate based on (global MTU value - network encap overhead)
achieve what you are suggesting here?
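
As a rough sketch of that calculation (the function name and the overhead
table are hypothetical, not existing neutron code; the 50-byte VXLAN figure
comes from the writeup below, and a flat network is assumed to add nothing):

```python
# Hypothetical per-network-type encapsulation overhead, in bytes.
# Only the VXLAN value is taken from the quoted writeup.
ENCAP_OVERHEAD = {"flat": 0, "vxlan": 50}


def advertised_mtu(global_mtu: int, network_type: str) -> int:
    """Return the MTU a DHCP agent could advertise for a network."""
    mtu = global_mtu - ENCAP_OVERHEAD[network_type]
    if mtu <= 0:
        raise ValueError("global MTU too small for this encapsulation")
    return mtu


print(advertised_mtu(1500, "vxlan"))  # 1450
```

With a global MTU of 1500 this gives 1450 for a VXLAN network, which matches
the value Matt says the user should see.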

On Sun, Jan 17, 2016 at 10:30 PM, Matt Kassawara <mkassawara at gmail.com>
wrote:

> Prior attempts to solve the MTU problem in neutron simply band-aid it or
> become too complex from feature creep or edge cases that mask the primary
> goal of a simple implementation that works for most deployments. So, I ran
> some experiments to empirically determine the root cause of MTU problems in
> common neutron deployments using the Linux bridge agent. I plan to perform
> these experiments again using the Open vSwitch agent... after sufficient
> mental recovery.
>
> I highly recommend reading further, but here's the TL;DR:
>
> Observations...
>
> 1) During creation of a VXLAN interface, Linux automatically subtracts the
> VXLAN protocol overhead from the MTU of the parent interface.
> 2) A veth pair or tap with a different MTU on each end drops packets
> larger than the smaller MTU.
> 3) Linux automatically adjusts the MTU of a bridge to the lowest MTU of
> all the ports. Therefore, Linux reduces the typical bridge MTU from 1500 to
> 1450 when neutron adds a VXLAN interface to it.
> 4) A bridge with different MTUs on each port drops packets larger than the
> MTU of the bridge.
> 5) A bridge or veth pair with an IP address can participate in path MTU
> discovery (PMTUD). However, these devices do not appear to understand
> namespaces and originate the ICMP message from the host instead of a
> namespace. Therefore, the message never reaches the destination...
> typically a host outside of the deployment.
>
> Conclusion...
>
> The MTU disparity between native and overlay networks must reside in a
> device capable of layer-3 operations that can participate in PMTUD, such as
> the neutron router between a private/project overlay network and a
> public/external native network.
>
> Some background...
>
> In a typical datacenter network, MTU must remain consistent within a
> layer-2 network because fragmentation and the mechanism indicating the need
> for it occur at layer-3. In other words, all host interfaces and switch
> ports on the same layer-2 network must use the same MTU. If the layer-2
> network connects to a router, the router port must also use the same MTU. A
> router can contain ports on multiple layer-2 networks with different MTUs
> because it operates on those networks at layer-3. If the MTU changes
> between ports on a router and devices on those layer-2 networks attempt to
> communicate at layer-3, the router can perform a couple of actions. For
> IPv4, the router can fragment the packet. However, if the packet contains
> the "don't fragment" (DF) flag, the router can either silently drop the
> packet or return an ICMP "fragmentation needed" message to the sender. This
> ICMP message contains the MTU of the next layer-2 network in the route
> between the sender and receiver. Each router in the path can return these
> ICMP messages to the sender until it learns the maximum MTU for the entire
> path, also known as path MTU discovery (PMTUD). IPv6 does not support
> fragmentation by routers; only the sending host can fragment.
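
The PMTUD loop described above can be modelled with a toy simulation. This is
only a sketch under simplifying assumptions: every hop returns the ICMP
message, the sender always sets DF, and the function name is made up for
illustration.

```python
def discover_path_mtu(initial_mtu, link_mtus):
    """Toy PMTUD: resend with DF set until every hop accepts the size."""
    size = initial_mtu
    while True:
        for hop_mtu in link_mtus:
            if size > hop_mtu:
                # The router in front of this link would return ICMP
                # "fragmentation needed" carrying hop_mtu; the sender
                # lowers its packet size and tries again.
                size = hop_mtu
                break
        else:
            # No hop rejected the packet, so this is the path MTU.
            return size


print(discover_path_mtu(1500, [1500, 1450, 1400]))  # 1400
```

If any hop drops the packet silently instead of answering, the loop in a real
sender never converges, which is the failure the experiments below show.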
>
> The cloud provides a virtual extension of a physical network. In the
> simplest sense, patch cables become veth pairs, switches become bridges,
> and routers become namespaces. Therefore, MTU implementation for virtual
> networks should mimic physical networks where MTU changes must occur within
> a router at layer-3.
>
> For these experiments, my deployment contains one controller and one
> compute node. Neutron uses the ML2 plug-in and Linux bridge agent. The
> configuration does not contain any MTU options (e.g., path_mtu). One VM with
> a floating IP address attaches to a VXLAN private network that routes to a
> flat public network. The DHCP agent does not advertise MTU to the VM. My
> lab resides on public cloud infrastructure with networks that filter
> unknown MAC addresses such as those that neutron generates for virtual
> network components. Let's talk about the implications and workarounds.
>
> The VXLAN protocol contains 50 bytes of overhead. Linux automatically
> calculates the MTU of VXLAN devices by subtracting 50 bytes from the parent
> device, in this case a standard Ethernet interface with a 1500 MTU.
> However, due to the limitations of public cloud networks, I must create a
> VXLAN tunnel between the controller node and a host outside of the
> deployment to simulate traffic from a datacenter network. This tunnel
> effectively reduces the "native" MTU from 1500 to 1450. Therefore, I need
> to subtract an additional 50 bytes from neutron VXLAN network components,
> essentially emulating the 50-byte difference between conventional neutron
> VXLAN networks and native networks. The host outside of the deployment
> assumes it can send packets using a 1450 MTU. The VM also assumes it can
> send packets using a 1450 MTU because the DHCP agent does not advertise a
> 1400 MTU to it.
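
For reference, the 50 bytes and the two-step reduction in this lab can be
worked out as below. This is a sketch that assumes an IPv4 outer header
without options and no VLAN tag on the inner frame; the constant and function
names are illustrative only.

```python
# Per-packet bytes that VXLAN adds on top of the inner IP packet.
OUTER_IPV4_HEADER = 20       # assumption: IPv4 underlay, no IP options
OUTER_UDP_HEADER = 8
VXLAN_HEADER = 8
INNER_ETHERNET_HEADER = 14   # assumption: untagged inner frame

VXLAN_OVERHEAD = (OUTER_IPV4_HEADER + OUTER_UDP_HEADER
                  + VXLAN_HEADER + INNER_ETHERNET_HEADER)


def vxlan_mtu(parent_mtu: int) -> int:
    """MTU of a VXLAN device created on a parent with the given MTU."""
    return parent_mtu - VXLAN_OVERHEAD


lab_native = vxlan_mtu(1500)         # tunnel standing in for the datacenter
lab_overlay = vxlan_mtu(lab_native)  # neutron VXLAN network inside the lab
print(VXLAN_OVERHEAD, lab_native, lab_overlay)  # 50 1450 1400
```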
>
> Let's get to it!
>
> Note: The commands in these experiments often generate lengthy output, so
> please refer to the gists when necessary.
>
> First, review the OpenStack bits and resulting network components in the
> environment [1]. Also, see that a regular 'ping' works between the host
> outside of the deployment and the VM [2].
>
> [1] https://gist.github.com/ionosphere80/b78bedfc5e8300b8113e
> [2] https://gist.github.com/ionosphere80/b44358b43af13de74f1f
>
> Note: The tcpdump output in each case references up to six points: neutron
> router gateway on the public network (qg), namespace end of the veth pair
> for the neutron router interface on the private network (qr), bridge end of
> the veth pair for router interface on the private network (tap), controller
> node end of the VXLAN network (vxlan-434), compute node end of the VXLAN
> network (vxlan-434), and the bridge end of the tap for the VM (tap).
>
> A VM typically requires using SSH to access it. An MTU mismatch usually
> manifests itself as a "stuck" SSH connection. Without further
> investigation, the symptoms mistakenly lead people toward security groups.
> However, increasing the SSH client verbosity shows it connecting to the
> server and hanging somewhere during key exchange [3].
>
> [3] https://gist.github.com/ionosphere80/8ccd736bf3dda05a01a0
>
> Does the key exchange contain a packet that exceeds the MTU between the
> client and server? Yes! Looking at [4], the veth pair between the router
> namespace and private network bridge drops the packet. The MTU changes over
> a layer-2 connection without a router, similar to connecting two switches
> with different MTUs. Even if it could participate in PMTUD, the veth pair
> lacks an IP address and therefore cannot originate ICMP messages.
>
> [4] https://gist.github.com/ionosphere80/9eb0e2c0b3e780de9afc
>
> Note: If I try "conventional" MTUs (instead of assuming a maximum of 1450
> due to limitations of cloud networks) and use SSH from the controller node
> to access the VM, the VM tap interface drops key exchange packets due to an
> MTU mismatch... 1500 on the VM end and 1450 on the bridge end. Exact results
> probably vary among environments.
>
> Now we know why SSH doesn't work. The following experiments use the 'ping'
> utility because it generates much less traffic and provides a way to
> control the DF flag (-M).
>
> Note: Although the namespace end of the veth pair for the neutron router
> interface on the private network contains a 1450 MTU, it actually doesn't
> pass VXLAN traffic which means it should support a slightly larger ICMP
> payload... 4 extra bytes.
>
> Let's ping with a payload size of 1372, the maximum for a VXLAN segment
> with 1400 MTU, and look at the tcpdump output [5]. Ping operates normally.
>
> # ping -c 1 -s 1372 -M do 10.4.31.102
>
> [5] https://gist.github.com/ionosphere80/89cc8e21060e8988e46c
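
The 1372 figure is the 1400 MTU minus the IPv4 and ICMP headers. A minimal
sketch of that arithmetic, assuming an IPv4 header without options (the helper
names are made up for illustration):

```python
IPV4_HEADER = 20  # assumption: no IP options
ICMP_HEADER = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest 'ping -s' value that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def fits_without_fragmenting(payload: int, mtu: int) -> bool:
    """True if an echo request with this payload fits within the MTU."""
    return payload <= max_ping_payload(mtu)


print(max_ping_payload(1400))                # 1372
print(fits_without_fragmenting(1373, 1400))  # False
```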
>
> Let's ping with a payload size of 1373, one byte larger than the maximum
> for a VXLAN segment with 1400 MTU, and look at the tcpdump output [6]. The
> VM does not receive the packet. The private network bridge on the
> controller, only operating at layer-2, drops the packet because it exceeds
> the MTU of the vxlan-434 interface on it. Even if it could participate in
> PMTUD, the bridge lacks an IP address and therefore cannot originate ICMP
> messages.
>
> # ping -c 1 -s 1373 -M do 10.4.31.102
>
> [6] https://gist.github.com/ionosphere80/8a7aa01db29679fbad22
>
> Let's ping with a payload size of 1377, one byte larger than the maximum
> for a "bare" segment with 1400 MTU, and look at the tcpdump output [7]. The
> veth pair for the router interface on the private network, only operating
> at layer-2, drops the packet because it exceeds the MTU of the bridge end
> of it. Even if it could participate in PMTUD, the veth pair lacks an IP
> address and therefore cannot originate ICMP messages.
>
> # ping -c 1 -s 1377 -M do 10.4.31.102
>
> [7] https://gist.github.com/ionosphere80/dd2e3e24f3e94c4801a8
>
> What if we allow fragmentation?
>
> Let's ping again with a payload size of 1373, one byte larger than the
> maximum for a VXLAN segment with 1400 MTU, and look at the tcpdump output
> [8]. The vxlan-434 interface on the controller node, operating at
> layer-3, fragments the request. The vxlan-434 interface on the compute
> node, operating at layer-3, fragments the reply. Ping operates normally.
>
> # ping -c 1 -s 1373 -M dont 10.4.31.102
>
> [8] https://gist.github.com/ionosphere80/13ebcf1b67c1286012f7
>
> Let's ping again with a payload size of 1377, one byte larger than the
> maximum for a "bare" segment with 1400 MTU, and look at the tcpdump output
> [9]. The veth pair for the router interface on the private network, only
> operating at layer-2, drops the packet because it exceeds the MTU of the
> bridge end of it. Even if it could participate in PMTUD, the veth pair
> lacks an IP address and therefore cannot originate ICMP messages.
>
> # ping -c 1 -s 1377 -M dont 10.4.31.102
>
> [9] https://gist.github.com/ionosphere80/53b14343cd23a620b0ef
>
> In all of these cases, the first MTU disparity appears on a veth pair that
> only operates at layer-2 and therefore cannot participate in PMTUD,
> effectively breaking communication. What happens if we move the first MTU
> disparity to the namespace end of the veth pair for the neutron router
> interface on the private network, effectively making both ends equal with a
> 1400 MTU?
>
> # ip link set dev qr-d9e6ec95-f5 mtu 1400
>
> Let's ping again with a payload size of 1372, the maximum for a VXLAN
> segment with 1400 MTU, and look at the tcpdump output [10]. Ping operates
> normally.
>
> # ping -c 1 -s 1372 -M do 10.4.31.102
>
> [10] https://gist.github.com/ionosphere80/fd5e29d387d009611704
>
> Let's ping again with a payload size of 1373, one byte larger than the
> maximum for a VXLAN segment with 1400 MTU, and look at the tcpdump output
> [11]. The router namespace, operating at layer-3, sees the MTU discrepancy
> between the two interfaces in the namespace and returns an ICMP
> "fragmentation needed" message to the sender. The sender uses the MTU value
> in the ICMP packet to recalculate the length of the first packet and caches
> it for future packets.
>
> # ping -c 1 -s 1373 -M do 10.4.31.102
>
> [11] https://gist.github.com/ionosphere80/43ff558e077acfa92cfc
>
> The 'ip' command reveals the cached MTU value:
>
> # ip route get to 10.4.31.102
> 10.4.31.102 dev vxlan1040  src 10.4.31.1
>     cache  expires 590sec mtu 1400
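
If you want to check the cached value from a script, something like the
following could pull it out of that output. It is only a sketch: the regular
expression assumes the "mtu <number>" form shown above, and the function name
is hypothetical.

```python
import re


def cached_mtu(route_output: str):
    """Return the cached path MTU from 'ip route get' output, or None."""
    match = re.search(r"\bmtu (\d+)", route_output)
    return int(match.group(1)) if match else None


# Sample text copied from the output quoted above.
sample = """10.4.31.102 dev vxlan1040  src 10.4.31.1
    cache  expires 590sec mtu 1400"""

print(cached_mtu(sample))  # 1400
```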
>
> At least for the Linux bridge agent, I think we can address ingress MTU
> disparity (to the VM) by moving it to the first device in the chain capable
> of layer-3 operations, particularly the neutron router namespace. We can
> address the egress MTU disparity (from the VM) by advertising the MTU of
> the overlay network to the VM via DHCP/RA or using manual interface
> configuration. From a policy standpoint, only the operator should configure
> the native MTU for neutron using a global option in a configuration file,
> leaving a combination of Linux and neutron to automatically calculate the
> MTU for virtual network components and VMs. For VMs using manual interface
> configuration, the user should have read-only access to the MTU for a
> particular network that accounts for overlay protocol overhead. For
> example, the user would see 1450 for a VXLAN network.
>
> Matt
>
> On Fri, Jan 15, 2016 at 10:41 AM, Sean M. Collins <sean at coreitpro.com>
> wrote:
>
>> MTU has been an ongoing issue in Neutron for _years_.
>>
>> It's such a hassle that most people just throw up their hands and set
>> their physical infrastructure to jumbo frames. We even document it.
>>
>>
>> http://docs.openstack.org/juno/install-guide/install/apt-debian/content/neutron-network-node.html
>>
>> > Ideally, you can prevent these problems by enabling jumbo frames on
>> > the physical network that contains your tenant virtual networks. Jumbo
>> > frames support MTUs up to approximately 9000 bytes which negates the
>> > impact of GRE overhead on virtual networks.
>>
>> We've pushed this onto operators and deployers. There's a lot of
>> code in provisioning projects to handle MTUs.
>>
>> http://codesearch.openstack.org/?q=MTU&i=nope&files=&repos=
>>
>> We have mentions of it in our architecture design guide
>>
>>
>> http://git.openstack.org/cgit/openstack/openstack-manuals/tree/doc/arch-design/source/network-focus-architecture.rst#n150
>>
>> I want to get Neutron to the point where it starts discovering this
>> information and automatically configuring, in the optimistic cases. I
>> understand that it can be complex and have corner cases, but the issue
>> we have today is that it is broken in some multinode jobs; even Neutron
>> developers aren't configuring it correctly.
>>
>> I also had this discussion on the DevStack side in
>> https://review.openstack.org/#/c/112523/
>> where basically, sure we can fix it in DevStack and at the gate, but it
>> doesn't fix the problem for anyone who isn't using DevStack to deploy
>> their cloud.
>>
>> Today we have a ton of MTU configuration options sprinkled throughout the
>> L3 agent, dhcp agent, l2 agents, and at least one API extension to the
>> REST API for handling MTUs.
>>
>> So yeah, a lot of knobs and not a lot of documentation on how to make
>> this thing work correctly. I'd like to try and simplify.
>>
>>
>> Further reading:
>>
>>
>> http://techbackground.blogspot.co.uk/2013/06/path-mtu-discovery-and-gre.html
>>
>> http://lists.openstack.org/pipermail/openstack/2013-October/001778.html
>>
>>
>> https://ask.openstack.org/en/question/6140/quantum-neutron-gre-slow-performance/
>>
>>
>> https://ask.openstack.org/en/question/12499/forcing-mtu-to-1400-via-etcneutrondnsmasq-neutronconf-per-daniels/
>>
>> http://blog.systemathic.ch/2015/03/05/openstack-mtu-pitfalls-with-tunnels/
>>
>> https://twitter.com/search?q=openstack%20neutron%20MTU
>>
>> --
>> Sean M. Collins
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe:
>> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>


-- 
Kevin Benton