[openstack-dev] [Neutron] MTU configuration pain

Ian Wells ijw.ubuntu at cack.org.uk
Mon Jan 25 03:27:01 UTC 2016


I wrote the spec for the MTU work that's in the Neutron API today.  It
haunts my nightmares.  I learned so many nasty corner cases for MTU, and
you're treading that same dark path.

I'd first like to point out a few things that change the implications of
what you're reporting in strange ways. [1] points out even more strange
ways, but these are the notable ones from what I've been reading here...

RFC7348: "VTEPs MUST NOT fragment VXLAN packets. ... The destination VTEP
MAY silently discard such VXLAN fragments."  The VXLAN VTEP implementations
we use today may fragment, but that's not according to the RFC, and I
wouldn't rely on every implementation you come across knowing to do it.
So, the largest L2 packet you can send over VXLAN is a function of path MTU.
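
For reference, the 50 bytes of VXLAN overhead quoted throughout this
thread break down, for an IPv4 underlay, as:

  outer IPv4 header       20 bytes
  outer UDP header         8 bytes
  VXLAN header             8 bytes
  inner Ethernet header   14 bytes
  -------------------------------
  total                   50 bytes

so a 1500 MTU fabric leaves 1450 for the tenant network, a 9000 MTU fabric
leaves 8950, and an IPv6 underlay costs 20 bytes more.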

Even if your VTEP does fragment, you actively want to avoid that
fragmentation, because - in the typical case of bulk TCP transfers using
max-MTU packets - you're *invisibly* splitting each packet in two and
adding about 80 bytes of overhead in the process, then reassembling them
at the far end.  You've just explicitly guaranteed that, just as you send
the most data, your connection will slow down.  And the MTU problem is
undetectable to the VMs, which can't find out that a VXLAN-encapped packet
has been fragmented: the packet *they* sent didn't fragment, but the one
it's carried in did, and the fragmentation didn't even happen at an L3
node in the virtual network, so DF and therefore PMTUD won't work.

Path MTU is not fixed, because your path can vary according to network
weather (failures, congestion, whatever).  It's an oddity, and perhaps a
rarity, but you can get many weirdnesses: you fail over from one link to a
link with a smaller MTU and the path MTU shrinks; some switches are jumbo
frame and some aren't, so the path MTU might vary from host to host; and so
on.  Granted, these are weird cases, but the point here is that OpenStack
cannot *discover* this number.  An installer might attempt something,
knowing how to read switch config; or it might attempt to validate a number
it's been given, as best it can; but even then it's best effort, not a
guarantee.  For all these reasons, the only way to really get the minimum
path MTU is from the operator themselves, which is why this is a
configuration parameter to Neutron (path_mtu).
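
(For the record, that parameter lands in the ML2 configuration; an
illustrative setting for a 9000-byte fabric would be

  [ml2]
  path_mtu = 9000

in ml2_conf.ini.)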

The aims of the changes in the spec [1] were threefold:

1. To ensure that an app that absolutely required a certain minimum MTU to
operate could guarantee it would receive it.
2. To allow the network to say what the MTU was, so that the VM could be
programmed accordingly.
3. To ensure that the MTU for the network would - by default - settle on
the optimal value, per all the stuff above.

So what could we do in this environment to improve matters?

1. We should advertise the MTU in the RA and DHCP messages that OpenStack
sends; a sketch follows after this list.  I thought we'd already done this
work, but this thread suggests not.

[Note, though, that you can't reliably set an MTU higher than 1500 on IPv6
using an RA, thanks to RFC4861 referencing RFC2464, which goes with the
standard, but not the practice, that the biggest Ethernet packet is 1500
bytes.  You've been violating the standard all these years, you bad
people.  Unfortunately, Linux enforces this RA rule, albeit slightly
strangely.]

2. We should also put the MTU in any config-drive settings for VMs that
don't respect such things in DHCP and RAs, or don't do DHCP.  This is
Nova-side, reacting to the MTU property of the network.

3. Installers should determine the appropriate MTU settings on interfaces
and ensure they're set.  OpenStack can't do this in some cases (VXLAN - no
interfaces) and probably shouldn't in others (VLAN - the interface MTU is
an input to the MTU selection algorithm above, and the installer should
set the interface MTU to match what the operator says the fabric MTU is).

4. We need to check the Neutron network drivers to see which ones are
accepting, but not properly respecting, the MTU setting on the network;
the one-liner after this list helps here.  I suspect we're short of
testing to make sure that veths, bridges, switches and so on are all
correctly configured.
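
To make item 1 concrete - a sketch in raw dnsmasq/radvd terms, not a claim
about what the Neutron agents emit today - advertising a 1450-byte MTU
over DHCP is option 26, "interface MTU", in dnsmasq configuration:

  dhcp-option-force=26,1450

and the RA-side equivalent in radvd syntax (interface name illustrative):

  interface qr-xxx
  {
      AdvSendAdvert on;
      AdvLinkMTU 1450;
      prefix fd00:100:52:1::/64 { };
  };

(1450 rather than anything bigger, per the RA note above.)  And for item
4, a quick way to eyeball whether every device along the path got the memo
is:

  ip -o link show | awk '{print $2, $5}'

which prints each device and its MTU; run it on the host and again inside
each qdhcp-* and qrouter-* namespace with 'ip netns exec'.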

-- 
Ian.

[1] https://review.openstack.org/#/c/105989/ and
https://github.com/openstack/neutron-specs/blob/master/specs/kilo/mtu-selection-and-advertisement.rst


On 22 January 2016 at 19:13, Matt Kassawara <mkassawara at gmail.com> wrote:

> The fun continues, now using an OpenStack deployment on physical hardware
> that supports jumbo frames with 9000 MTU and IPv4/IPv6. This experiment
> still uses Linux bridge for consistency. I'm planning to run similar
> experiments with Open vSwitch and Open Virtual Network (OVN) in the next
> week.
>
> I highly recommend reading further, but here's the TL;DR: using physical
> network interfaces with MTUs larger than 1500 reveals an additional problem
> with the veth pair for the neutron router interface on the public network.
> Additionally, the IP protocol version does not impact MTU calculation for
> Linux bridge.
>
> First, review the OpenStack bits and resulting network components in the
> environment [1]. In the first experiment, public cloud network limitations
> prevented truly seeing how Linux bridge (actually the kernel) handles
> physical network interfaces with MTUs larger than 1500. In this experiment,
> we see that it automatically calculates the proper MTU for bridges and
> VXLAN interfaces using the MTU of parent devices. Also, see that a regular
> 'ping' works between the host outside of the deployment and the VM [2].
>
> [1] https://gist.github.com/ionosphere80/a3725066386d8ca4c6d7
> [2] https://gist.github.com/ionosphere80/a8d601a356ac6c6274cb
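>
> (As a quick illustration of that auto-calculation - the names and VNI here
> are illustrative, not taken from the gists:
>
> # ip link add vxlan-557 type vxlan id 557 dev eth1 dstport 4789
> # ip link show vxlan-557 | grep -o 'mtu [0-9]*'
> mtu 8950
>
> i.e. the kernel derives 8950 from the 9000 MTU of eth1 minus 50 bytes of
> encapsulation headroom.)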
>
> Note: The tcpdump output in each case references up to six points: neutron
> router gateway on the public network (qg), namespace end of the veth pair
> for the neutron router interface on the private network (qr), bridge end of
> the veth pair for router interface on the private network (tap), controller
> node end of the VXLAN network (underlying interface), compute node end of
> the VXLAN network (underlying interface), and the bridge end of the tap for
> the VM (tap).
>
> In the first experiment, SSH "stuck" because of an MTU mismatch on the veth
> pair between the router namespace and private network bridge. In this
> experiment, SSH works because the VM network interface uses a 1500 MTU and
> all devices along the path between the host and VM use a 1500 or larger
> MTU. So, let's configure the VM network interface to use the proper MTU of
> 9000 minus the VXLAN protocol overhead of 50 bytes... 8950... and try SSH
> again.
>
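> (The change inside the VM is just something like
>
> # ip link set dev eth0 mtu 8950
>
> after which eth0 reports the new MTU:)
>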
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast qlen
> 1000
>     link/ether fa:16:3e:46:ac:d3 brd ff:ff:ff:ff:ff:ff
>     inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0
>     inet6 fd00:100:52:1:f816:3eff:fe46:acd3/64 scope global dynamic
>        valid_lft 86395sec preferred_lft 14395sec
>     inet6 fe80::f816:3eff:fe46:acd3/64 scope link
>        valid_lft forever preferred_lft forever
>
> SSH doesn't work with IPv4 or IPv6. Adding a slight twist to the first
> experiment, I don't even see the large packet traversing the neutron
> router gateway on the public network. So, I began a tcpdump closer to the
> source on the bridge end of the veth pair for the neutron router
> interface on the public network.
>
> Looking at [3], the veth pair between the router namespace and public
> network bridge drops the packet. The MTU changes over a layer-2 connection
> without a router in between, similar to connecting two switches with
> different MTUs. The veth pair lacks an IP address, so it cannot originate
> ICMP messages and therefore cannot participate in PMTUD.
>
> [3] https://gist.github.com/ionosphere80/ec83d0955c79b05ea381
>
> Using observations from the first experiment, let's configure the MTU of
> the interfaces in the qrouter namespace to match the other end of their
> respective veth pairs. The public network (gateway) interface MTU becomes
> 9000 and the private network router interfaces (IPv4 and IPv6) become 8950.
>
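> (A sketch of those commands, with the router UUID elided as <uuid>:
>
> # ip netns exec qrouter-<uuid> ip link set dev qg-7bbe8e38-cc mtu 9000
> # ip netns exec qrouter-<uuid> ip link set dev qr-49b27408-04 mtu 8950
> # ip netns exec qrouter-<uuid> ip link set dev qr-b7e0ef22-32 mtu 8950
>
> giving:)
>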
> 2: qr-49b27408-04: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc
> pfifo_fast state UP mode DEFAULT group default qlen 1000
>     link/ether fa:16:3e:e5:43:1c brd ff:ff:ff:ff:ff:ff
> 3: qr-b7e0ef22-32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc
> pfifo_fast state UP mode DEFAULT group default qlen 1000
>     link/ether fa:16:3e:16:01:92 brd ff:ff:ff:ff:ff:ff
> 4: qg-7bbe8e38-cc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc
> pfifo_fast state UP mode DEFAULT group default qlen 1000
>     link/ether fa:16:3e:2b:c1:fd brd ff:ff:ff:ff:ff:ff
>
> Let's ping with a payload size of 8922 for IPv4 and 8902 for IPv6, the
> maximum for a VXLAN segment with 8950 MTU, and look at the tcpdump output
> [4]. For brevity, I'm only showing tcpdump output from the VM tap
> interface. Ping operates normally.
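>
> (Payload arithmetic: 8950 MTU - 20-byte IPv4 header - 8-byte ICMP header
> = 8922; 8950 - 40-byte IPv6 header - 8-byte ICMPv6 header = 8902.)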
>
> # ping -c 1 -s 8922 -M do 10.100.52.104
>
> # ping -c 1 -s 8902 -M do fd00:100:52:1:f816:3eff:fe46:acd3
>
> [4] https://gist.github.com/ionosphere80/85339b587bb9b2693b07
>
> Let's ping with a payload size of 8923 for IPv4 and 8903 for IPv6, one
> byte larger than the maximum for a VXLAN segment with 8950 MTU. The
> router namespace, operating at layer-3, sees the MTU discrepancy between
> the two interfaces in the namespace and returns an ICMP "fragmentation
> needed" or "packet too big" message to the sender. The sender uses the MTU
> value in the ICMP packet to recalculate the length of the first packet and
> caches it for future packets.
>
> # ping -c 1 -s 8923 -M do 10.100.52.104
> PING 10.100.52.104 (10.100.52.104) 8923(8951) bytes of data.
> From 10.100.52.104 icmp_seq=1 Frag needed and DF set (mtu = 8950)
>
> --- 10.100.52.104 ping statistics ---
> 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
>
> # ping6 -c 1 -s 8903 -M do fd00:100:52:1:f816:3eff:fe46:acd3
> PING fd00:100:52:1:f816:3eff:fe46:acd3(fd00:100:52:1:f816:3eff:fe46:acd3)
> 8903 data bytes
> From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=8950
>
> --- fd00:100:52:1:f816:3eff:fe46:acd3 ping statistics ---
> 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
>
> # ip route get to 10.100.52.104
> 10.100.52.104 dev eth1  src 10.100.52.45
>     cache  expires 596sec mtu 8950
>
> # ip route get to fd00:100:52:1:f816:3eff:fe46:acd3
> fd00:100:52:1:f816:3eff:fe46:acd3 from :: via fd00:100:52::101 dev eth1
>  src fd00:100:52::45  metric 0
>     cache  expires 556sec mtu 8950
>
> Finally, let's try SSH.
>
> # ssh cirros at 10.100.52.104
> cirros at 10.100.52.104's password:
> $
>
> # ssh cirros at fd00:100:52:1:f816:3eff:fe46:acd3
> cirros at fd00:100:52:1:f816:3eff:fe46:acd3's password:
> $
>
> SSH works for both IPv4 and IPv6.
>
> This experiment reaches the same conclusion as the first experiment.
> However, using physical hardware that supports jumbo frames reveals an
> additional problem with the veth pair for the neutron router interface on
> the public network. For any MTU, we can address the egress MTU disparity
> (from the VM) by advertising the MTU of the overlay network to the VM via
> DHCP/RA or using manual interface configuration. Additionally, the IP
> protocol version does not impact MTU calculation for Linux bridge.
>
> Hopefully moving to physical hardware makes this experiment easier to
> understand and the conclusion more useful for realistic networks.
>
> Matt
>
> On Wed, Jan 20, 2016 at 11:18 AM, Rick Jones <rick.jones2 at hpe.com> wrote:
>
>> On 01/20/2016 08:56 AM, Sean M. Collins wrote:
>>
>>> On Tue, Jan 19, 2016 at 08:15:18AM EST, Matt Kassawara wrote:
>>>
>>>> No. However, we ought to determine what happens when both DHCP and RA
>>>> advertise it.
>>>>
>>>
>>> We'd have to look at the RFCs for how hosts are supposed to behave since
>>> IPv6 has a minimum MTU of 1280 bytes while IPv4's minimum MTU is 576
>>> (what is this, an MTU for ants?).
>>>
>>
>> Quibble - 576 is the IPv4 minimum *maximum* datagram size.  That is to
>> say, a compliant IPv4 implementation must be able to reassemble datagrams
>> of at least 576 bytes.
>>
>> If memory serves, the actual minimum MTU for IPv4 is 68 bytes.
>>
>> rick jones