[openstack-dev] [Neutron] MTU configuration pain

Adam Lawson alawson at aqorn.com
Sat Jan 23 19:27:46 UTC 2016


For the sake of over-simplification, is there ever a reason NOT to enable
jumbo frames in a cloud/SDN context where most of the traffic is between
virtual elements that all support them? I understand that some switches do
not support jumbo frames and that traffic from the web doesn't use them
either, but besides that, a default "jumboframes = 1" concept seems like it
would work just fine to me.

Then again, I'm all about making OpenStack easier to consume, so my ideas
tend to gloss over special use cases with special requirements.


*Adam Lawson*

AQORN, Inc.
427 North Tatnall Street
Ste. 58461
Wilmington, Delaware 19801-2230
Toll-free: (844) 4-AQORN-NOW ext. 101
International: +1 302-387-4660
Direct: +1 916-246-2072

On Fri, Jan 22, 2016 at 7:13 PM, Matt Kassawara <mkassawara at gmail.com>
wrote:

> The fun continues, now using an OpenStack deployment on physical hardware
> that supports jumbo frames with 9000 MTU and IPv4/IPv6. This experiment
> still uses Linux bridge for consistency. I'm planning to run similar
> experiments with Open vSwitch and Open Virtual Network (OVN) in the next
> week.
>
> I highly recommend reading further, but here's the TL;DR: Using physical
> network interfaces with MTUs larger than 1500 reveals an additional problem
> with the veth pair for the neutron router interface on the public network.
> Additionally, IP protocol version does not impact MTU calculation for
> Linux bridge.
>
> First, review the OpenStack bits and resulting network components in the
> environment [1]. In the first experiment, public cloud network limitations
> prevented us from truly seeing how Linux bridge (actually the kernel)
> handles physical network interfaces with MTUs larger than 1500. In this
> experiment, we see that the kernel automatically calculates the proper MTU
> for bridges and VXLAN interfaces using the MTU of parent devices. Also,
> note that a regular 'ping' works between the host outside of the
> deployment and the VM [2].
>
> [1] https://gist.github.com/ionosphere80/a3725066386d8ca4c6d7
> [2] https://gist.github.com/ionosphere80/a8d601a356ac6c6274cb
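>
> As a quick illustration of that calculation (a sketch, not commands from
> this deployment; the interface name eth1 and the VNI are made up):
>
> # ip link set dev eth1 mtu 9000
> # ip link add vxlan-demo type vxlan id 100 dev eth1 dstport 4789
> # ip link show vxlan-demo
>
> The new VXLAN interface comes up with an 8950 MTU, i.e. 9000 from the
> parent device minus the 50-byte VXLAN overhead, without any explicit MTU
> configuration. A bridge behaves similarly, taking the smallest MTU of its
> member ports.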
>
> Note: The tcpdump output in each case references up to six points: neutron
> router gateway on the public network (qg), namespace end of the veth pair
> for the neutron router interface on the private network (qr), bridge end of
> the veth pair for router interface on the private network (tap), controller
> node end of the VXLAN network (underlying interface), compute node end of
> the VXLAN network (underlying interface), and the bridge end of the tap for
> the VM (tap).
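>
> For anyone reproducing this, the captures are plain tcpdump on each device,
> run inside the relevant namespace where necessary (a sketch; the qrouter
> namespace name is a placeholder, and 4789 is the IANA VXLAN port):
>
> # ip netns exec qrouter-UUID tcpdump -n -i qg-7bbe8e38-cc
> # tcpdump -n -i eth1 udp port 4789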
>
> In the first experiment, SSH got "stuck" because of an MTU mismatch on the
> veth pair between the router namespace and private network bridge. In this
> experiment, SSH works because the VM network interface uses a 1500 MTU and
> all devices along the path between the host and VM use a 1500 or larger
> MTU. So, let's configure the VM network interface to use the proper MTU of
> 9000 minus the VXLAN protocol overhead of 50 bytes... 8950... and try SSH
> again.
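>
> On the VM, that's a single command (assuming the eth0 interface shown
> below):
>
> # ip link set dev eth0 mtu 8950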
>
> 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast qlen
> 1000
>     link/ether fa:16:3e:46:ac:d3 brd ff:ff:ff:ff:ff:ff
>     inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0
>     inet6 fd00:100:52:1:f816:3eff:fe46:acd3/64 scope global dynamic
>        valid_lft 86395sec preferred_lft 14395sec
>     inet6 fe80::f816:3eff:fe46:acd3/64 scope link
>        valid_lft forever preferred_lft forever
>
> SSH doesn't work with either IPv4 or IPv6. In a slight twist on the first
> experiment, I don't even see the large packet traversing the neutron
> router gateway on the public network. So, I began a tcpdump closer to the
> source, on the bridge end of the veth pair for the neutron router
> interface on the public network.
>
> Looking at [3], the veth pair between the router namespace and the public
> network bridge drops the packet. The MTU changes over a layer-2 connection
> without a router, similar to connecting two switches with different MTUs.
> Even if it wanted to participate in PMTUD, the veth pair lacks an IP
> address and therefore cannot originate ICMP messages.
>
> [3] https://gist.github.com/ionosphere80/ec83d0955c79b05ea381
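>
> You can see the mismatch directly by comparing the two ends of the pair
> (a sketch; the qrouter namespace name is a placeholder, and the tap name
> assumes Neutron's usual tap-plus-port-ID naming convention):
>
> # ip netns exec qrouter-UUID ip link show qg-7bbe8e38-cc
> # ip link show tap7bbe8e38-cc
>
> The bridge end reports 9000 from the physical network, while the namespace
> end does not yet match it.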
>
> Using observations from the first experiment, let's configure the MTU of
> the interfaces in the qrouter namespace to match the other end of their
> respective veth pairs. The public network (gateway) interface MTU becomes
> 9000 and the private network router interfaces (IPv4 and IPv6) become 8950.
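>
> A sketch of the commands, using the interface names from the output below
> (the qrouter namespace name is a placeholder):
>
> # ip netns exec qrouter-UUID ip link set dev qg-7bbe8e38-cc mtu 9000
> # ip netns exec qrouter-UUID ip link set dev qr-49b27408-04 mtu 8950
> # ip netns exec qrouter-UUID ip link set dev qr-b7e0ef22-32 mtu 8950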
>
> 2: qr-49b27408-04: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc
> pfifo_fast state UP mode DEFAULT group default qlen 1000
>     link/ether fa:16:3e:e5:43:1c brd ff:ff:ff:ff:ff:ff
> 3: qr-b7e0ef22-32: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc
> pfifo_fast state UP mode DEFAULT group default qlen 1000
>     link/ether fa:16:3e:16:01:92 brd ff:ff:ff:ff:ff:ff
> 4: qg-7bbe8e38-cc: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc
> pfifo_fast state UP mode DEFAULT group default qlen 1000
>     link/ether fa:16:3e:2b:c1:fd brd ff:ff:ff:ff:ff:ff
>
> Let's ping with a payload size of 8922 for IPv4 and 8902 for IPv6, the
> maximum for a VXLAN segment with 8950 MTU (8950 minus 20 bytes of IPv4 or
> 40 bytes of IPv6 header, minus 8 bytes of ICMP header), and look at the
> tcpdump output [4]. For brevity, I'm only showing tcpdump output from the
> VM tap interface. Ping operates normally.
>
> # ping -c 1 -s 8922 -M do 10.100.52.104
>
> # ping -c 1 -s 8902 -M do fd00:100:52:1:f816:3eff:fe46:acd3
>
> [4] https://gist.github.com/ionosphere80/85339b587bb9b2693b07
>
> Let's ping with a payload size of 8923 for IPv4 and 8903 for IPv6, one
> byte larger than the maximum for a VXLAN segment with 8950 MTU. The
> router namespace, operating at layer-3, sees the MTU discrepancy between
> the two interfaces in the namespace and returns an ICMP "fragmentation
> needed" or "packet too big" message to the sender. The sender uses the MTU
> value in the ICMP message to recalculate the path MTU and caches it for
> future packets, as the 'ip route get' output below shows.
>
> # ping -c 1 -s 8923 -M do 10.100.52.104
> PING 10.100.52.104 (10.100.52.104) 8923(8951) bytes of data.
> From 10.100.52.104 icmp_seq=1 Frag needed and DF set (mtu = 8950)
>
> --- 10.100.52.104 ping statistics ---
> 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
>
> # ping6 -c 1 -s 8903 -M do fd00:100:52:1:f816:3eff:fe46:acd3
> PING fd00:100:52:1:f816:3eff:fe46:acd3(fd00:100:52:1:f816:3eff:fe46:acd3)
> 8903 data bytes
> From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=8950
>
> --- fd00:100:52:1:f816:3eff:fe46:acd3 ping statistics ---
> 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
>
> # ip route get to 10.100.52.104
> 10.100.52.104 dev eth1  src 10.100.52.45
>     cache  expires 596sec mtu 8950
>
> # ip route get to fd00:100:52:1:f816:3eff:fe46:acd3
> fd00:100:52:1:f816:3eff:fe46:acd3 from :: via fd00:100:52::101 dev eth1
>  src fd00:100:52::45  metric 0
>     cache  expires 556sec mtu 8950
>
> Finally, let's try SSH.
>
> # ssh cirros at 10.100.52.104
> cirros at 10.100.52.104's password:
> $
>
> # ssh cirros at fd00:100:52:1:f816:3eff:fe46:acd3
> cirros at fd00:100:52:1:f816:3eff:fe46:acd3's password:
> $
>
> SSH works for both IPv4 and IPv6.
>
> This experiment reaches the same conclusion as the first experiment.
> However, using physical hardware that supports jumbo frames reveals an
> additional problem with the veth pair for the neutron router interface on
> the public network. For any MTU, we can address the egress MTU disparity
> (from the VM) by advertising the MTU of the overlay network to the VM via
> DHCP/RA or using manual interface configuration. Additionally, IP
> protocol version does not impact MTU calculation for Linux bridge.
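>
> For the DHCP/RA route, the relevant Neutron options currently look roughly
> like this (a sketch; these option names have moved between releases, so
> verify them against your release before relying on them):
>
> # neutron.conf
> [DEFAULT]
> advertise_mtu = True
>
> # ml2_conf.ini
> [ml2]
> path_mtu = 9000
>
> With those set, the DHCP agent can hand the computed network MTU to
> instances as DHCP option 26, and radvd can advertise it in router
> advertisements, instead of configuring every guest by hand.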
>
> Hopefully moving to physical hardware makes this experiment easier to
> understand and the conclusion more useful for realistic networks.
>
> Matt
>
> On Wed, Jan 20, 2016 at 11:18 AM, Rick Jones <rick.jones2 at hpe.com> wrote:
>
>> On 01/20/2016 08:56 AM, Sean M. Collins wrote:
>>
>>> On Tue, Jan 19, 2016 at 08:15:18AM EST, Matt Kassawara wrote:
>>>
>>>> No. However, we ought to determine what happens when both DHCP and RA
>>>> advertise it.
>>>>
>>>
>>> We'd have to look at the RFCs for how hosts are supposed to behave since
>>> IPv6 has a minimum MTU of 1280 bytes while IPv4's minimum MTU is 576
>>> (what is this, an MTU for ants?).
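>>>
>>> For reference, the two mechanisms look roughly like this at the service
>>> level (a sketch of dnsmasq and radvd configuration, not taken from any
>>> deployment here):
>>>
>>> # dnsmasq: push the interface MTU via DHCPv4 option 26
>>> dhcp-option-force=26,8950
>>>
>>> # radvd: advertise the link MTU in IPv6 router advertisements
>>> interface eth0 { AdvSendAdvert on; AdvLinkMTU 8950; };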
>>>
>>
>> Quibble - 576 is the IPv4 minimum *maximum* MTU. That is to say, a
>> compliant IPv4 implementation must be able to reassemble datagrams of at
>> least 576 bytes.
>>
>> If memory serves, the actual minimum MTU for IPv4 is 68 bytes.
>>
>> rick jones