[openstack-dev] [Neutron] MTU configuration pain

Fox, Kevin M Kevin.Fox at pnnl.gov
Tue Jan 26 01:16:03 UTC 2016


Another place to look...
I've had to use network_device_mtu=9000 in nova's config as well to get MTUs working smoothly.

Thanks,
Kevin
________________________________
From: Matt Kassawara [mkassawara at gmail.com]
Sent: Monday, January 25, 2016 5:00 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Neutron] MTU configuration pain

Results from the Open vSwitch agent...

I highly recommend reading further, but here's the TL;DR: Using physical network interfaces with MTUs larger than 1500 reveals problems in several places, but only involving Linux components rather than Open vSwitch components (such as br-int) on both the controller and compute nodes. Most of the problems involve MTU disparities in security group bridge components on the compute node.

First, review the OpenStack bits and resulting network components in the environment [1] and see that a typical 'ping' works using IPv4 and IPv6 [2].

[1] https://gist.github.com/ionosphere80/23655bedd24730d22c89
[2] https://gist.github.com/ionosphere80/5f309e7021a830246b66

Note: The tcpdump output in each case references up to seven points: neutron router gateway on the public network (qg), namespace end of the neutron router interface on the private network (qr), controller node end of the VXLAN network (underlying interface), compute node end of the VXLAN network (underlying interface), Open vSwitch end of the veth pair for the security group bridge (qvo), Linux bridge end of the veth pair for the security group bridge (qvb), and the bridge end of the tap for the VM (tap).

I can use SSH to access the VM because every component between my host and the VM supports at least a 1500 MTU. So, let's configure the VM network interface to use the proper MTU of 9000 minus the VXLAN protocol overhead of 50 bytes... 8950... and try SSH again.

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc pfifo_fast qlen 1000
    link/ether fa:16:3e:ea:22:3a brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.3/24 brd 172.16.1.255 scope global eth0
    inet6 fd00:100:52:1:f816:3eff:feea:223a/64 scope global dynamic
       valid_lft 86396sec preferred_lft 14396sec
    inet6 fe80::f816:3eff:feea:223a/64 scope link
       valid_lft forever preferred_lft forever
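
For reference, the 8950 figure can be reproduced from the header sizes. This is only a sketch, assuming VXLAN over an IPv4 underlay with no IP options; the constant and function names are mine, not anything from neutron.

```python
# Header sizes in bytes. An IPv4 underlay without IP options is assumed.
OUTER_IPV4 = 20      # outer IP header added by the VXLAN endpoint
OUTER_UDP = 8        # outer UDP header
VXLAN_HEADER = 8     # VXLAN header carrying the VNI
INNER_ETHERNET = 14  # Ethernet header of the encapsulated VM frame


def vxlan_overhead(outer_ip_header=OUTER_IPV4):
    """Bytes of the underlay MTU consumed by VXLAN encapsulation."""
    return outer_ip_header + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET


def overlay_mtu(physical_mtu, outer_ip_header=OUTER_IPV4):
    """Largest IP packet a VM can send without exceeding the underlay MTU."""
    return physical_mtu - vxlan_overhead(outer_ip_header)


print(vxlan_overhead())   # 50
print(overlay_mtu(9000))  # 8950
```

An IPv6 underlay would use a 40-byte outer IP header, so pass outer_ip_header=40 in that case.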

Contrary to the Linux bridge experiment, I can still use SSH to access the VM. Why?

Let's ping with a payload size of 8922 for IPv4 and 8902 for IPv6, the maximum for a VXLAN segment with 8950 MTU.

# ping -c 1 -s 8922 -M do 10.100.52.102
PING 10.100.52.102 (10.100.52.102) 8922(8950) bytes of data.
From 10.100.52.102 icmp_seq=1 Frag needed and DF set (mtu = 1500)

--- 10.100.52.102 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping6 -c 1 -s 8902 -M do fd00:100:52:1:f816:3eff:feea:223a
PING fd00:100:52:1:f816:3eff:feea:223a(fd00:100:52:1:f816:3eff:feea:223a) 8902 data bytes
From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=1500

--- fd00:100:52:1:f816:3eff:feea:223a ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
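
The payload sizes used above follow from subtracting the IP and ICMP echo headers from the MTU. A quick sketch, assuming an IPv4 header without options and an IPv6 header without extension headers (the function name is just for illustration):

```python
IPV4_HEADER = 20       # bytes, no IP options
IPV6_HEADER = 40       # bytes, fixed header only
ICMP_ECHO_HEADER = 8   # bytes, same for ICMP and ICMPv6 echo requests


def max_ping_payload(mtu, ipv6=False):
    """Largest value for ping's -s option that still fits in one packet."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - ICMP_ECHO_HEADER


print(max_ping_payload(8950))             # 8922
print(max_ping_payload(8950, ipv6=True))  # 8902
```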

Look at the tcpdump output [3]. The router namespace, operating at layer-3, sees the MTU discrepancy between the inbound packet and the neutron router gateway on the public network and returns an ICMP "fragmentation needed" or "packet too big" message to the sender. The sender uses the MTU value in the ICMP packet to recalculate the length of the first packet and caches it for future packets.

[3] https://gist.github.com/ionosphere80/4e1389a34fd3a628b294
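
To make the PMTUD behavior described above concrete, here is a toy model of a sender that resends at the MTU reported by a router. It is a simplification under my own assumptions: each router is reduced to a single MTU value, every ICMP message reaches the sender, and the sender retries the way TCP would (ping with -M do just reports the error instead).

```python
def discover_path_mtu(packet_size, sender_mtu, router_mtus):
    """Return (attempted packet sizes, MTU the sender ends up caching).

    router_mtus lists, in path order, the MTU each layer-3 hop enforces.
    """
    cached_mtu = sender_mtu
    attempts = []
    while True:
        size = min(packet_size, cached_mtu)
        attempts.append(size)
        for mtu in router_mtus:
            if size > mtu:
                # The router drops the packet and reports its MTU in the
                # ICMP message; the sender caches it and tries again.
                cached_mtu = mtu
                break
        else:
            # No hop objected, so the packet was delivered.
            return attempts, cached_mtu


# Router interfaces at 1500: the sender falls back to 1500.
print(discover_path_mtu(8950, 9000, [1500]))  # ([8950, 1500], 1500)
```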

Although PMTUD enables communication between my host and the VM, it limits MTU to 1500 regardless of the MTU between the router namespace and VM and therefore could impact performance on 10 Gbps or faster networks. Also, it does not address the MTU disparity between a VM and network components on the compute node. If a VM uses a 1500 or smaller MTU, it cannot send packets that exceed the MTU of the tap interface, veth pairs, and bridge on the compute node. In this situation, which seems fairly typical for operators trying to work around MTU problems, communication between a host (outside of OpenStack) and a VM always works. However, what if a VM uses an MTU larger than 1500 and attempts to send a large packet? The bridge or veth pairs would drop it because of the MTU disparity.

Using observations from the Linux bridge experiment, let's configure the MTU of the interfaces in the router namespace to match the interfaces outside of the namespace. The public network (gateway) interface MTU becomes 9000 and the private network router interfaces (IPv4 and IPv6) become 8950.

31: qr-d744191c-9d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:34:67:40 brd ff:ff:ff:ff:ff:ff
32: qr-ae54b450-b4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:d4:f1:63 brd ff:ff:ff:ff:ff:ff
33: qg-e3303f07-e7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/ether fa:16:3e:70:09:54 brd ff:ff:ff:ff:ff:ff
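
One possible way to apply those values is "ip link set" inside the namespace. The sketch below only builds the commands rather than running them; the namespace name qrouter-EXAMPLE is a placeholder, and the helper name is made up for this example.

```python
def mtu_commands(namespace, interface_mtus):
    """Build 'ip link set' commands that set MTUs inside a namespace."""
    return [
        ["ip", "netns", "exec", namespace,
         "ip", "link", "set", "dev", name, "mtu", str(mtu)]
        for name, mtu in interface_mtus.items()
    ]


# Interface names are taken from the output above; the namespace name
# is hypothetical and must be replaced with the real router namespace.
commands = mtu_commands("qrouter-EXAMPLE", {
    "qg-e3303f07-e7": 9000,
    "qr-d744191c-9d": 8950,
    "qr-ae54b450-b4": 8950,
})
for command in commands:
    print(" ".join(command))
```

Each list can be passed to subprocess.run() as root once the namespace name is filled in.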

Let's ping again with a payload size of 8922 for IPv4, the maximum for a VXLAN segment with 8950 MTU, and look at the tcpdump output [4]. For brevity, I'm only showing IPv4 because IPv6 provides similar results.

# ping -c 1 -s 8922 -M do 10.100.52.102

[4] https://gist.github.com/ionosphere80/703925fbe4ae53e78445

The packet traverses the Open vSwitch infrastructure including the overlay. However, looking at the compute node, the integration bridge drops the packet because the MTU changes from 8950 to 1500 over a layer-2 connection without a router.

Let's increase the MTU on the OVS end of the veth pair to 8950, and ping again using the same payload. For brevity, I'm only showing tcpdump output for interfaces on the compute node [5].

# ping -c 1 -s 8922 -M do 10.100.52.102

[5] https://gist.github.com/ionosphere80/0f0d4cf346ee81e43cbb

The packet gets one step further. The veth pair between the Open vSwitch integration bridge and security group bridge drops the packet because the MTU changes from 8950 to 1500 over a layer-2 connection without a router.

Let's increase the MTU on the Linux bridge end of the veth pair to 8950 and ping again using the same payload. For brevity, I'm only showing tcpdump output for interfaces on the compute node [6].

[6] https://gist.github.com/ionosphere80/dd9270aae23ad286d9cd

The packet gets one step further. The VM tap interface drops the packet because the MTU changes from 8950 to 1500 over a layer-2 connection without a router.

Let's perform the final MTU increase on the VM tap interface and ping again using the same payload. For brevity, I'm only showing tcpdump output for interfaces on the compute node [7].

[7] https://gist.github.com/ionosphere80/05e02c7a753fad4b2964

Ping works.
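
The step-by-step progression above can be summarized with a small model of the layer-2 path on the compute node. This is a sketch that treats each component as nothing more than an MTU value, which is my simplification and not how the kernel actually decides.

```python
def first_drop(packet_size, hop_mtus):
    """Return the first hop whose MTU is too small, or None if delivered.

    hop_mtus maps hop names to MTUs in the order the packet meets them.
    A layer-2 hop drops silently; nothing sends ICMP back to the sender.
    """
    for name, mtu in hop_mtus.items():
        if packet_size > mtu:
            return name
    return None


PACKET = 8950  # 8922-byte payload plus ICMP and IPv4 headers
hops = {"qvo": 1500, "qvb": 1500, "tap": 1500}

results = []
for raised in [None, "qvo", "qvb", "tap"]:
    if raised is not None:
        hops[raised] = 8950  # raise one more component, as in the steps above
    results.append(first_drop(PACKET, hops))

print(results)  # ['qvo', 'qvb', 'tap', None]
```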

Let's ping with a payload size of 8923 for IPv4 and 8903 for IPv6, one byte larger than the maximum for a VXLAN segment with 8950 MTU. The router namespace, operating at layer-3, sees the MTU discrepancy between the two interfaces in the namespace and returns an ICMP "fragmentation needed" or "packet too big" message to the sender. The sender uses the MTU value in the ICMP packet to recalculate the length of the first packet and caches it for future packets.

# ping -c 1 -s 8923 -M do 10.100.52.102
PING 10.100.52.102 (10.100.52.102) 8923(8951) bytes of data.
From 10.100.52.102 icmp_seq=1 Frag needed and DF set (mtu = 8950)

--- 10.100.52.102 ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ping6 -c 1 -s 8903 -M do fd00:100:52:1:f816:3eff:feea:223a
PING fd00:100:52:1:f816:3eff:feea:223a(fd00:100:52:1:f816:3eff:feea:223a) 8903 data bytes
From fd00:100:52::101 icmp_seq=1 Packet too big: mtu=8950

--- fd00:100:52:1:f816:3eff:feea:223a ping statistics ---
1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

# ip route get to 10.100.52.102
10.100.52.102 dev eth1  src 10.100.52.45
    cache  expires 499sec mtu 8950

# ip route get to fd00:100:52:1:f816:3eff:feea:223a
fd00:100:52:1:f816:3eff:feea:223a from :: via fd00:100:52::101 dev eth1  src fd00:100:52::45  metric 0
    cache  expires 544sec mtu 8950
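
If you want to check the cached value from a script, the "mtu" field can be pulled out of that output with a regular expression. A minimal sketch, assuming the output format shown above (it may differ between iproute2 versions):

```python
import re

# Sample copied from the "ip route get" output above.
SAMPLE = """10.100.52.102 dev eth1  src 10.100.52.45
    cache  expires 499sec mtu 8950
"""


def cached_mtu(route_output):
    """Return the cached path MTU from 'ip route get' output, or None."""
    match = re.search(r"\bmtu (\d+)", route_output)
    return int(match.group(1)) if match else None


print(cached_mtu(SAMPLE))  # 8950
```

A result of None simply means no path MTU is cached for that destination.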

This experiment reveals a number of problems with the Open vSwitch agent, none of which seem to involve Open vSwitch itself.

1) Like the Linux bridge agent, interfaces in namespaces assume a 1500 MTU, which prevents communication with VMs using larger packets. However, the method OVS uses to manage interfaces in namespaces permits them to generate ICMP messages for PMTUD that notify senders of the correct MTU.
2) Although interfaces in namespaces generate ICMP messages for PMTUD, they assume a 1500 MTU and therefore limit performance on 10 Gbps or faster networks regardless of the MTU between the router namespace and a VM.
3) The Open vSwitch agent creates Linux bridges on compute nodes to implement security groups. These bridges do not contain ports on physical network interfaces (using a larger MTU) and therefore assume a 1500 MTU. The veth pairs and tap interfaces also assume a 1500 MTU. Unlike with the Linux bridge agent, increasing only the MTU of the namespace end of the veth pair for the neutron router interface on the private network simply moves the problem to the security group bridge components. The latter components (qvo, qvb, and tap) should all use the MTU of the physical network minus the overlay protocol overhead, or 8950 for VXLAN in this particular experiment.
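
As a rough way to spot the components from item 3 that still sit at 1500, the "ip link" output on a compute node can be checked against the expected overlay MTU. This is only a sketch: the interface names in the sample are invented, and the regular expression assumes the usual "N: name: <FLAGS> mtu M" line format.

```python
import re

# Matches lines such as "10: qvoexample@qvbexample: <...> mtu 8950 ..."
LINK_LINE = re.compile(r"^\d+: ([^:@]+)(?:@[^:]+)?: <[^>]*> mtu (\d+)")

# Hypothetical compute node output; these names are made up.
SAMPLE = """10: qvoexample@qvbexample: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue
11: qvbexample@qvoexample: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue
12: tapexample: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast
"""


def parse_mtus(ip_link_output):
    """Map interface name to MTU for every link line in the output."""
    mtus = {}
    for line in ip_link_output.splitlines():
        match = LINK_LINE.match(line)
        if match:
            mtus[match.group(1)] = int(match.group(2))
    return mtus


def mismatched(mtus, expected, prefixes=("qvo", "qvb", "tap")):
    """List security group components whose MTU differs from expected."""
    return sorted(name for name, mtu in mtus.items()
                  if name.startswith(prefixes) and mtu != expected)


print(mismatched(parse_mtus(SAMPLE), 8950))  # ['qvbexample', 'tapexample']
```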

Matt

On Mon, Jan 25, 2016 at 12:10 PM, Rick Jones <rick.jones2 at hpe.com> wrote:
On 01/24/2016 07:43 PM, Ian Wells wrote:
Also, I say 9000, but why is 9000 even the right number?

While that may have been a rhetorical question...

Because that is the value Alteon picked in the late 1990s when they created the de facto standard for "Jumbo Frames" by including it in their Gigabit Ethernet kit as a way to enable the systems of the day to have a hope of getting link-rate :)

Perhaps they picked 9000 because it was twice the 4500 of FDDI, which itself was selected to allow space for 4096 bytes of data and then a good bit of headers.



rick jones

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev