[Openstack] MTU/Fragmentation/Retransmission Problem in Juno using GRE

Eren Türkay erent at skyatlas.com
Mon Jan 19 14:31:19 UTC 2015


Hello,

This will be a long e-mail in which I present my findings on the
$subject. I have been debugging this problem for 6 days and I have
pinpointed where the problem lies, but I have not been able to fix it. I
would really appreciate it if you could read it. I hope this e-mail will
serve as a reference on the mailing list for other people experiencing
the same or a similar problem.

Unfortunately, I am stuck on this problem. Any help is appreciated.


=== TL;DR ===
ICMP packets from the VM are fragmented normally and are seen as-is on
the tap interface, but they are reassembled on the interfaces above tap
(qbr, qvb, qvo). They never make their way out of the compute node over
the GRE tunnel.
echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables

on the compute node makes the fragmented packets pass through as-is,
without being reassembled on the interfaces above (qbr, qvb, etc.). They
make their way out over the GRE tunnel and reach the router namespace;
the namespace attempts to reply, but the reply never reaches the GRE
tunnel on the network node.

Lowering the MTU of every interface in the router network namespace to
1454 (same as the VM) fixes the ICMP problem. ICMP packets of any length
get from the VM to the router and vice versa.

However, regular TCP connections still do not work. I see a lot of
retransmissions. Even a simple nc connection from the VM to the router
namespace is unreliable. iperf shows a throughput of 23 Kbit/s.

Here is the detailed problem description.

=== Symptoms ===

These are the symptoms I experience.

1- Ping works, but I cannot SSH into the VM.
2- I cannot download anything inside the VM; the connection is too slow.
3- A lot of TCP retransmissions occur.
4- The VM cannot communicate with the metadata server (maybe related to
   2/3? See the quick check below.)
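
As a quick reproduction of symptom 4, the check I run from inside the VM
is roughly the following (169.254.169.254 is the standard metadata
address; the timeout value is just an illustrative choice):

  # should list the available metadata API versions
  curl --max-time 10 http://169.254.169.254/
  # EC2-style path that cloud-init also queries
  curl --max-time 10 http://169.254.169.254/latest/meta-data/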


=== Install and Infrastructure Information ===

I have followed the official Juno documentation step by step (and
double-checked that I did not misconfigure anything). Neutron is
configured with ML2, Open vSwitch, and GRE, just as suggested in the
documentation.

I have 3 physical machines, each with 2 NICs (controller, network, and
compute). The em1 interface is my management and data network and sits
on a separate switch (10.20.0.0/24); the network and compute nodes
communicate over GRE on this network. em2 is connected to another switch
which acts as the outside network (192.168.88.0/24). So the external and
internal networks are physically separate.

The hosts run Ubuntu Server 14.04 with OpenStack Juno. The kernel
version is 3.13.0-32-generic and the Open vSwitch version is
2.0.1+git20140120-0ubuntu2. KVM is used as the hypervisor.

The VMs have an MTU of 1454 configured through the dnsmasq configuration
file, as described in the official Juno documentation (I verified this
inside the VM as well).
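
For completeness, this is roughly how the MTU is pushed to the VMs via
DHCP in the Juno guide (option 26 is the DHCP "interface MTU" option;
the file paths below are the ones the guide uses):

  # /etc/neutron/dnsmasq-neutron.conf on the network node
  dhcp-option-force=26,1454

  # /etc/neutron/dhcp_agent.ini
  dnsmasq_config_file = /etc/neutron/dnsmasq-neutron.conf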

The VM network is 10.0.0.0/24. I have 1 VM for testing, with the IP
address 10.0.0.8. Its router address is 10.0.0.1, and the router has a
gateway address of 192.168.88.1.

All the bridges (created by neutron, agents, etc.) and the network
interfaces on the physical hosts have an MTU of 1500.
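
A quick way to confirm this on each host is something like:

  # prints "<interface>: mtu <value>" for every link
  ip -o link | awk '{print $2, $4, $5}'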

GRO and TSO are off on the network and compute nodes (ethtool -K em1 gro/tso off).
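
Spelled out, the offloads were disabled and verified roughly like this:

  ethtool -K em1 gro off   # disable generic receive offload
  ethtool -K em1 tso off   # disable TCP segmentation offload
  # verify the current state
  ethtool -k em1 | grep -E 'generic-receive-offload|tcp-segmentation-offload'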


=== Findings ===

I will skip the long road that led here :( and present the issue
directly. I realized that "ping -s 1430 10.0.0.1" inside the VM works,
but "ping -s 1431 10.0.0.1" does not. I then checked whether the same
holds in the other direction, from the router network namespace for this
network on the network node: running "ip netns exec <qrouterxxxx> ping
-s 1430 10.0.0.8" works, but -s 1431 does not.

1- Looking at the problem from the VM side, I ran tcpdump in nearly
every place. It appears that the problem lies in the qvo/qbr/qvb
interfaces, as explained in [0].

When ICMP packets are sent, they are fragmented as expected on the tap
interface. However, the fragments are reassembled just after the tap
interface, on qbrxxx, and carried reassembled all the way to
qvbxxx/qvoxxx. As a result, the packet never gets out over the GRE
tunnel.
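
For reference, the comparison was done with tcpdump on both interfaces
(the tapxxx/qbrxxx names below are placeholders for the VM's actual
devices):

  # on the compute node; the fragments are visible here
  tcpdump -n -v -i tapxxx host 10.0.0.1
  # one hop up; only the reassembled packet shows up
  tcpdump -n -v -i qbrxxx host 10.0.0.1
  # note: a plain "icmp" filter would miss the non-initial fragments,
  # which is why the filter is on the host address instead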


2- Looking from the network namespace on the network node, I checked
the MTU values of the interfaces: qrxxx and qgxxx have an MTU of 1500.
Then I ran tcpdump on qrxxx, where the packets to/from the VM should be
seen. In addition, I ran tcpdump on em1 (the management interface, where
the GRE packets should be seen) on both the network and compute nodes.
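
The captures were taken roughly as follows (the qrouter UUID and the
qr-xxx interface name are placeholders):

  # inside the router namespace on the network node
  ip netns exec qrouter-xxxx tcpdump -n -v -i qr-xxx host 10.0.0.8

  # on em1 of both nodes; 47 is the GRE protocol number
  tcpdump -n -v -i em1 'ip proto 47'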

With "ping -s 1431 10.0.0.8", I saw that packets were fragmented inside
network namespace. However, the first packet wasn't seen in GRE (em1),
but the second fragmented packet was seen, it made its way out. Since it
was not a full packet, ping failed from network namespace to VM.


=== Attempts to Solve The Problem ===

I searched for the bridge fragmentation problem and found suggestions
to disable "bridge-nf-call-iptables". I ran the following command on the
compute node:

"echo 0 > /proc/sys/net/bridge/bridge-nf-call-iptables"

With this setting, the fragmented packets from the VM pass through
as-is on the tap and the other qvb/qvo interfaces; they are not
reassembled. They make their way out over the GRE tunnel and reach the
router namespace. The router namespace attempts to reply, but the reply
packet never goes out over the GRE tunnel. This tcpdump is attached as
"ping-from-vm-to-router-with-1431-bridge-nf-call-iptables-off.txt".

I suspected the MTU settings inside the router namespace, so I lowered
the MTU of the interfaces in the namespace to 1454 (same as the VM).
With this setting, I can now ping the VM from the router namespace with
-s 1431. It seems that lowering the MTU values in the router namespace
fixes the problem. I pinged the router from the VM and the VM from the
router, and it works in both directions. Whatever size I pass to -s
(2000, 3000, etc.), I get replies.
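
The MTU change inside the namespace was done along these lines (the
qrouter UUID and the qr-xxx/qg-xxx interface names are placeholders for
the actual ones):

  # list the interfaces inside the router namespace
  ip netns exec qrouter-xxxx ip link
  # lower the MTU of each interface to match the VM
  ip netns exec qrouter-xxxx ip link set dev qr-xxx mtu 1454
  ip netns exec qrouter-xxxx ip link set dev qg-xxx mtu 1454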

It seemed that the problem was fixed. However, when I tried making a
regular connection, it failed. I set up an iperf server inside the
network namespace. Connecting from the VM to the router using iperf, I
got 23.3 Kbit/s. Giving up on iperf, I tried a simple netcat connection
and it was unreliable and slow. In tcpdump, I saw a lot of TCP
retransmissions. The tcpdump of a simple netcat connection from the VM
to the router namespace is attached (nc -l -p 9999 was run in the router
namespace). This dump was gathered on the tap interface of the VM on the
compute node.
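
The throughput and netcat tests were run roughly like this (addresses as
described above, qrouter UUID as a placeholder):

  # on the network node, inside the router namespace
  ip netns exec qrouter-xxxx iperf -s
  ip netns exec qrouter-xxxx nc -l -p 9999

  # inside the VM
  iperf -c 10.0.0.1
  nc 10.0.0.1 9999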


=== Summary ===
ICMP packets are OK with "bridge-nf-call-iptables" off on the compute
node and the MTU lowered inside the router namespace. However, regular
TCP connections do not work and I see lots of TCP retransmissions.

I think disabling "bridge-nf-call-iptables" effectively disables the
security groups as well, since they are implemented as iptables rules on
the qbr bridges, so it does not look like the right fix.

Those are the findings I have gathered so far, and I have not fixed the
problem yet. For reference, here are the links I found while working on
this problem; they describe similar issues.


http://lists.openstack.org/pipermail/openstack-dev/2014-January/024995.html

https://bugs.launchpad.net/fuel/+bug/1256289
https://bugs.launchpad.net/openstack-manuals/+bug/1322799


Thank you for reading so far. I appreciate it.

Regards,
Eren

[0] http://openvswitch.org/pipermail/discuss/2014-May/013964.html

-- 
System Administrator
https://skyatlas.com/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tap-interface-on-compute-from-vm-to-router-using-netcat.pcap
Type: application/vnd.tcpdump.pcap
Size: 3303 bytes
Desc: not available
URL: <http://lists.openstack.org/pipermail/openstack/attachments/20150119/ca57b0b7/attachment.pcap>
-------------- next part --------------
compute node, tap interface
===========================
15:30:21.604640 IP (tos 0x0, ttl 64, id 49494, offset 0, flags [+], proto ICMP (1), length 1452)
    10.0.0.8 > 10.0.0.1: ICMP echo request, id 1997, seq 1, length 1432
15:30:21.604709 IP (tos 0x0, ttl 64, id 49494, offset 1432, flags [none], proto ICMP (1), length 27)
    10.0.0.8 > 10.0.0.1: icmp


compute node, em1 interface, GRE tunnel
=======================================
15:30:21.604939 IP (tos 0x0, ttl 64, id 34168, offset 0, flags [DF], proto GRE (47), length 1494)
    compute1 > network: GREv0, Flags [key present], key=0x1, length 1474
        IP (tos 0x0, ttl 64, id 49494, offset 0, flags [+], proto ICMP (1), length 1452)
    10.0.0.8 > 10.0.0.1: ICMP echo request, id 1997, seq 1, length 1432
15:30:21.604951 IP (tos 0x0, ttl 64, id 34169, offset 0, flags [DF], proto GRE (47), length 69)
    compute1 > network: GREv0, Flags [key present], key=0x1, length 49
        IP (tos 0x0, ttl 64, id 49494, offset 1432, flags [none], proto ICMP (1), length 27)
    10.0.0.8 > 10.0.0.1: icmp


network node, em1 interface, GRE tunnel
=======================================
15:30:21.579658 IP (tos 0x0, ttl 64, id 34168, offset 0, flags [DF], proto GRE (47), length 1494)
    compute1 > network: GREv0, Flags [key present], key=0x1, length 1474
        IP (tos 0x0, ttl 64, id 49494, offset 0, flags [+], proto ICMP (1), length 1452)
    10.0.0.8 > 10.0.0.1: ICMP echo request, id 1997, seq 1, length 1432
15:30:21.579692 IP (tos 0x0, ttl 64, id 34169, offset 0, flags [DF], proto GRE (47), length 69)
    compute1 > network: GREv0, Flags [key present], key=0x1, length 49
        IP (tos 0x0, ttl 64, id 49494, offset 1432, flags [none], proto ICMP (1), length 27)
    10.0.0.8 > 10.0.0.1: icmp


network namespace on network node, qrxxx interface
==================================================
15:30:21.579937 IP (tos 0x0, ttl 64, id 49494, offset 1432, flags [none], proto ICMP (1), length 27)       
    10.0.0.8 > 10.0.0.1: ip-proto-1
15:30:21.579952 IP (tos 0x0, ttl 64, id 49494, offset 0, flags [+], proto ICMP (1), length 1452)
    10.0.0.8 > 10.0.0.1: ICMP echo request, id 1997, seq 1, length 1432

15:30:21.579990 IP (tos 0x0, ttl 64, id 41204, offset 0, flags [none], proto ICMP (1), length 1459)
    10.0.0.1 > 10.0.0.8: ICMP echo reply, id 1997, seq 1, length 1439


The last reply does not make its way into the GRE tunnel; it is not seen on the em1 interface.


