[Openstack] Directional network performance issues with Neutron + OpenvSwitch

Speichert,Daniel djs428 at drexel.edu
Thu Oct 24 14:22:27 UTC 2013


Hello everyone,

It seems we also ran into the same issue.

We are running Ubuntu Saucy with OpenStack Havana from Ubuntu Cloud archives (precise-updates).

The download speed to the VMs increased from 5 Mbps to the maximum after enabling ovs_use_veth. The upload speed from the VMs is still terrible (at most 1 Mbps, usually 0.04 Mbps).
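For reference, the knob we toggled is the one below in the L3 and DHCP agent configs (file paths as shipped by the Ubuntu packages; neutron-l3-agent and neutron-dhcp-agent restarted afterwards):

# /etc/neutron/l3_agent.ini and /etc/neutron/dhcp_agent.ini
[DEFAULT]
ovs_use_veth = True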

Here is an iperf run between an instance and the L3 agent (network node), executed inside the router namespace.

root at cloud:~# ip netns exec qrouter-a29e0200-d390-40d1-8cf7-7ac1cef5863a  iperf -c 10.1.0.24 -r
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 10.1.0.24, TCP port 5001
TCP window size:  585 KByte (default)
------------------------------------------------------------
[  7] local 10.1.0.1 port 37520 connected with 10.1.0.24 port 5001
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec   845 MBytes   708 Mbits/sec
[  6] local 10.1.0.1 port 5001 connected with 10.1.0.24 port 53006
[  6]  0.0-31.4 sec   256 KBytes  66.7 Kbits/sec

We are using Neutron OpenVSwitch with GRE and namespaces.

A side question: the documentation says to disable namespaces with GRE and enable them with VLANs. GRE with namespaces always worked well for us on Grizzly, and we could never get it to work without namespaces. Is there a specific reason the documentation advises disabling them?
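For context, the namespace setting we have had enabled since Grizzly is the one below (option name as in the Havana agent configs; same two files as above):

[DEFAULT]
use_namespaces = True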

Regards,
Daniel

From: Martinx - ジェームズ [mailto:thiagocmartinsc at gmail.com]
Sent: Thursday, October 24, 2013 3:58 AM
To: Aaron Rosen
Cc: openstack at lists.openstack.org
Subject: Re: [Openstack] Directional network performance issues with Neutron + OpenvSwitch

Hi Aaron,

Thanks for answering!     =)

Let's work...

---

TEST #1 - iperf between Network Node and its Uplink router (Data Center's gateway "Internet") - OVS br-ex / eth2

# Tenant Namespace route table

root at net-node-1:~# ip netns exec qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 ip route
default via 172.16.0.1 dev qg-50b615b7-c2
172.16.0.0/20 dev qg-50b615b7-c2  proto kernel  scope link  src 172.16.0.2
192.168.210.0/24 dev qr-a1376f61-05  proto kernel  scope link  src 192.168.210.1

# there is an "iperf -s" running at 172.16.0.1 ("Internet"); testing against it

root at net-node-1:~# ip netns exec qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 iperf -c 172.16.0.1
------------------------------------------------------------
Client connecting to 172.16.0.1, TCP port 5001
TCP window size: 22.9 KByte (default)
------------------------------------------------------------
[  5] local 172.16.0.2 port 58342 connected with 172.16.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec   668 MBytes   559 Mbits/sec
---

---

TEST #2 - iperf from one instance to the namespace of the L3 agent + uplink router

# iperf server running within Tenant's Namespace router

root at net-node-1:~# ip netns exec qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 iperf -s

-

# from instance-1

ubuntu at instance-1:~$ ip route
default via 192.168.210.1 dev eth0  metric 100
192.168.210.0/24 dev eth0  proto kernel  scope link  src 192.168.210.2

# instance-1 performing tests against net-node-1 Namespace above

ubuntu at instance-1:~$ iperf -c 192.168.210.1
------------------------------------------------------------
Client connecting to 192.168.210.1, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.210.2 port 43739 connected with 192.168.210.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   484 MBytes   406 Mbits/sec

# still on instance-1, now against "External IP" of its own Namespace / Router

ubuntu at instance-1:~$ iperf -c 172.16.0.2
------------------------------------------------------------
Client connecting to 172.16.0.2, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.210.2 port 34703 connected with 172.16.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   520 MBytes   436 Mbits/sec

# still on instance-1, now against the Data Center UpLink Router

ubuntu at instance-1:~$ iperf -c 172.16.0.1
------------------------------------------------------------
Client connecting to 172.16.0.1, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.210.4 port 38401 connected with 172.16.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   324 MBytes   271 Mbits/sec
---

This last test shows only 271 Mbits/sec! I think it should be at least 400~430 Mbits/sec... right?!
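(If it helps, I can repeat this test in both directions with iperf's reverse mode to confirm whether the drop is directional at this hop as well, e.g.:

ubuntu at instance-1:~$ iperf -c 172.16.0.1 -r

"-r" runs the normal client-to-server test and then a second test with the roles swapped.)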

---

TEST #3 - Two instances on the same hypervisor

# iperf server

ubuntu at instance-2:~$ ip route
default via 192.168.210.1 dev eth0  metric 100
192.168.210.0/24 dev eth0  proto kernel  scope link  src 192.168.210.4

ubuntu at instance-2:~$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.210.4 port 5001 connected with 192.168.210.2 port 45800
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  4.61 GBytes  3.96 Gbits/sec

# iperf client

ubuntu at instance-1:~$ iperf -c 192.168.210.4
------------------------------------------------------------
Client connecting to 192.168.210.4, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.210.2 port 45800 connected with 192.168.210.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  4.61 GBytes  3.96 Gbits/sec
---

---

TEST #4 - Two instances on different hypervisors - over GRE

root at instance-2:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.210.4 port 5001 connected with 192.168.210.2 port 34640
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   237 MBytes   198 Mbits/sec


root at instance-1:~# iperf -c 192.168.210.4
------------------------------------------------------------
Client connecting to 192.168.210.4, TCP port 5001
TCP window size: 21.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.210.2 port 34640 connected with 192.168.210.4 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   237 MBytes   198 Mbits/sec
---

I just realized how slow my intra-cloud (VM-to-VM) communication is...   :-/

---

TEST #5 - Two hypervisors - "GRE TUNNEL LAN" - OVS local_ip / remote_ip

# Same path as "TEST #4", but testing the physical network path that carries the GRE traffic

root at hypervisor-2:~$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 10.20.2.57 port 5001 connected with 10.20.2.53 port 51694
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  1.09 GBytes   939 Mbits/sec

root at hypervisor-1:~# iperf -c 10.20.2.57
------------------------------------------------------------
Client connecting to 10.20.2.57, TCP port 5001
TCP window size: 22.9 KByte (default)
------------------------------------------------------------
[  3] local 10.20.2.53 port 51694 connected with 10.20.2.57 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.09 GBytes   939 Mbits/sec
---

About Test #5: I don't know why the GRE traffic (Test #4) doesn't reach 1 Gbit/sec (it tops out at ~200 Mbit/sec), since its physical path is much faster (gigabit LAN). Also, Test #3 shows a pretty fast speed when traffic stays within a single hypervisor (3.96 Gbit/sec).
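One thing I can check (just a guess on my part): GRE adds encapsulation overhead, so if the instances still use a 1500-byte MTU the tunneled packets may be getting fragmented on the physical network. A quick test is to ping across the tunnel with the DF bit set:

# 1472 bytes of payload + 28 bytes of IP/ICMP headers = 1500 on the wire
ubuntu at instance-1:~$ ping -M do -s 1472 -c 3 192.168.210.4

If that fails while a smaller size (say, -s 1400) works, lowering the instance/DHCP MTU would be worth trying.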

Tomorrow I'll redo these tests with netperf.
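Roughly along these lines (netserver on the target, one stream test in each direction):

root at instance-2:~# netserver

root at instance-1:~# netperf -H 192.168.210.4 -t TCP_STREAM
root at instance-1:~# netperf -H 192.168.210.4 -t TCP_MAERTS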

NOTE: I'm using Open vSwitch 1.11.0, compiled for Ubuntu 12.04.3 with "dpkg-buildpackage" and installed the "Debian/Ubuntu way". If I downgrade to 1.10.2 from the Havana Cloud Archive, I get the same results... I can downgrade it if you guys tell me to do so.
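For the record, the build was roughly the following (from the upstream 1.11.0 tarball, which carries its own debian/ packaging; exact steps from memory):

$ tar xzf openvswitch-1.11.0.tar.gz && cd openvswitch-1.11.0
$ dpkg-buildpackage -b -uc -us
$ sudo dpkg -i ../openvswitch-common_*.deb ../openvswitch-switch_*.deb ../openvswitch-datapath-dkms_*.deb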

BTW, I'll install another "Region" based on Havana on Ubuntu 13.10, with exactly the same configuration as my current Havana + Ubuntu 12.04.3 setup, on top of the same hardware, to see if the problem still persists.

Regards,
Thiago

On 23 October 2013 22:40, Aaron Rosen <arosen at nicira.com> wrote:


On Mon, Oct 21, 2013 at 11:52 PM, Martinx - ジェームズ <thiagocmartinsc at gmail.com> wrote:
James,

I think I'm hitting this problem.

I'm using "Per-Tenant Routers with Private Networks", GRE tunnels and L3+DHCP Network Node.

Connectivity from behind my instances is very slow. It takes an eternity to finish "apt-get update".


I'm curious if you can do the following tests to help pinpoint the bottleneck:

Run iperf or netperf between:
- two instances on the same hypervisor (if performance is bad here, it points to a virtualization driver issue).
- two instances on different hypervisors.
- one instance and the namespace of the l3 agent.

If I run "apt-get update" from within tenant's Namespace, it goes fine.

If I enable "ovs_use_veth", metadata (and/or DHCP) stops working and I am unable to start new Ubuntu instances and log into them... Look:

--
cloud-init start running: Tue, 22 Oct 2013 05:57:39 +0000. up 4.01 seconds
2013-10-22 06:01:42,989 - util.py[WARNING]: 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [3/120s]: url error [[Errno 113] No route to host]
2013-10-22 06:01:45,988 - util.py[WARNING]: 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [6/120s]: url error [[Errno 113] No route to host]
--
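Two quick things I plan to check here (just guesses): whether the instance has a route for the metadata address at all, and whether the L3 agent installed its metadata redirect rule in the router namespace (it should redirect 169.254.169.254:80 to the namespace metadata proxy):

ubuntu at instance-1:~$ ip route get 169.254.169.254

root at net-node-1:~# ip netns exec qrouter-46cb8f7a-a3c5-4da7-ad69-4de63f7c34f1 iptables -t nat -S | grep 169.254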


Do you see anything interesting in the neutron-metadata-agent log? Or does it look like your instance doesn't have a route to the default gateway?


Is this problem still around?!

Should I stay away from GRE tunnels when with Havana + Ubuntu 12.04.3?

Is it possible to re-enable metadata when ovs_use_veth = true?

Thanks!
Thiago

On 3 October 2013 06:27, James Page <james.page at ubuntu.com> wrote:
On 02/10/13 22:49, James Page wrote:
>> sudo ip netns exec qrouter-d3baf1b1-55ee-42cb-a3f6-9629288e3221
>>> traceroute -n 10.5.0.2 -p 44444 --mtu
>>> traceroute to 10.5.0.2 (10.5.0.2), 30 hops max, 65000 byte packets
>>>  1  10.5.0.2  0.950 ms F=1500  0.598 ms  0.566 ms
>>>
>>> The PMTU from the l3 gateway to the instance looks OK to me.
> I spent a bit more time debugging this; performance from within
> the router netns on the L3 gateway node looks good in both
> directions when accessing via the tenant network (10.5.0.2) over
> the qr-XXXXX interface, but when accessing through the external
> network from within the netns I see the same performance choke
> upstream into the tenant network.
>
> Which would indicate that my problem lies somewhere around the
> qg-XXXXX interface in the router netns - just trying to figure out
> exactly what - maybe iptables is doing something wonky?
OK - I found a fix, but I'm not sure why it makes a difference: neither my l3-agent nor my dhcp-agent configuration had 'ovs_use_veth = True'. I switched this on, cleared everything down, rebooted, and now I see symmetric, good performance across all neutron routers.

This would point to some sort of underlying bug when ovs_use_veth = False.


- --
James Page
Ubuntu and Debian Developer
james.page at ubuntu.com
jamespage at debian.org
