[Openstack-operators] [Openstack] Extreme network throughput tuning with KVM as hypervisor

Alejandro Comisario alejandro.comisario at mercadolibre.com
Wed Jan 15 15:58:59 UTC 2014


Hi Narayan, thanks for the prompt response, let me try to give you some
insights.
As you said, our setup can reach the maximum bandwidth in some tests, but
we can't achieve the THROUGHPUT we want. We run an average of 14 VMs on
128GB of RAM compute nodes, and while all those VMs are running, we run a
test between two compute nodes with a C++ application that sends 50 packets
per second (each smaller than our 1500-byte MTU) and expects the response
from the target server in under 10ms.
This test runs on the br100 interface on both compute nodes (passing
through eth0-eth1, bond0 and br100), and while all VMs are running (using
high-throughput, low-bandwidth applications) this simple test shows tens
of thousands of responses taking longer than 10ms. In fact, 99% of these
slow responses take 20/21ms. I can't figure out what that magic delay
value means, so we are starting to look at which of the traversed
interfaces is adding the delay.
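
For readers who want to reproduce a test like the one described above, here
is a minimal Python sketch. It is not the C++ application from this thread.
It assumes a UDP echo service on the target (the thread does not say which
protocol the real test uses), and the names probe and summarize_rtts, the
64-byte payload and the port are all hypothetical.

```python
import socket
import time
from collections import Counter


def probe(host, port, count=50, rate_pps=50, payload=b"x" * 64, timeout_s=1.0):
    """Send `count` small UDP packets at roughly `rate_pps` and return RTTs in ms.

    Assumes something on (host, port) echoes each datagram back.
    Probes that time out are skipped rather than counted.
    """
    interval_s = 1.0 / rate_pps
    rtts_ms = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        for _ in range(count):
            start = time.monotonic()
            sock.sendto(payload, (host, port))
            try:
                sock.recvfrom(2048)
                rtts_ms.append((time.monotonic() - start) * 1000.0)
            except socket.timeout:
                pass  # lost or very late reply
            # Sleep out the remainder of this send slot, if any.
            remaining = interval_s - (time.monotonic() - start)
            if remaining > 0:
                time.sleep(remaining)
    return rtts_ms


def summarize_rtts(rtts_ms, threshold_ms=10.0):
    """Count responses slower than the threshold and bucket them by whole ms."""
    slow = [r for r in rtts_ms if r > threshold_ms]
    buckets = Counter(int(r) for r in slow)  # e.g. 20.4 ms lands in bucket 20
    return {"total": len(rtts_ms), "slow": len(slow), "slow_buckets": dict(buckets)}
```

Bucketing the slow responses by millisecond makes a cluster such as the
20/21ms one stand out, and running the same probe against bond0, br100 and a
VM address in turn is one way to see which hop adds the delay.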

Let me show you what our networking settings look like (I will take the
VMs out of the picture):

COMPUTE HOST
------------
2x1Gb bonded interfaces (no jumbo frames, MTU 1500, since jumbo frames are
a separate project)

Ethernet ring settings on both interfaces:

RX 256
TX 256


Ethernet txqueuelen on both interfaces:

txqueuelen 1000
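
Before raising the rings, it helps to know how far the NIC allows them to
go. The sketch below parses the text printed by `ethtool -g <iface>` and
reports the headroom between the current and the pre-set maximum values.
The sample text is illustrative only: the 256 values match the settings
above, while the 4096 maximums are an assumption and vary by NIC model. The
function names are hypothetical.

```python
SAMPLE = """\
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             256
"""


def parse_ring_params(text):
    """Split `ethtool -g` output into {"max": {...}, "current": {...}}."""
    sections = {"max": {}, "current": {}}
    section = None
    for line in text.splitlines():
        line = line.strip()
        lower = line.lower()
        if lower.startswith("pre-set maximums"):
            section = "max"
        elif lower.startswith("current hardware settings"):
            section = "current"
        elif section and ":" in line:
            key, _, value = line.partition(":")
            value = value.strip()
            if value.isdigit():  # skips non-numeric values such as "n/a"
                sections[section][key.strip()] = int(value)
    return sections


def ring_headroom(text):
    """Return how many descriptors each ring could still grow by."""
    params = parse_ring_params(text)
    return {
        name: params["max"][name] - current
        for name, current in params["current"].items()
        if name in params["max"]
    }


print(ring_headroom(SAMPLE))
```

The actual change would be made with `ethtool -G <iface> rx N tx N`; larger
rings absorb bursts at the cost of some added queueing latency.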


sysctl settings:

net.ipv4.tcp_max_tw_buckets = 3600000
net.ipv4.tcp_max_syn_backlog = 30000
net.core.netdev_max_backlog = 50000
net.core.somaxconn = 16384
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.ipv4.tcp_congestion_control = cubic
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.tcp_fin_timeout = 5
net.ipv4.tcp_keepalive_time = 5
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
vm.swappiness = 0
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_max_orphans = 60000
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_ecn = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_dsack = 1
net.ipv4.route.flush = 1
net.ipv6.route.flush = 1
net.ipv4.netfilter.ip_conntrack_udp_timeout = 30
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close = 10
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_max = 1200000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 432000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent = 120
net.ipv4.tcp_keepalive_time = 90
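
Note that the list above sets net.ipv4.tcp_keepalive_time twice (5, then
90). Assuming the list is applied top to bottom, as `sysctl -p` does with a
file, the later value wins. A small sketch for spotting such overrides in a
settings list; the function name is hypothetical:

```python
def effective_sysctls(text):
    """Return (effective, overridden) for a list of `key = value` lines.

    `effective` maps each key to the last value assigned to it.
    `overridden` maps a key to the earlier, different values it replaced.
    """
    effective = {}
    overridden = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment
        key, value = key.strip(), value.strip()
        if key in effective and effective[key] != value:
            overridden.setdefault(key, []).append(effective[key])
        effective[key] = value
    return effective, overridden
```

Feeding it the settings above would report the earlier value 5 for
net.ipv4.tcp_keepalive_time as overridden by 90.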

One other detail I can add is that the delay is always on the RX side, that
is, on the server that is responding.
So we were thinking about raising the ring or txqueuelen settings.

Any ideas?
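
Since the delay shows up on the receive side, one place to look before
changing any setting is /proc/net/softnet_stat. Each row is one CPU, in hex:
the first column counts packets processed, the second counts packets dropped
because the per-CPU backlog queue (net.core.netdev_max_backlog) was full,
and the third counts the times the softirq ran out of budget (time_squeeze).
A minimal parsing sketch, with a hypothetical function name:

```python
def parse_softnet_stat(text):
    """Parse /proc/net/softnet_stat text into one dict per row (one row per CPU)."""
    rows = []
    for row, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) < 3:
            continue  # not a data row
        processed, dropped, time_squeeze = (int(f, 16) for f in fields[:3])
        rows.append({
            "row": row,
            "processed": processed,
            "dropped": dropped,
            "time_squeeze": time_squeeze,
        })
    return rows


def read_softnet_stat(path="/proc/net/softnet_stat"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_softnet_stat(f.read())
```

Comparing two snapshots taken a few seconds apart while the test is running
shows whether either counter is growing on the cores that handle the NIC
interrupts.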






*Alejandro Comisario #melicloud CloudBuilders*
Arias 3751, Piso 7 (C1430CRG)
Ciudad de Buenos Aires - Argentina
Cel: +549(11) 15-3770-1857
Tel : +54(11) 4640-8443


On Wed, Jan 15, 2014 at 12:32 AM, Narayan Desai <narayan.desai at gmail.com> wrote:

> We don't have a workload remotely like that (generally, we have a lot more
> demand for bandwidth, but we also generally run faster networks than that
> as well), but 1k pps sounds awfully low. Like low by several orders of
> magnitude.
>
> I didn't measure pps in our benchmarking, but did manage to saturate a
> 10GE link from a VM (actually we did this on 10 nodes at a time to saturate
> a 100GE wide area link), and all of those settings are here:
>
> http://buriedlede.blogspot.com/2012/11/driving-100-gigabit-network-with.html
>
> I'd start trying to do some fault isolation; see if you can get NAT out of
> the mix, for example, or see if it is a network stack tuning problem. You
> probably need to crank up some of your buffer sizes, even if you don't need
> to mess with your TCP windows.
>
> Can you actually saturate your 2x1ge lag with bandwidth? (single or ganged
> flows?)
>  -nld
>
>
> On Tue, Jan 14, 2014 at 3:52 PM, Alejandro Comisario <
> alejandro.comisario at mercadolibre.com> wrote:
>
>> Wow, it's kind of hard to imagine we are the only ones that have only
>> 100Mb/s of bandwidth but 50,000 requests per minute on each compute node.
>> I mean, lots of throughput, almost no bandwidth.
>>
>> Does everyone have their networking performance figured out?
>> Is there no one who can share some "SUPER THROUGHPUT" sysctl / ethtool /
>> power / etc settings on the compute side?
>>
>> Best regards.
>>
>> *alejandrito*
>>
>> On Sat, Jan 11, 2014 at 4:12 PM, Alejandro Comisario <
>> alejandro.comisario at mercadolibre.com> wrote:
>>
>>> Well, we have been using nova with KVM for a long time, we have passed
>>> many thousands of VMs, and still, something doesn't feel right.
>>> We are using Ubuntu 12.04 with kernel 3.2.0-[40-48] and sysctl tuned with
>>> lots of parameters, and everything ... works, you could say, quite well.
>>>
>>> But here's the deal: we have a special networking scenario, that is,
>>> EVERYTHING IS APIS, everything is throughput, no bandwidth.
>>> No 2x1Gb bonded compute node gets over [200Mb/s - 400Mb/s], but each one
>>> is handling hundreds of thousands of requests per minute to the VMs.
>>>
>>> And once in a while it gives you the sensation that everything goes to
>>> hell: timeouts from applications here, response times from APIs going
>>> from 10ms to 200ms there, 20ms delays happening between the VM's ETH0
>>> and the VNET interface, etc.
>>> So, since it's a massive scenario to tune, we never quite nailed down
>>> WHERE to apply those 1, 2 or 3 final buffer/ring/affinity tweaks to make
>>> everything work from the compute side.
>>>
>>> I know it's a little awkward, but I'm craving and hunting for community
>>> real-life examples regarding "HIGH THROUGHPUT" tuning with KVM scenarios,
>>> dark Linux, or for someone who can help me go through configurations that
>>> might sound weird / unnecessary / incorrect.
>>>
>>> For those who are wondering "well ... I don't know what you have", let's
>>> start with this.
>>>
>>> COMPUTE NODES (99% of them, different vendors, but ...)
>>> * 128/256 GB of RAM
>>> * 2 hexacores with HT enabled
>>> * 2x1Gb bonded interfaces (if you want to know the more than 20 models
>>> we are using, just ask)
>>> * Multi-queue interfaces, pinned via IRQ to different cores
>>> * Ubuntu 12.04, kernel 3.2.0-[40-48]
>>> * Linux bridges, no VLAN, no Open vSwitch
>>>
>>> I want to keep the networking appliances (TORs, AGGR, COREs) as far
>>> out of the picture as possible.
>>> I'm thinking "I hope this thread gets great, in time".
>>>
>>> So, ready to learn as much as I can.
>>> Thank you, OpenStack community, as always.
>>>
>>> alejandrito
>>>
>>>