[Openstack] [Openstack-operators] Extreme network throughput tuning with KVM as hypervisor

Alejandro Comisario alejandro.comisario at mercadolibre.com
Wed Jan 15 16:41:07 UTC 2014


As I said, while 12 VMs are running on these hosts, I'm conducting this test
only from the hosts, not the VMs, since I can already see the delay on the
host side.
I really want to make the delay go away on the host first and, once that is
solved, work down the vnet -> vm:eth0 path.
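
A minimal sketch of the kind of host-to-host probe described in this thread (50 small packets per second, counting replies slower than 10ms). The real test is a C++ application whose protocol is not stated here, so UDP, the 64-byte payload, the sequence-number framing and all function names below are assumptions made for illustration only.

```python
import socket
import struct
import time


def echo_server(sock):
    """Responder side: echo every datagram back to its sender."""
    while True:
        try:
            data, addr = sock.recvfrom(2048)
        except OSError:
            return  # socket was closed
        sock.sendto(data, addr)


def probe(target, count=50, rate_pps=50, size=64, timeout_s=1.0):
    """Send `count` datagrams at `rate_pps`; return RTTs in seconds (None = lost)."""
    interval = 1.0 / rate_pps
    rtts = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout_s)
        for seq in range(count):
            # Assumed framing: 4-byte sequence number, zero-padded to `size` bytes.
            payload = struct.pack("!I", seq).ljust(size, b"\0")
            start = time.perf_counter()
            s.sendto(payload, target)
            rtt = None
            try:
                while True:
                    data, _ = s.recvfrom(2048)
                    if data[:4] == payload[:4]:  # skip late replies to older probes
                        rtt = time.perf_counter() - start
                        break
            except socket.timeout:
                pass
            rtts.append(rtt)
            # Pace the sender so probes go out roughly every `interval` seconds.
            elapsed = time.perf_counter() - start
            if elapsed < interval:
                time.sleep(interval - elapsed)
    return rtts


def summarize(rtts, threshold_s=0.010):
    """Count lost replies and replies slower than `threshold_s`."""
    answered = [r for r in rtts if r is not None]
    slow = [r for r in answered if r > threshold_s]
    return {"sent": len(rtts), "lost": len(rtts) - len(answered), "slow": len(slow)}
```

Running `echo_server` on one compute node (on a bound UDP socket) and `probe((peer_ip, port))` on the other, first against the address on br100 and then against addresses on interfaces lower in the path, is one way to see which hop adds the delay.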

Obviously the ethX tuning (ethtool) is done on the host, but the sysctl
tuning is done on both the host and the guest.
I was thinking the bonding might be one culprit, but I want to test the
buffer settings first (queues and ring parameters).
Regarding bonding, what kind of problems did you have? Are you now running
a direct eth0 -> br100 setup?
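
Before raising the ring parameters it helps to know how much headroom the NIC offers. The sketch below parses the text printed by `ethtool -g <iface>` and reports which rings are below their pre-set maximum. The current values of 256 match the settings quoted in this thread; the maximum of 4096 in the sample is a made-up figure for illustration, and the function names are hypothetical.

```python
def parse_ring_params(text):
    """Parse `ethtool -g <iface>` output into {"max": {...}, "current": {...}}."""
    result = {"max": {}, "current": {}}
    section = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Pre-set maximums"):
            section = "max"
        elif line.startswith("Current hardware settings"):
            section = "current"
        elif section and ":" in line:
            key, _, value = line.partition(":")
            value = value.strip()
            if value.isdigit():  # ignore entries such as "n/a"
                result[section][key.strip()] = int(value)
    return result


def ring_headroom(params):
    """Return {ring: (current, maximum)} for rings that can still be enlarged."""
    return {
        name: (cur, params["max"][name])
        for name, cur in params["current"].items()
        if name in params["max"] and cur < params["max"][name]
    }


# Illustrative sample only: the 4096 maximums are assumed, not taken from our NICs.
SAMPLE = """Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             256
"""

print(ring_headroom(parse_ring_params(SAMPLE)))
```

If there is headroom, the rings can be resized with `ethtool -G <iface> rx N tx N`. Larger rings absorb bursts better but can also add queuing delay, so it is probably worth changing one value at a time and re-running the latency test.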

Is vhost_net any good? We are just using virtio.
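
A quick way to check whether a host has the vhost_net module loaded is to look for it in `/proc/modules`. This is only a sketch: the helper names are hypothetical, the sample line is illustrative, and a kernel with vhost_net built in (rather than loaded as a module) would not show up in this file.

```python
def module_loaded(proc_modules_text, name):
    """Return True if `name` is the first field of any line in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def vhost_net_loaded(path="/proc/modules"):
    """Read the module list from `path` and look for vhost_net."""
    with open(path) as f:
        return module_loaded(f.read(), "vhost_net")


# Illustrative line in the /proc/modules format (sizes and addresses are made up).
SAMPLE_LINE = "vhost_net 32360 2 - Live 0xffffffffa0300000\n"
print(module_loaded(SAMPLE_LINE, "vhost_net"))
```

Having the module loaded does not by itself mean the guests use it; that also depends on how each guest's tap device was set up.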

thanks!





On Wed, Jan 15, 2014 at 1:32 PM, Narayan Desai <narayan.desai at gmail.com> wrote:

> Are you using virtio, and vhost_net?
>
> Also, where are you tuning those parameters, host or guest? The ethernet
> level ones will definitely need to be done in the host, but the TCP and
> socket buffer ones need to be in the guest.
>
> Also, these buffers may be too large for 2x1ge. You might also check if
> the link aggregation is messing you up here. I've generally had problems
> with it.
>
> One last thing: how does the app run from the hypervisor? You can rule a
> lot of things out by testing that.
>  -nld
>
>
> On Wed, Jan 15, 2014 at 9:58 AM, Alejandro Comisario <
> alejandro.comisario at mercadolibre.com> wrote:
>
>> Hi Narayan, thanks for the prompt response, let me try to give you some
>> insights.
>> As you said, our setup can reach the maximum bandwidth in some tests, but
>> we can't achieve the THROUGHPUT we want. We run an average of 14 VMs on
>> compute nodes with 128GB of RAM and, while all those VMs are running, we
>> run a test between two compute nodes with a C++ application that sends 50
>> packets per second (each smaller than our 1500-byte MTU) and expects the
>> response from the target server in under 10ms.
>> This test runs on the br100 interface on both compute nodes (passing
>> through eth0-eth1, bond0 and br100), and while all VMs are running (using
>> high-throughput, low-bandwidth applications) we see this simple test
>> showing tens of thousands of responses slower than 10ms. Actually, 99% of
>> these slow responses take 20/21ms. I can't work out what that magic delay
>> value means, so we are starting to trace which interfaces are adding the
>> delay.
>>
>> Let me show you what our settings look like regarding networking (i will
>> take the vms out of the picture)
>>
>> COMPUTE HOST
>> ------------
>> 2x1Gb bonded interfaces (no jumbo frames, 1500MTU since jumbo frames are
>> a separate project)
>>
>> Ethernet ring settings on both interfaces:
>>
>> RX 256
>> TX 256
>>
>>
>> Ethernet txqueuelen on both interfaces:
>>
>> txqueuelen 1000
>>
>>
>> sysctl settings:
>>
>> net.ipv4.tcp_max_tw_buckets = 3600000
>> net.ipv4.tcp_max_syn_backlog = 30000
>> net.core.netdev_max_backlog = 50000
>> net.core.somaxconn = 16384
>> net.core.rmem_max = 16777216
>> net.core.wmem_max = 16777216
>> net.ipv4.tcp_rmem = 4096 87380 16777216
>> net.ipv4.tcp_wmem = 4096 65536 16777216
>> net.core.rmem_default = 16777216
>> net.core.wmem_default = 16777216
>> net.ipv4.tcp_congestion_control = cubic
>> net.ipv4.ip_local_port_range = 1024 65000
>> net.ipv4.tcp_fin_timeout = 5
>> net.ipv4.tcp_keepalive_time = 5
>> net.ipv4.tcp_tw_recycle = 1
>> net.ipv4.tcp_tw_reuse = 1
>> vm.swappiness = 0
>> net.ipv4.tcp_syncookies = 1
>> net.ipv4.tcp_timestamps = 1
>> net.ipv4.tcp_max_orphans = 60000
>> net.ipv4.tcp_synack_retries = 3
>> net.ipv4.tcp_ecn=1
>> net.ipv4.tcp_sack=1
>> net.ipv4.tcp_dsack=1
>> net.ipv4.route.flush = 1
>> net.ipv6.route.flush = 1
>> net.ipv4.netfilter.ip_conntrack_udp_timeout = 30
>> net.ipv4.netfilter.ip_conntrack_tcp_timeout_close = 10
>> net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
>> net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
>> net.ipv4.netfilter.ip_conntrack_max = 1200000
>> net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 432000
>> net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
>> net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent = 120
>> net.ipv4.tcp_keepalive_time = 90
>>
>> One other tip I can add is that the delay is always on the RX side, that
>> is, on the server that is responding.
>> So, we were thinking about raising the ring or txqueuelen settings.
>>
>> Any ideas?
>>
>>
>>
>>
>>
>>
>> *Alejandro Comisario #melicloud CloudBuilders*
>> Arias 3751, Piso 7 (C1430CRG)
>> Ciudad de Buenos Aires - Argentina
>> Cel: +549(11) 15-3770-1857
>> Tel : +54(11) 4640-8443
>>
>>
>> On Wed, Jan 15, 2014 at 12:32 AM, Narayan Desai <narayan.desai at gmail.com> wrote:
>>
>>> We don't have a workload remotely like that (generally, we have a lot
>>> more demand for bandwidth, but we also generally run faster networks than
>>> that as well), but 1k pps sounds awfully low. Like low by several orders of
>>> magnitude.
>>>
>>> I didn't measure pps in our benchmarking, but did manage to saturate a
>>> 10GE link from a VM (actually we did this on 10 nodes at a time to saturate
>>> a 100GE wide area link), and all of those settings are here:
>>>
>>> http://buriedlede.blogspot.com/2012/11/driving-100-gigabit-network-with.html
>>>
>>> I'd start trying to do some fault isolation; see if you can get NAT out
>>> of the mix, for example, or see if it is a network stack tuning problem.
>>> You probably need to crank up some of your buffer sizes, even if you don't
>>> need to mess with your TCP windows.
>>>
>>> Can you actually saturate your 2x1ge lag with bandwidth? (single or
>>> ganged flows?)
>>>  -nld
>>>
>>>
>>> On Tue, Jan 14, 2014 at 3:52 PM, Alejandro Comisario <
>>> alejandro.comisario at mercadolibre.com> wrote:
>>>
>>>> Wow, it's kind of hard to imagine we are the only ones that have only
>>>> 100Mb/s of bandwidth but 50,000 requests per minute on each compute node.
>>>> I mean, lots of throughput, almost no bandwidth.
>>>>
>>>> Does everyone have their networking performance figured out?
>>>> Is there no one to share some "SUPER THROUGHPUT" sysctl / ethtool / power
>>>> / etc settings on the compute side?
>>>>
>>>> Best regards.
>>>>
>>>> * alejandrito*
>>>>
>>>> On Sat, Jan 11, 2014 at 4:12 PM, Alejandro Comisario <
>>>> alejandro.comisario at mercadolibre.com> wrote:
>>>>
>>>>> Well, it's been a long time since we started using nova with KVM. We got
>>>>> past many thousands of VMs and still, something doesn't feel right.
>>>>> We are using Ubuntu 12.04, kernel 3.2.0-[40-48], with sysctl tuned with
>>>>> lots of parameters, and everything ... works, you could say, quite well.
>>>>>
>>>>> But here's the deal: we have a special networking scenario, that is,
>>>>> EVERYTHING IS APIS, everything is throughput, no bandwidth.
>>>>> No 2x1Gb bonded compute node gets over [200Mb/s - 400Mb/s], but each one
>>>>> is handling hundreds of thousands of requests per minute to the
>>>>> VMs.
>>>>>
>>>>> And once in a while it gives you the sensation that everything goes to
>>>>> hell: timeouts from applications here, response times from APIs going
>>>>> from 10ms to 200ms there, 20ms delays happening between the VM's eth0
>>>>> and the vnet interface, etc.
>>>>> So, since it's a massive scenario to tune, we never quite nailed down
>>>>> WHERE to apply the 1, 2 or 3 final buffer/ring/affinity tunings that make
>>>>> everything work from the compute side.
>>>>>
>>>>> I know it's a little awkward, but I'm craving and looking for
>>>>> real-life community examples of "HIGH THROUGHPUT" tuning in KVM
>>>>> scenarios, dark Linux, or someone who can help me go through
>>>>> configurations that might sound weird / unnecessary / incorrect.
>>>>>
>>>>> For those who are wondering, well ... I don't know what you have, so
>>>>> let's start with this.
>>>>>
>>>>> COMPUTE NODES (99% of them, different vendors, but ...)
>>>>> * 128/256 GB of RAM
>>>>> * 2 hexacores with HT enabled
>>>>> * 2x1Gb bonded interfaces (if you want to know the more than 20 models
>>>>> we are using, just ask)
>>>>> * Multi-queue interfaces, pinned via IRQ to different cores
>>>>> * Ubuntu 12.04, kernel 3.2.0-[40-48]
>>>>> * Linux bridges, no VLAN, no Open vSwitch
>>>>>
>>>>> I want to try to keep the networking appliances (TORs, AGGR, COREs)
>>>>> out of the picture as much as possible.
>>>>> I'm thinking "I hope this thread gets great, in time".
>>>>>
>>>>> So, ready to learn as much as I can.
>>>>> Thank you, OpenStack community, as always.
>>>>>
>>>>> alejandrito
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> OpenStack-operators mailing list
>>>> OpenStack-operators at lists.openstack.org
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>>>>
>>>>
>>>
>>
>