[Openstack] Swift performance for very small objects

Rick Jones rick.jones2 at hp.com
Mon May 21 17:18:50 UTC 2012


On 05/19/2012 06:34 PM, Paulo Ricardo Motta Gomes wrote:
> Hello,
>
> I'm doing some experiments in a Swift cluster testbed of 9 nodes/devices
> and 3 zones (3 nodes on each zone).
>
> In one of my tests, I noticed that PUTs of very small objects are
> extremely inefficient.
>
> - 5000 PUTs of objects with an average size of 40K - total of 195MB -
> took 67s (avg time per request: 0.0135s)
> - 5000 PUTS of objects with an average size of 190 bytes - total of
> 930KB - took 60s (avg time per request: 0.0123s)
>
> I plotted object size vs request time and found that there is a
> significant difference in request times only after 200KB. When objects
> are smaller than this, PUT requests have a minimum execution time of
> 0.01s, no matter the object size.
> I suppose Swift is not optimized for such small objects, but I wonder
> what the main cause for this is - the HTTP overhead or disk writing. I
> checked the log of the object servers and requests are taking an
> average of 0.006s, whether objects are 40K or 190 bytes, which
> indicates part of the bottleneck could be at the disk. Currently I'm
> using a loopback device for storage.
>
> I thought that maybe this could be improved a bit if the proxy server
> maintained persistent connections to the storage nodes instead of
> opening a new one for each request?

Persistent TCP connections do nothing to improve disc performance, so they
would be unlikely to have much of an effect if indeed the bulk of the
response time is spent getting things to disc on the server.

If you take a tcpdump trace you will be able to see how much of the time
to service a request is taken up by the TCP three-way handshake.
You would take the time from the client's send of the SYNchronize
segment to the time the SYN|ACKnowledgement segment arrives from the
server.  You can also then look at the time between when the last byte
of the request is sent and when the first/last byte of the response arrives.
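For what it is worth, those two intervals can also be pulled out of the
capture programmatically rather than by eyeballing the trace.  Here is a
minimal sketch, assuming scapy is available and that the capture holds a
single PUT against a proxy listening on a hypothetical port 8080 - the
pcap path and port are placeholders, not anything from your setup:

#!/usr/bin/env python
# Minimal sketch: pull the handshake time and the request->response time
# out of a client-side capture of a single PUT.  Assumes scapy; the pcap
# path and proxy port are hypothetical placeholders.
from scapy.all import rdpcap, TCP

PCAP = "/tmp/put.pcap"      # client-side capture of one PUT
PROXY_PORT = 8080           # whatever port your proxy listens on

pkts = [p for p in rdpcap(PCAP) if p.haslayer(TCP)]

# SYN from the client (SYN set, ACK clear), SYN|ACK back from the server.
syn = next(p for p in pkts
           if p[TCP].dport == PROXY_PORT
           and p[TCP].flags & 0x02 and not p[TCP].flags & 0x10)
synack = next(p for p in pkts
              if p[TCP].sport == PROXY_PORT
              and p[TCP].flags & 0x02 and p[TCP].flags & 0x10)
print("three-way handshake: %.6f s" % (float(synack.time) - float(syn.time)))

# Last byte of the request going out vs. first byte of the response back.
req = [p for p in pkts
       if p[TCP].dport == PROXY_PORT and len(bytes(p[TCP].payload))]
rsp = [p for p in pkts
       if p[TCP].sport == PROXY_PORT and len(bytes(p[TCP].payload))]
print("last request byte -> first response byte: %.6f s"
      % (float(rsp[0].time) - float(req[-1].time)))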

If you wanted, you could probably compare that with the likes of, say, a
netperf TCP_CRR test, which exercises only TCP.  The netperf TCP_CRR test
opens a TCP connection, sends a request of a given size, receives a
response of a given size, and then closes the connection.

As an example, here is a TCP_CRR test between my laptop and my
workstation (100 Mbit/s), with netperf configured to include histogram
support via configure --enable-histogram.  I just picked a number out of
the ether for the sizes:

raj at tardy:~/netperf2_trunk$ src/netperf -H raj-8510w.americas.hpqcorp.net -t TCP_CRR -l 30 -v 2 -- -r `expr 190 + 128`,128
MIGRATED TCP Connect/Request/Response TEST from 0.0.0.0 (0.0.0.0) port 0 
AF_INET to internal-host.americas.hpqcorp.net () port 0 AF_INET : histogram
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  318      128     30.00    2340.51
16384  87380

Alignment      Offset
Local  Remote  Local  Remote
Send   Recv    Send   Recv
     8      8       0      0

Histogram of request/response times
UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_USEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_USEC  :    0:    0:    0: 18238: 47290: 3945:  486:  128:   34:   20
UNIT_MSEC     :    0:   47:   14:    1:    6:    2:    2:    0:    1:    1
TEN_MSEC      :    0:    0:    1:    0:    0:    0:    0:    0:    0:    0
HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
 >100_SECS: 0
HIST_TOTAL:      70216

So, on average it was 1/2340 or 0.00043 seconds per complete transaction
just doing things at the TCP level, without any OpenStack code or disc
I/O etc.  There was a tail, but too short to be TCP retransmissions; my
bet would be task scheduling, as my workstation was not otherwise idle.

In the midst of all that I did:

raj at tardy:~/netperf2_trunk$ sudo tcpdump -i eth0 -c 20 -w /tmp/example.pcap

and post-processing that with:

sudo tcpdump -i eth0 -r /tmp/example.pcap -ttt

here is one of those transactions:

raj at tardy:~/netperf2_trunk$ sudo tcpdump -i eth0 -r /tmp/example.pcap -ttt port 63978
reading from file /tmp/example.pcap, link-type EN10MB (Ethernet)
00:00:00.000000 IP tardy.63978 > 
internal-host.americas.hpqcorp.net.35594: Flags [SEW], seq 1709857287, 
win 14600, options [mss 1460,sackOK,TS val 111970417 ecr 0,nop,wscale 
7], length 0
00:00:00.000097 IP internal-host.americas.hpqcorp.net.35594 > 
tardy.63978: Flags [S.E], seq 3540302987, ack 1709857288, win 14480, 
options [mss 1460,sackOK,TS val 428732549 ecr 111970417,nop,wscale 5], 
length 0
00:00:00.000007 IP tardy.63978 > 
internal-host.americas.hpqcorp.net.35594: Flags [.], ack 1, win 115, 
options [nop,nop,TS val 111970417 ecr 428732549], length 0
00:00:00.000010 IP tardy.63978 > 
internal-host.americas.hpqcorp.net.35594: Flags [P.], seq 1:319, ack 1, 
win 115, options [nop,nop,TS val 111970417 ecr 428732549], length 318
00:00:00.000158 IP internal-host.americas.hpqcorp.net.35594 > 
tardy.63978: Flags [.], ack 319, win 486, options [nop,nop,TS val 
428732549 ecr 111970417], length 0
00:00:00.000062 IP internal-host.americas.hpqcorp.net.35594 > 
tardy.63978: Flags [P.], seq 1:129, ack 319, win 486, options 
[nop,nop,TS val 428732549 ecr 111970417], length 128
00:00:00.000004 IP internal-host.americas.hpqcorp.net.35594 > 
tardy.63978: Flags [F.], seq 129, ack 319, win 486, options [nop,nop,TS 
val 428732549 ecr 111970417], length 0
00:00:00.000006 IP tardy.63978 > 
internal-host.americas.hpqcorp.net.35594: Flags [.], ack 129, win 123, 
options [nop,nop,TS val 111970417 ecr 428732549], length 0
00:00:00.000008 IP tardy.63978 > 
internal-host.americas.hpqcorp.net.35594: Flags [F.], seq 319, ack 130, 
win 123, options [nop,nop,TS val 111970417 ecr 428732549], length 0

In this LAN-based case there was a very short time between the SYN (the
first line, with "SEW" in it; the E and W relate to Explicit Congestion
Notification, which is enabled on my systems) and the second line, which
is the SYN|ACK (and acceptance of ECN for this connection).  I used "port
63978" as a packet filter to keep the relative timestamps helpful -
otherwise, a last ACK from a previous connection would have been in there.

Anyway, it was only 97+7 microseconds (0.000104 s) before the connection
was established from the point of view of the client, and only another
10 microseconds after that before the request was on its way.  (Strictly
speaking, past the tracepoint in the stack running on the client, but I
rather doubt it was much longer before it was actually through the NIC
and on the wire.)

> It would be great if you could share your thoughts on this and how could
> the performance of this special case be improved.

You should repeat the above (at least the tcpdump of your actual PUTs, if
not also the pure netperf test) on your setup.  I am as much a fan of
persistent connections as anyone, but I think you will find that the wall
clock time spent in TCP connection setup is not the biggest component of
the response-time floor you have found.
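If you want a quick relative check of how much connection setup is
actually costing on your cluster, something along these lines would show
it directly.  This is a rough sketch only; the proxy host/port, container
path and auth token below are hypothetical placeholders, not anything
from your configuration:

#!/usr/bin/env python
# Rough sketch: time small PUTs using a fresh connection per request versus
# a single reused (persistent) connection.  Host, port, container path and
# token are hypothetical placeholders - substitute your own values.
import http.client
import time

HOST, PORT = "proxy.example.com", 8080
PATH = "/v1/AUTH_test/mycontainer/obj-%d"
HEADERS = {"X-Auth-Token": "AUTH_tk_placeholder",
           "Content-Type": "application/octet-stream"}
BODY = b"x" * 190
N = 100

def fresh_connection_puts():
    start = time.time()
    for i in range(N):
        conn = http.client.HTTPConnection(HOST, PORT)
        conn.request("PUT", PATH % i, body=BODY, headers=HEADERS)
        conn.getresponse().read()      # drain so the timing is end-to-end
        conn.close()
    return (time.time() - start) / N

def persistent_connection_puts():
    start = time.time()
    conn = http.client.HTTPConnection(HOST, PORT)
    for i in range(N):
        conn.request("PUT", PATH % i, body=BODY, headers=HEADERS)
        conn.getresponse().read()      # must drain for keep-alive reuse
    conn.close()
    return (time.time() - start) / N

if __name__ == "__main__":
    print("fresh connection per PUT:  %.4f s/request" % fresh_connection_puts())
    print("one persistent connection: %.4f s/request" % persistent_connection_puts())

If the two averages come out close together, the floor you are seeing is
somewhere other than connection setup.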

The tcpdump suggestion above was for the client.  If you also take a
tcpdump trace at the server, you can see the length of time inside the
server between when the request arrived off the wire and when the
response was sent (queued to the driver at least).

Caveat - you should not try to do math between absolute timestamps on 
the client and on the server unless you know that the two systems have 
really, Really, REALLY well synchronized clocks...

rick jones
http://www.netperf.org/

If you are still reading, for grins here is the TCP_RR test for the same
sizes - just no connection establishment overhead:
raj at tardy:~/netperf2_trunk$ src/netperf -H raj-8510w.americas.hpqcorp.net -t TCP_RR -l 30 -v 2 -- -r `expr 190 + 128`,128
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET 
to internal-host.americas.hpqcorp.net () port 0 AF_INET : histogram : 
first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  318      128     30.00    4114.35
16384  87380
Alignment      Offset         RoundTrip  Trans    Throughput
Local  Remote  Local  Remote  Latency    Rate     10^6bits/s
Send   Recv    Send   Recv    usec/Tran  per sec  Outbound   Inbound
     8      0       0      0   243.052   4114.347 10.467    4.213

Histogram of request/response times
UNIT_USEC     :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_USEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_USEC  :    0:  774: 117564: 4923:  102:   25:    9:    4:    4:    7
UNIT_MSEC     :    0:   14:    2:    1:    0:    0:    0:    1:    1:    1
TEN_MSEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
HUNDRED_MSEC  :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
UNIT_SEC      :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
TEN_SEC       :    0:    0:    0:    0:    0:    0:    0:    0:    0:    0
 >100_SECS: 0
HIST_TOTAL:      123432

Certainly, when/if there is very little service time in the server, a
persistent connection will be "faster" - it will send rather fewer
segments back and forth per transaction.  But if the service times are
rather larger than the network round-trip times involved, the gains from
a persistent connection will be minimized.
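To put a rough number on that using the two runs above and the per-request
floor from your test - your round-trip times will differ from my LAN's, so
treat this as a sketch of scale rather than a measurement of your cluster:

# Back-of-the-envelope arithmetic from the numbers above (a sketch of
# scale, not a measurement of the cluster in question).
crr = 1.0 / 2340.51     # TCP_CRR: connect + request/response  ~0.000427 s/tran
rr  = 1.0 / 4114.35     # TCP_RR: request/response only        ~0.000243 s/tran
setup_cost = crr - rr   # ~0.000184 s attributable to connection setup
put_floor = 0.0123      # the ~0.012 s/request reported for 190-byte PUTs
print("connection setup share of the floor: %.1f%%"
      % (100.0 * setup_cost / put_floor))
# prints roughly 1.5% - small compared with the observed per-PUT floor

Even with rather larger round-trip times than my LAN's, connection setup
would remain a modest slice of a ~12 millisecond floor.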



