<div dir="ltr">WOW!! Thank you for your time Rick! Awesome answer!!    =D<div><br></div><div>I'll do these tests (with ethtool GRO / CKO) tonight but, do you think that this is the root cause of the problem?!</div><div><br>


</div><div><br></div><div>I mean, I'm seeing two distinct problems here:</div><div><br></div><div>1- Slow connectivity to the External network plus SSH lags all over the cloud (everything that passes through L3 / Namespace is problematic), and;</div>


<div><br></div><div>2- Communication between two Instances on different hypervisors (i.e. maybe it is related to this GRO / CKO thing).</div><div><br></div><div><br></div><div>So, two different problems, right?!</div><div>


<br></div><div>Thanks!</div><div>Thiago</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 25 October 2013 18:56, Rick Jones <span dir="ltr">&lt;<a href="mailto:rick.jones2@hp.com" target="_blank">rick.jones2@hp.com</a>&gt;</span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">> Listen, maybe this sounds too dumb from my part but, it is the first<br>

> time I'm talking about this stuff (like "NIC peer-into GRE" ?, or GRO<br>

> / CKO...<br>

<br>

</div>No worries.<br>

<br>

So, a slightly brief history of stateless offloads in NICs.  It may be<br>

too basic, and I may get some details wrong, but it should give the gist.<br>

<br>

Go back to the "old days" - 10 Mbit/s Ethernet was "it" (all you Token<br>

Ring fans can keep quiet :).   Systems got faster than 10 Mbit/s.  By a<br>

fair margin.  100 BT came out and it wasn't all that long before systems<br>

were faster than that, but things like interrupt rates were starting to<br>

get to be an issue for performance, so 100 BT NICs started implementing<br>

interrupt avoidance heuristics.   The next bump in network speed to 1000<br>

Mbit/s managed to get well out ahead of the systems.  All this time,<br>

while the link speeds were increasing, the IEEE was doing little to<br>

nothing to make sending and receiving Ethernet traffic any easier on the<br>

end stations (eg increasing the MTU).  It was taking just as many CPU<br>

cycles to send/receive a frame over 1000BT as it did over 100BT as it<br>

did over 10BT.<br>

<br>

&lt;insert segue about how FDDI was doing things to make life easier, as<br>

well as what the FDDI NIC vendors were doing to enable copy-free<br>

networking, here&gt;<br>

<br>

So the Ethernet NIC vendors started getting creative and started<br>

borrowing some techniques from FDDI.  The base of it all is CKO -<br>

ChecKsum Offload. Offloading the checksum calculation for the TCP and<br>

UDP checksums. In broad handwaving terms, for inbound packets, the NIC<br>

is made either smart enough to recognize an incoming frame as TCP<br>

segment (UDP datagram) or it performs the Internet Checksum across the<br>

entire frame and leaves it to the driver to fixup.  For outbound<br>

traffic, the stack, via the driver, tells the NIC a starting value<br>

(perhaps), where to start computing the checksum, how far to go, and<br>

where to stick it...<br>

<br>
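To make concrete what is being offloaded, here is a minimal sketch of the<br>
Internet Checksum (the 16-bit one's-complement sum used by TCP and UDP),<br>
written in Python.  The function name and the sample bytes are illustrative<br>
assumptions; real stacks also fold in a pseudo-header, which is omitted here.<br>
<br>
<pre>
```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement Internet checksum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += int.from_bytes(data[i:i + 2], "big")
    # Fold any carries above 16 bits back into the low 16 bits.
    while total // 0x10000:
        total = (total % 0x10000) + (total // 0x10000)
    return 0xFFFF - total


payload = bytes.fromhex("0001f203f4f5f6f7")
csum = internet_checksum(payload)
print(hex(csum))
# A receiver sums the data plus the transmitted checksum; 0 means it verified.
print(internet_checksum(payload + csum.to_bytes(2, "big")))
```
</pre>
This prints 0x220d and then 0.  The per-byte loop is the work that CKO moves<br>
from the host CPU to the NIC.<br>
<br>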

So, we can save the CPU cycles used calculating/verifying the checksums.<br>

 In rough terms, in the presence of copies, that is perhaps 10% or 15%<br>

savings.  Systems still needed more.  It was just as many trips up and<br>

down the protocol stack in the host to send a MB of data as it was<br>

before - the IEEE hanging-on to the 1500 byte MTU.  So, some NIC vendors<br>

came-up with Jumbo Frames - I think the first may have been Alteon and<br>

their AceNICs and switches.   A 9000 byte MTU allows one to send bulk<br>

data across the network in ~1/6 the number of trips up and down the<br>

protocol stack.   But that has problems - in particular you have to have<br>

support for Jumbo Frames from end to end.<br>

<br>
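As a rough check on the "~1/6" figure, the arithmetic can be sketched in<br>
Python.  The 40 bytes of header overhead (IPv4 plus TCP, no options) and the<br>
1 MiB transfer size are assumptions chosen for illustration.<br>
<br>
<pre>
```python
import math

HEADERS = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def segments_needed(nbytes: int, mtu: int) -> int:
    """Number of TCP segments (trips down the stack) to send nbytes at a given MTU."""
    return math.ceil(nbytes / (mtu - HEADERS))


one_mb = 1024 * 1024
standard = segments_needed(one_mb, 1500)
jumbo = segments_needed(one_mb, 9000)
print(standard, jumbo, round(standard / jumbo, 2))
```
</pre>
Under those assumptions this gives 719 segments at a 1500 byte MTU versus 118<br>
at 9000, a ratio of about 6.<br>
<br>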

So someone, I don't recall who, had the flash of inspiration - What<br>

If...  the NIC could perform the TCP segmentation on behalf of the<br>

stack?  When sending a big chunk of data over TCP in one direction, the<br>

only things which change from TCP segment to TCP segment are the<br>

sequence number, and the checksum &lt;insert some handwaving about the IP<br>

datagram ID here&gt;.  The NIC already knows how to compute the checksum,<br>

so let's teach it how to very simply increment the TCP sequence number.<br>

 Now we can give it A Lot of Data (tm) in one trip down the protocol<br>

stack and save even more CPU cycles than Jumbo Frames.  Now the NIC has<br>

to know a little bit more about the traffic - it has to know that it is<br>

TCP so it can know where the TCP sequence number goes.  We also tell it<br>

the MSS to use when it is doing the segmentation on our behalf.  Thus<br>

was born TCP Segmentation Offload, aka TSO or "Poor Man's Jumbo Frames"<br>

<br>
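The segmentation step described above can be sketched as follows.  This is<br>
a toy model in Python, not how any particular NIC does it: the function name<br>
and arguments are hypothetical, and header replication and checksums are<br>
left out.<br>
<br>
<pre>
```python
def tso_segment(payload: bytes, first_seq: int, mss: int):
    """Split one large send into (sequence_number, chunk) pairs of at most mss bytes."""
    segments = []
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        # Only the sequence number (and, on real hardware, the checksum) changes.
        seq = (first_seq + offset) % 2**32  # TCP sequence numbers wrap at 32 bits
        segments.append((seq, chunk))
    return segments


segs = tso_segment(b"x" * 3000, first_seq=1000, mss=1460)
print([(seq, len(chunk)) for seq, chunk in segs])
```
</pre>
One 3000 byte send becomes three segments starting at sequence numbers 1000,<br>
2460 and 3920, with only one trip down the protocol stack.<br>
<br>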

That works pretty well for servers at the time - they tend to send more<br>

data than they receive.  The clients receiving the data don't need to be<br>

able to keep up at 1000 Mbit/s and the server can be sending to multiple<br>

clients.  However, we get another order of magnitude bump in link<br>

speeds, to 10000 Mbit/s.  Now  people need/want to receive at the higher<br>

speeds too.  So some 10 Gbit/s NIC vendors come up with the mirror image<br>

of TSO and call it LRO - Large Receive Offload.   The LRO NIC will<br>

coalesce several consecutive TCP segments into one uber segment and<br>

hand that to the host. There are some "issues" with LRO though - for<br>

example when a system is acting as a router, so in Linux, and perhaps<br>

other stacks, LRO is taken out of the hands of the NIC and given to the<br>

stack in the form of "GRO" - Generic Receive Offload.  GRO operates<br>

above the NIC/driver, but below IP.   It detects the consecutive<br>

segments and coalesces them before passing them further up the stack. It<br>

becomes possible to receive data at link-rate over 10 GbE.  All is<br>

happiness and joy.<br>

<br>
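The receive-side coalescing can be sketched the same way.  This is a<br>
simplified Python model for a single flow; the function name is hypothetical,<br>
and real LRO/GRO also checks headers, flags and size limits, and handles<br>
sequence-number wrap, none of which is modelled here.<br>
<br>
<pre>
```python
def gro_coalesce(segments):
    """Merge runs of back-to-back (seq, data) segments of one flow into larger ones."""
    merged = []
    for seq, data in segments:
        if merged:
            last_seq, last_data = merged[-1]
            if last_seq + len(last_data) == seq:
                # This segment starts exactly where the previous one ended.
                merged[-1] = (last_seq, last_data + data)
                continue
        merged.append((seq, data))
    return merged


arrivals = [(1000, b"a" * 1460), (2460, b"b" * 1460), (5000, b"c" * 100)]
print([(seq, len(data)) for seq, data in gro_coalesce(arrivals)])
```
</pre>
The first two arrivals are contiguous and are handed up as one 2920 byte<br>
segment; the third does not follow on, so it stays separate.<br>
<br>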

OK, so now we have all these "stateless" offloads that know about the<br>

basic traffic flow.  They are all built on the foundation of CKO.  They<br>

are all dealing with *un*encapsulated traffic.  (They also don't do<br>

anything for small packets.)<br>

<br>

Now, toss-in some encapsulation.  Take your pick, in the abstract it<br>

doesn't really matter which I suspect, at least for a little longer.<br>

What is arriving at the NIC on inbound is no longer a TCP segment in an<br>

IP datagram in an Ethernet frame, it is all that wrapped-up in the<br>

encapsulation protocol.  Unless the NIC knows about the encapsulation<br>

protocol, all the NIC knows it has is some slightly alien packet.  It<br>

will probably know it is IP, but it won't know more than that.<br>

<br>

It could, perhaps, simply compute an Internet Checksum across the entire<br>

IP datagram and leave it to the driver to fix-up.  It could simply punt<br>

and not perform any CKO at all.  But CKO is the foundation of the<br>

stateless offloads.  So, certainly no LRO and (I think but could be<br>

wrong) no GRO.  (At least not until the Linux stack learns how to look<br>

beyond the encapsulation headers.)<br>

<br>

Similarly, consider the outbound path.  We could change the constants we<br>

tell the NIC for doing CKO perhaps, but unless it knows about the<br>

encapsulation protocol, we cannot ask it to do the TCP segmentation of<br>

TSO - it would have to start replicating not only the TCP and IP<br>

headers, but also the headers of the encapsulation protocol.  So, there<br>

goes TSO.<br>

<br>

In essence, using an encapsulation protocol takes us all the way back to<br>

the days of 100BT insofar as stateless offloads are concerned.<br>

Perhaps to the early days of 1000BT.<br>

<br>

We do have a bit more CPU grunt these days,  but for the last several<br>

years that has come primarily in the form of more cores per processor,<br>

not in the form of processors with higher and higher frequencies.  In<br>

broad handwaving terms, single-threaded performance is not growing all<br>

that much.  If at all.<br>

<br>

That is why we have things like multiple queues per NIC port now and<br>

Receive Side Scaling (RSS) or Receive Packet Scaling/Receive Flow<br>

Scaling in Linux (or Inbound Packet Scheduling/Thread Optimized Packet<br>

Scheduling in HP-UX etc etc).  RSS works by having the NIC compute a<br>

hash over selected headers of the arriving packet - perhaps the source<br>

and destination MAC addresses, perhaps the source and destination IP<br>

addresses, and perhaps the source and destination TCP ports.  But now<br>

the arriving traffic is all wrapped up in this encapsulation protocol<br>

that the NIC might not know about.  Over what should the NIC compute the<br>

hash with which to pick the queue that then picks the CPU to interrupt?<br>

 It may just punt and send all the traffic up one queue.<br>

<br>
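A toy illustration of that queue-selection problem, in Python.  The<br>
addresses, ports and queue count are made up, and zlib.crc32 is only a<br>
stand-in for whatever hash a given NIC actually implements.<br>
<br>
<pre>
```python
import zlib


def pick_queue(src_ip, dst_ip, src_port, dst_port, n_queues):
    """Map a flow's addresses and ports to a receive queue index."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    # zlib.crc32 is a stand-in; real NICs typically use a Toeplitz hash.
    return zlib.crc32(key) % n_queues


N_QUEUES = 8
# Unencapsulated flows: many distinct source ports, so many distinct hash inputs.
inner = {pick_queue("10.0.0.5", "10.0.0.9", 40000 + i, 80, N_QUEUES) for i in range(64)}
# The same traffic inside a tunnel: a NIC that cannot see past the outer header
# has only the two hypervisor addresses (GRE has no ports) to hash over.
outer = {pick_queue("192.168.1.10", "192.168.1.20", 0, 0, N_QUEUES) for _ in range(64)}
print(len(inner), len(outer))
```
</pre>
The second number is always 1: with identical outer headers every tunnelled<br>
packet lands on the same queue, however many flows are inside the tunnel.<br>
<br>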

There are similar sorts of hashes being computed at either end of a<br>

bond/aggregate/trunk.  And the switches or bonding drivers making those<br>

calculations may not know about the encapsulation protocol, so they may<br>

not be able to spread traffic across multiple links.   The information<br>

they used to use is now hidden from them by the encapsulation protocol.<br>

<br>

That then is what I was getting at when talking about NICs peering into GRE.<br>

<span class="HOEnZb"><font color="#888888"><br>

rick jones<br>

All I want for Christmas is a 32 bit VLAN ID and NICs and switches which<br>

understand it... :)<br>

</font></span></blockquote></div><br></div>