[Openstack] Directional network performance issues with Neutron + OpenvSwitch

Martinx - ジェームズ thiagocmartinsc at gmail.com
Fri Oct 25 21:37:41 UTC 2013


WOW!! Thank you for your time Rick! Awesome answer!!    =D

I'll run these tests (with ethtool GRO / CKO) tonight but, do you think that
this is the root of the problem?!
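
Before changing anything I'll probably dump the current offload settings
with a quick Python helper like this one (the interface names are just
placeholders for my environment, and I'm assuming the usual "ethtool -k"
output format):

import subprocess

# Interface names are placeholders for my setup - adjust as needed.
IFACES = ["eth0", "eth1"]

# The offload features Rick mentioned, as "ethtool -k" prints them.
FEATURES = ("rx-checksumming", "tx-checksumming",
            "tcp-segmentation-offload", "generic-receive-offload")

for iface in IFACES:
    out = subprocess.run(["ethtool", "-k", iface],
                         capture_output=True, text=True).stdout
    print("--- %s ---" % iface)
    for line in out.splitlines():
        if line.strip().startswith(FEATURES):
            print(line.strip())

Then I can toggle one feature at a time with "ethtool -K <iface> gro off"
and re-run the tests.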


I mean, I'm seeing two distinct problems here:

1- Slow connectivity to the External network plus SSH lags all over the
cloud (everything that passes through L3 / Namespace is problematic), and;

2- Communication between two Instances on different hypervisors (which
maybe is related to this GRO / CKO thing).


So, two different problems, right?!

Thanks!
Thiago


On 25 October 2013 18:56, Rick Jones <rick.jones2 at hp.com> wrote:

> > Listen, maybe this sounds too dumb on my part but, it is the first
> > time I'm talking about this stuff (like "NIC peering into GRE" ?, or GRO
> > / CKO...
>
> No worries.
>
> So, a slightly brief history of stateless offloads in NICs.  It may be
> too basic, and I may get some details wrong, but it should give the gist.
>
> Go back to the "old days" - 10 Mbit/s Ethernet was "it" (all you Token
> Ring fans can keep quiet :).   Systems got faster than 10 Mbit/s.  By a
> fair margin.  100 BT came out and it wasn't all that long before systems
> were faster than that, but things like interrupt rates were starting to
> get to be an issue for performance, so 100 BT NICs started implementing
> interrupt avoidance heuristics.   The next bump in network speed to 1000
> Mbit/s managed to get well out ahead of the systems.  All this time,
> while the link speeds were increasing, the IEEE was doing little to
> nothing to make sending and receiving Ethernet traffic any easier on the
> end stations (eg increasing the MTU).  It was taking just as many CPU
> cycles to send/receive a frame over 1000BT as it did over 100BT as it
> did over 10BT.
>
> <insert segue about how FDDI was doing things to make life easier, as
> well as what the FDDI NIC vendors were doing to enable copy-free
> networking, here>
>
> So the Ethernet NIC vendors started getting creative and started
> borrowing some techniques from FDDI.  The base of it all is CKO -
> ChecKsum Offload. Offloading the checksum calculation for the TCP and
> UDP checksums.  In broad handwaving terms, for inbound packets, the NIC
> is either made smart enough to recognize an incoming frame as a TCP
> segment (or UDP datagram), or it performs the Internet Checksum across the
> entire frame and leaves it to the driver to fix up.  For outbound
> traffic, the stack, via the driver, tells the NIC a starting value
> (perhaps), where to start computing the checksum, how far to go, and
> where to stick it...
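>
> In the same broad handwaving terms, the arithmetic the NIC takes off the
> CPU's hands is just the ones'-complement sum from RFC 1071.  A rough
> Python sketch, purely as an illustration (ignoring the pseudo-header and
> every other real-world detail):
>
> def internet_checksum(data: bytes) -> int:
>     """Ones'-complement sum over 16-bit words (RFC 1071) - sketch only."""
>     if len(data) % 2:
>         data += b"\x00"                    # pad odd-length input
>     total = 0
>     for i in range(0, len(data), 2):       # add up the 16-bit words
>         total += (data[i] << 8) | data[i + 1]
>     while total >> 16:                     # fold the carries back in
>         total = (total & 0xFFFF) + (total >> 16)
>     return ~total & 0xFFFF                 # ones'-complement of the sum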
>
> So, we can save the CPU cycles used calculating/verifying the checksums.
>  In rough terms, in the presence of copies, that is perhaps 10% or 15%
> savings.  Systems still needed more.  It was just as many trips up and
> down the protocol stack in the host to send a MB of data as it was
> before - the IEEE hanging-on to the 1500 byte MTU.  So, some NIC vendors
> came-up with Jumbo Frames - I think the first may have been Alteon and
> their AceNICs and switches.   A 9000 byte MTU allows one to send bulk
> data across the network in ~1/6 the number of trips up and down the
> protocol stack.   But that has problems - in particular you have to have
> support for Jumbo Frames from end to end.
>
> So someone, I don't recall who, had the flash of inspiration - What
> If...  the NIC could perform the TCP segmentation on behalf of the
> stack?  When sending a big chunk of data over TCP in one direction, the
> only things which change from TCP segment to TCP segment are the
> sequence number, and the checksum <insert some handwaving about the IP
> datagram ID here>.  The NIC already knows how to compute the checksum,
> so let's teach it how to very simply increment the TCP sequence number.
>  Now we can give it A Lot of Data (tm) in one trip down the protocol
> stack and save even more CPU cycles than Jumbo Frames.  Now the NIC has
> to know a little bit more about the traffic - it has to know that it is
> TCP so it can know where the TCP sequence number goes.  We also tell it
> the MSS to use when it is doing the segmentation on our behalf.  Thus
> was born TCP Segmentation Offload, aka TSO or "Poor Man's Jumbo Frames".
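>
> In equally handwaving Python, the work we are asking the NIC to take
> over amounts to something like this (an illustration of the idea, not
> anything resembling real driver or firmware code):
>
> def tso_segment(payload: bytes, start_seq: int, mss: int):
>     """Chop one big send into MSS-sized segments, bumping the sequence
>     number each time - the bookkeeping TSO moves into the NIC (sketch)."""
>     seq = start_seq
>     for off in range(0, len(payload), mss):
>         chunk = payload[off:off + mss]
>         # Real hardware also replicates the IP/TCP headers and computes
>         # the checksum (CKO) for every segment it emits.
>         yield {"seq": seq, "len": len(chunk), "data": chunk}
>         seq += len(chunk)
>
> # e.g. a 64 KB send with a 1448 byte MSS -> 46 wire segments from a
> # single trip down the protocol stack:
> segments = list(tso_segment(b"x" * 65536, start_seq=1000, mss=1448))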
>
> That works pretty well for servers at the time - they tend to send more
> data than they receive.  The clients receiving the data don't need to be
> able to keep up at 1000 Mbit/s and the server can be sending to multiple
> clients.  However, we get another order of magnitude bump in link
> speeds, to 10000 Mbit/s.  Now  people need/want to receive at the higher
> speeds too.  So some 10 Gbit/s NIC vendors come up with the mirror image
> of TSO and call it LRO - Large Receive Offload.   The LRO NIC will
> coalesce several consecutive TCP segments into one uber segment and
> hand that to the host.  There are some "issues" with LRO though - for
> example when a system is acting as a router - so in Linux, and perhaps
> other stacks, LRO is taken out of the hands of the NIC and given to the
> stack in the form of "GRO" - Generic Receive Offload.  GRO operates
> above the NIC/driver, but below IP.   It detects the consecutive
> segments and coalesces them before passing them further up the stack. It
> becomes possible to receive data at link-rate over 10 GbE.  All is
> happiness and joy.
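>
> Again purely as a sketch, the coalescing GRO/LRO does is conceptually
> just "is this segment the continuation of the one before it?":
>
> def gro_coalesce(segments):
>     """Glue back-to-back TCP segments of one flow into an uber segment
>     before handing it further up the stack (conceptual sketch only)."""
>     merged = []
>     for seg in segments:                    # seg: {"seq": int, "data": bytes}
>         prev = merged[-1] if merged else None
>         if prev and seg["seq"] == prev["seq"] + len(prev["data"]):
>             prev["data"] += seg["data"]     # consecutive: coalesce
>         else:
>             merged.append(dict(seg))        # gap or new flow: start fresh
>     return merged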
>
> OK, so now we have all these "stateless" offloads that know about the
> basic traffic flow.  They are all built on the foundation of CKO.  They
> are all dealing with *un* encapsulated traffic.  (They also don't do
> anything for small packets.)
>
> Now, toss-in some encapsulation.  Take your pick, in the abstract it
> doesn't really matter which I suspect, at least for a little longer.
> What is arriving at the NIC on inbound is no longer a TCP segment in an
> IP datagram in an Ethernet frame, it is all that wrapped-up in the
> encapsulation protocol.  Unless the NIC knows about the encapsulation
> protocol, all the NIC knows it has is some slightly alien packet.  It
> will probably know it is IP, but it won't know more than that.
>
> It could, perhaps, simply compute an Internet Checksum across the entire
> IP datagram and leave it to the driver to fix up.  It could simply punt
> and not perform any CKO at all.  But CKO is the foundation of the
> stateless offloads.  So, certainly no LRO and (I think but could be
> wrong) no GRO.  (At least not until the Linux stack learns how to look
> beyond the encapsulation headers.)
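>
> To make that concrete with the same sort of handwaving sketch, a NIC
> that only understands plain TCP/UDP-in-IP gives up as soon as it sees
> the GRE protocol number in the outer header:
>
> IPPROTO_TCP, IPPROTO_UDP, IPPROTO_GRE = 6, 17, 47
>
> def nic_can_offload(outer_ip_protocol: int) -> bool:
>     """A NIC that doesn't speak the encapsulation only sees the outer IP
>     protocol field - 47 (GRE) rather than 6 (TCP) - so the inner TCP
>     segment never gets a look-in (sketch only)."""
>     return outer_ip_protocol in (IPPROTO_TCP, IPPROTO_UDP)
>
> print(nic_can_offload(IPPROTO_TCP))   # True  - plain traffic, offloads apply
> print(nic_can_offload(IPPROTO_GRE))   # False - encapsulated, back to software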
>
> Similarly, consider the outbound path.  We could change the constants we
> tell the NIC for doing CKO perhaps, but unless it knows about the
> encapsulation protocol, we cannot ask it to do the TCP segmentation of
> TSO - it would have to start replicating not only the TCP and IP
> headers, but also the headers of the encapsulation protocol.  So, there
> goes TSO.
>
> In essence, using an encapsulation protocol takes us all the way back to
> the days of 100BT in so far as stateless offloads are concerned.
> Perhaps to the early days of 1000BT.
>
> We do have a bit more CPU grunt these days,  but for the last several
> years that has come primarily in the form of more cores per processor,
> not in the form of processors with higher and higher frequencies.  In
> broad handwaving terms, single-threaded performance is not growing all
> that much.  If at all.
>
> That is why we have things like multiple queues per NIC port now and
> Receive Side Scaling (RSS) or Receive Packet Steering/Receive Flow
> Steering in Linux (or Inbound Packet Scheduling/Thread Optimized Packet
> Scheduling in HP-UX etc etc).  RSS works by having the NIC compute a
> hash over selected headers of the arriving packet - perhaps the source
> and destination MAC addresses, perhaps the source and destination IP
> addresses, and perhaps the source and destination TCP ports.  But now
> the arriving traffic is all wrapped up in this encapsulation protocol
> that the NIC might not know about.  Over what should the NIC compute the
> hash with which to pick the queue that then picks the CPU to interrupt?
>  It may just punt and send all the traffic up one queue.
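>
> Handwaving once more, the queue selection is conceptually something
> like the sketch below, which also shows why a GRE tunnel between two
> hypervisors tends to collapse everything onto a single queue:
>
> def rss_queue(src_ip, dst_ip, src_port, dst_port, n_queues=8):
>     """Pick a receive queue from a hash of the headers the NIC can see.
>     Illustration only - real hardware uses a keyed Toeplitz hash."""
>     return hash((src_ip, dst_ip, src_port, dst_port)) % n_queues
>
> # Plain traffic: different flows spread across the queues/CPUs.
> print(rss_queue("10.0.0.1", "10.0.0.2", 34567, 80))
> print(rss_queue("10.0.0.3", "10.0.0.2", 45678, 80))
>
> # GRE between two hypervisors: the NIC only sees the outer tunnel
> # endpoints (and no ports), so every inner flow lands on the same queue.
> print(rss_queue("192.168.1.10", "192.168.1.20", 0, 0))
> print(rss_queue("192.168.1.10", "192.168.1.20", 0, 0))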
>
> There are similar sorts of hashes being computed at either end of a
> bond/aggregate/trunk.  And the switches or bonding drivers making those
> calculations may not know about the encapsulation protocol, so they may
> not be able to spread traffic across multiple links.   The information
> they used to use is now hidden from them by the encapsulation protocol.
>
> That then is what I was getting at when talking about NICs peering into
> GRE.
>
> rick jones
> All I want for Christmas is a 32 bit VLAN ID and NICs and switches which
> understand it... :)
>