[Openstack] Directional network performance issues with Neutron + OpenvSwitch
rick.jones2 at hp.com
Fri Oct 25 20:56:20 UTC 2013
> Listen, maybe this sounds too dumb from my part but, it is the first
> time I'm talking about this stuff (like "NIC peer-into GRE" ?, or GRO
> / CKO...
So, a slightly brief history of stateless offloads in NICs. It may be
too basic, and I may get some details wrong, but it should give the gist.
Go back to the "old days" - 10 Mbit/s Ethernet was "it" (all you Token
Ring fans can keep quiet :). Systems got faster than 10 Mbit/s. By a
fair margin. 100 BT came out and it wasn't all that long before systems
were faster than that, but things like interrupt rates were starting to
get to be an issue for performance, so 100 BT NICs started implementing
interrupt avoidance heuristics. The next bump in network speed to 1000
Mbit/s managed to get well out ahead of the systems. All this time,
while the link speeds were increasing, the IEEE was doing little to
nothing to make sending and receiving Ethernet traffic any easier on the
end stations (eg increasing the MTU). It was taking just as many CPU
cycles to send/receive a frame over 1000BT as it did over 100BT as it
did over 10BT.
<insert segue about how FDDI was doing things to make life easier, as
well as what the FDDI NIC vendors were doing to enable copy-free
So the Ethernet NIC vendors started getting creative and started
borrowing some techniques from FDDI. The base of it all is CKO -
ChecKsum Offload. Offloading the checksum calculation for the TCP and
UDP checksums. In broad handwaving terms, for inbound packets, the NIC
is either made smart enough to recognize an incoming frame as a TCP
segment (or UDP datagram), or it performs the Internet Checksum across the
entire frame and leaves it to the driver to fix up. For outbound
traffic, the stack, via the driver, tells the NIC a starting value
(perhaps), where to start computing the checksum, how far to go, and
where to stick it...
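
To make the outbound case a bit more concrete, here is a minimal Python sketch of the Internet Checksum (the 16-bit ones'-complement sum described in RFC 1071) plus a toy stand-in for the request the stack hands to the NIC. The names internet_checksum and fill_checksum, and their parameters, are made up for illustration; a real driver passes this information in a hardware-specific descriptor.

```python
def internet_checksum(data: bytes, initial: int = 0) -> int:
    """Ones'-complement sum of 16-bit big-endian words, then complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = initial      # the "starting value", e.g. a pseudo-header sum
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:   # fold the carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def fill_checksum(frame: bytearray, start: int, length: int,
                  store_at: int, seed: int = 0) -> None:
    """Toy model of outbound CKO (hypothetical interface): checksum
    frame[start:start+length] and write the result at frame[store_at]."""
    frame[store_at:store_at + 2] = b"\x00\x00"  # field counts as zero while summing
    csum = internet_checksum(bytes(frame[start:start + length]), seed)
    frame[store_at:store_at + 2] = csum.to_bytes(2, "big")
```

A receiver can verify by running internet_checksum over the same span with the checksum field left in place; a result of 0 means the data and checksum agree.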
So, we can save the CPU cycles used calculating/verifying the checksums.
In rough terms, in the presence of copies, that is perhaps 10% or 15%
savings. Systems still needed more. It was just as many trips up and
down the protocol stack in the host to send a MB of data as it was
before - the IEEE hanging-on to the 1500 byte MTU. So, some NIC vendors
came-up with Jumbo Frames - I think the first may have been Alteon and
their AceNICs and switches. A 9000 byte MTU allows one to send bulk
data across the network in ~1/6 the number of trips up and down the
protocol stack. But that has problems - in particular you have to have
support for Jumbo Frames from end to end.
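
The ~1/6 figure can be checked with a little arithmetic. The sketch below assumes 40 bytes of headers per segment (a 20-byte IPv4 header plus a 20-byte TCP header, no options) and takes "a MB" to be 1,000,000 bytes; both are assumptions for illustration, and segments_needed is a made-up helper name.

```python
HEADERS = 40  # assumed: 20-byte IPv4 header + 20-byte TCP header, no options


def segments_needed(nbytes: int, mtu: int) -> int:
    """Number of full-MTU TCP segments needed to carry nbytes of payload."""
    mss = mtu - HEADERS
    return (nbytes + mss - 1) // mss  # integer ceiling division


standard = segments_needed(1_000_000, 1500)  # 1460-byte MSS
jumbo = segments_needed(1_000_000, 9000)     # 8960-byte MSS
print(standard, jumbo, round(jumbo / standard, 3))
```

Under those assumptions that is 685 segments at a 1500 byte MTU versus 112 at 9000 bytes, a ratio of roughly 0.16.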
So someone, I don't recall who, had the flash of inspiration - What
If... the NIC could perform the TCP segmentation on behalf of the
stack? When sending a big chunk of data over TCP in one direction, the
only things which change from TCP segment to TCP segment are the
sequence number, and the checksum <insert some handwaving about the IP
datagram ID here>. The NIC already knows how to compute the checksum,
so let's teach it how to very simply increment the TCP sequence number.
Now we can give it A Lot of Data (tm) in one trip down the protocol
stack and save even more CPU cycles than Jumbo Frames. Now the NIC has
to know a little bit more about the traffic - it has to know that it is
TCP so it can know where the TCP sequence number goes. We also tell it
the MSS to use when it is doing the segmentation on our behalf. Thus
was born TCP Segmentation Offload, aka TSO or "Poor Man's Jumbo Frames".
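
As a rough sketch of what the NIC is being asked to do, the Python below chops one large send into MSS-sized pieces and advances the sequence number for each. The function name tso_segment is hypothetical, and header replication, the checksum and the IP datagram ID are all left out.

```python
from typing import Iterator, Tuple


def tso_segment(start_seq: int, payload: bytes,
                mss: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (sequence number, data) for each MSS-sized slice of payload."""
    for offset in range(0, len(payload), mss):
        chunk = payload[offset:offset + mss]
        # TCP sequence numbers are 32 bits wide and wrap around
        yield (start_seq + offset) % 2**32, chunk


# One trip down the stack with 4000 bytes becomes three segments on the wire.
for seq, chunk in tso_segment(1000, bytes(4000), 1460):
    print(seq, len(chunk))
```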
That works pretty well for servers at the time - they tend to send more
data than they receive. The clients receiving the data don't need to be
able to keep up at 1000 Mbit/s and the server can be sending to multiple
clients. However, we get another order of magnitude bump in link
speeds, to 10000 Mbit/s. Now people need/want to receive at the higher
speeds too. So some 10 Gbit/s NIC vendors come up with the mirror image
of TSO and call it LRO - Large Receive Offload. The LRO NIC will
coalesce several consecutive TCP segments into one uber segment and
hand that to the host. There are some "issues" with LRO though - for
example when a system is acting as a router, so in Linux, and perhaps
other stacks, LRO is taken out of the hands of the NIC and given to the
stack in the form of "GRO" - Generic Receive Offload. GRO operates
above the NIC/driver, but below IP. It detects the consecutive
segments and coalesces them before passing them further up the stack. It
becomes possible to receive data at link-rate over 10 GbE. All is
happiness and joy.
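
The coalescing idea can be sketched in a few lines of Python. This is only an illustration for a single flow: gro_coalesce is a made-up name, and a real implementation also has to match flows, compare header fields and flags, and cap the size of the merged segment.

```python
from typing import List, Tuple

Segment = Tuple[int, bytes]  # (TCP sequence number, payload)


def gro_coalesce(segments: List[Segment]) -> List[Segment]:
    """Merge segments that directly follow one another in sequence space."""
    merged: List[Segment] = []
    for seq, data in segments:
        if merged:
            last_seq, last_data = merged[-1]
            # consecutive if this segment starts where the previous one ended
            if (last_seq + len(last_data)) % 2**32 == seq:
                merged[-1] = (last_seq, last_data + data)
                continue
        merged.append((seq, data))
    return merged


arrived = [(1000, b"a" * 1460), (2460, b"b" * 1460), (5000, b"c" * 100)]
for seq, data in gro_coalesce(arrived):
    print(seq, len(data))
```

The first two segments are consecutive and go up the stack as one; the third leaves a gap in sequence space and is passed up on its own.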
OK, so now we have all these "stateless" offloads that know about the
basic traffic flow. They are all built on the foundation of CKO. They
are all dealing with *un* encapsulated traffic. (They also don't do
anything for small packets.)
Now, toss-in some encapsulation. Take your pick, in the abstract it
doesn't really matter which I suspect, at least for a little longer.
What is arriving at the NIC on inbound is no longer a TCP segment in an
IP datagram in an Ethernet frame, it is all that wrapped-up in the
encapsulation protocol. Unless the NIC knows about the encapsulation
protocol, all the NIC knows it has is some slightly alien packet. It
will probably know it is IP, but it won't know more than that.
It could, perhaps, simply compute an Internet Checksum across the entire
IP datagram and leave it to the driver to fix-up. It could simply punt
and not perform any CKO at all. But CKO is the foundation of the
stateless offloads. So, certainly no LRO and (I think but could be
wrong) no GRO. (At least not until the Linux stack learns how to look
beyond the encapsulation headers.)
Similarly, consider the outbound path. We could change the constants we
tell the NIC for doing CKO perhaps, but unless it knows about the
encapsulation protocol, we cannot ask it to do the TCP segmentation of
TSO - it would have to start replicating not only the TCP and IP
headers, but also the headers of the encapsulation protocol. So, there
In essence, using an encapsulation protocol takes us all the way back to
the days of 100BT in so far as stateless offloads are concerned.
Perhaps to the early days of 1000BT.
We do have a bit more CPU grunt these days, but for the last several
years that has come primarily in the form of more cores per processor,
not in the form of processors with higher and higher frequencies. In
broad handwaving terms, single-threaded performance is not growing all
that much. If at all.
That is why we have things like multiple queues per NIC port now and
Receive Side Scaling (RSS) or Receive Packet Steering/Receive Flow
Steering in Linux (or Inbound Packet Scheduling/Thread Optimized Packet
Scheduling in HP-UX etc etc). RSS works by having the NIC compute a
hash over selected headers of the arriving packet - perhaps the source
and destination MAC addresses, perhaps the source and destination IP
addresses, and perhaps the source and destination TCP ports. But now
the arriving traffic is all wrapped up in this encapsulation protocol
that the NIC might not know about. Over what should the NIC compute the
hash with which to pick the queue that then picks the CPU to interrupt?
It may just punt and send all the traffic up one queue.
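
A small Python sketch of the effect, under loud assumptions: zlib.crc32 stands in for whatever hash the NIC actually implements (real RSS hardware commonly uses a Toeplitz hash), pick_queue is a made-up name, and all the addresses are invented. The point is only what happens to the hash input once the NIC can see nothing but the outer IP header.

```python
import zlib


def pick_queue(headers: tuple, num_queues: int) -> int:
    """Hash the header fields the NIC can see and map them to a queue.
    crc32 is a stand-in for the NIC's real hash function."""
    key = "|".join(str(h) for h in headers).encode()
    return zlib.crc32(key) % num_queues


NUM_QUEUES = 4

# 64 inbound TCP flows: (src IP, dst IP, src port, dst port)
flows = [("10.0.0.2", "10.0.0.1", 80, 40000 + i) for i in range(64)]
plain_queues = {pick_queue(f, NUM_QUEUES) for f in flows}

# The same flows inside a tunnel between two hosts. A NIC that does not
# understand the encapsulation sees only the outer addresses for every packet.
outer = ("192.0.2.1", "192.0.2.2")
tunnel_queues = {pick_queue(outer, NUM_QUEUES) for _ in flows}

print(len(plain_queues), len(tunnel_queues))
```

Unencapsulated, the flows land on more than one queue; encapsulated, every packet hashes the same and lands on a single queue.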
There are similar sorts of hashes being computed at either end of a
bond/aggregate/trunk. And the switches or bonding drivers making those
calculations may not know about the encapsulation protocol, so they may
not be able to spread traffic across multiple links. The information
they used to use is now hidden from them by the encapsulation protocol.
That then is what I was getting at when talking about NICs peering into GRE.
All I want for Christmas is a 32 bit VLAN ID and NICs and switches which
understand it... :)