[openstack-dev] [tc] supporting Go

Samuel Merritt sam at swiftstack.com
Wed May 11 19:23:19 UTC 2016

On 5/11/16 7:09 AM, Thomas Goirand wrote:
> On 05/10/2016 09:56 PM, Samuel Merritt wrote:
>> On 5/9/16 5:21 PM, Robert Collins wrote:
>>> On 10 May 2016 at 10:54, John Dickinson <me at not.mn> wrote:
>>>> On 9 May 2016, at 13:16, Gregory Haynes wrote:
>>>>> This is a bit of an aside but I am sure others are wondering the same
>>>>> thing - Is there some info (specs/etherpad/ML thread/etc) that has more
>>>>> details on the bottleneck you're running in to? Given that the only
>>>>> clients of your service are the public facing DNS servers I am now even
>>>>> more surprised that you're hitting a python-inherent bottleneck.
>>>> In Swift's case, the summary is that it's hard[0] to write a network
>>>> service in Python that shuffles data between the network and a block
>>>> device (hard drive) and effectively utilizes all of the hardware
>>>> available. So far, we've done very well by fork()'ing child processes,
>>> ...
>>>> Initial results from a golang reimplementation of the object server in
>>>> Python are very positive[1]. We're not proposing to rewrite Swift
>>>> entirely in Golang. Specifically, we're looking at improving object
>>>> replication time in Swift. This service must discover what data is on
>>>> a drive, talk to other servers in the cluster about what they have,
>>>> and coordinate any data sync process that's needed.
>>>> [0] Hard, not impossible. Of course, given enough time, we can do
>>>>  anything in a Turing-complete language, right? But we're not talking
>>>>  about possible, we're talking about efficient tools for the job at
>>>>  hand.
>>> ...
>>> I'm glad you're finding you can get good results in (presumably)
>>> clean, understandable code.
>>> Given go's historically poor performance with multiple cores
>>> (https://golang.org/doc/faq#Why_GOMAXPROCS) I'm going to presume the
>>> major advantage is in the CSP programming model - something that
>>> Twisted does very well: and frustratingly we've had numerous
>>> discussions from folk in the Twisted world who see the pain we have
>>> and want to help, but as a community we've consistently stayed with
>>> eventlet, which has a threaded programming model - and threaded models
>>> are poorly suited for the case here.
>> At its core, the problem is that filesystem IO can take a surprisingly
>> long time, during which the calling thread/process is blocked, and
>> there's no good asynchronous alternative.
>> Some background:
>> With Eventlet, when your greenthread tries to read from a socket and the
>> socket is not readable, then recvfrom() returns -1/EWOULDBLOCK; then,
>> the Eventlet hub steps in, unschedules your greenthread, finds an
>> unblocked one, and lets it proceed. It's pretty good at servicing a
>> bunch of concurrent connections and keeping the CPU busy.
>> On the other hand, when the socket is readable, then recvfrom() returns
>> quickly (a few microseconds). The calling process was technically
>> blocked, but the syscall is so fast that it hardly matters.
>> Now, when your greenthread tries to read from a file, that read() call
>> doesn't return until the data is in your process's memory. This can take
>> a surprisingly long time. If the data isn't in buffer cache and the
>> kernel has to go fetch it from a spinning disk, then you're looking at a
>> seek time of ~7 ms, and that's assuming there are no other pending
>> requests for the disk.
>> There's no EWOULDBLOCK when reading from a plain file, either. If the
>> file pointer isn't at EOF, then the calling process blocks until the
>> kernel fetches data for it.
>> Back to Swift:
>> The Swift object server basically does two things: it either reads from
>> a disk and writes to a socket or vice versa. There's a little HTTP
>> parsing in there, but the vast majority of the work is shuffling bytes
>> between network and disk. One Swift object server can service many
>> clients simultaneously.
>> The problem is those pauses due to read(). If your process is servicing
>> hundreds of clients reading from and writing to dozens of disks (in,
>> say, a 48-disk 4U server), then all those little 7 ms waits are pretty
>> bad for throughput. Now, a lot of the time, the kernel does some
>> readahead so your read() calls can quickly return data from buffer
>> cache, but there are still lots of little hitches.
>> But wait: it gets worse. Sometimes a disk gets slow. Maybe it's got a
>> lot of pending IO requests, maybe its filesystem is getting close to
>> full, or maybe the disk hardware is just starting to get flaky. For
>> whatever reason, IO to this disk starts taking a lot longer than 7 ms on
>> average; think dozens or hundreds of milliseconds. Now, every time your
>> process tries to read from this disk, all other work stops for quite a
>> long time. The net effect is that the object server's throughput
>> plummets while it spends most of its time blocked on IO from that one
>> slow disk.
>> Now, of course there's things we can do. The obvious one is to use a
>> couple of IO threads per disk and push the blocking syscalls out
>> there... and, in fact, Swift did that. In commit b491549, the object
>> server gained a small threadpool for each disk[1] and started doing its
>> IO there.
>> This worked pretty well for avoiding the slow-disk problem. Requests
>> that touched the slow disk would back up, but requests for the other
>> disks in the server would proceed at a normal pace. Good, right?
>> The problem was all the threadpool overhead. Remember, a significant
>> fraction of the time, write() and read() only touch buffer cache, so the
>> syscalls are very fast. Adding in the threadpool overhead in Python
>> slowed those down. Yes, if you were hit with a 7 ms read penalty, the
>> threadpool saved you, but if you were reading from buffercache then you
>> just paid a big cost for no gain.
>> On some object-server nodes where the CPUs were already fully-utilized,
>> people saw a 25% drop in throughput when using the Python threadpools.
>> It's not worth that performance loss just to gain protection from slow
>> disks.
>> The second thing Swift tried was to run separate object-server processes
>> for each disk [2]. This also mitigates slow disks, but it avoids the
>> threadpool overhead. The downside here is that dense nodes end up with
>> lots of processes; for example, a 48-disk node with 2 object servers per
>> disk will end up with about 96 object-server processes running. While
>> these processes aren't particularly RAM-heavy, that's still a decent
>> chunk of memory that could have been holding directories in buffer cache.
>> Aside: there's a few other things we looked at but rejected. Using Linux
>> AIO (kernel AIO, not POSIX libaio) would let the object server have many
>> pending IOs cheaply, but it only works in O_DIRECT mode, so there's no
>> buffer cache. We also looked at the readv2() syscall to let us perform
>> buffer-cache-only reads in the main thread and use a blocking read()
>> syscall in a threadpool, but unfortunately readv2() and preadv2() only
>> hit Linux in March 2016, so people running such ancient software as
>> Ubuntu Xenial Xerus [3] can't use it.
> Didn't you try asyncio from Python 3? Wouldn't it be helpful here?

Unfortunately, it would not help.

The core problem is that filesystem syscalls can take a long time and 
that they block the calling thread.

At a syscall level, Eventlet and asyncio look pretty similar:

(1) Call select()/epoll()/whatever to wait for something to happen on 
many file descriptors

(2) For each ready file descriptor, do something. For example, if a 
socket fd is readable, call recvfrom(fd, buf, ...) repeatedly until the 
kernel returns -1/EWOULDBLOCK, then go to the next fd.

(3) Repeat. (Yes, there's timed events and such, but they don't matter 
for purposes of this discussion.)
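
For illustration, steps (1) and (2) above can be sketched with Python's 
standard selectors module. This is a minimal sketch, not Eventlet's or 
asyncio's actual implementation; the names drain_ready_sockets() and 
demo() are made up for this example, and the socketpair stands in for a 
real client connection.

```python
import selectors
import socket


def drain_ready_sockets(sel, timeout=1.0):
    """One pass of the loop: wait for readiness, then drain each ready socket."""
    received = {}
    # Step (1): wait for something to happen on the registered fds.
    for key, _events in sel.select(timeout):
        sock = key.fileobj
        chunks = []
        # Step (2): read repeatedly until the kernel says EWOULDBLOCK.
        while True:
            try:
                data = sock.recv(4096)
            except BlockingIOError:
                # Python's spelling of -1/EWOULDBLOCK: nothing more to read now.
                break
            if not data:
                # Empty read means the peer closed the connection.
                sel.unregister(sock)
                break
            chunks.append(data)
        received[sock] = b"".join(chunks)
    return received


def demo():
    """Hypothetical driver: one non-blocking socket with 5 bytes pending."""
    a, b = socket.socketpair()
    a.setblocking(False)
    sel = selectors.DefaultSelector()
    sel.register(a, selectors.EVENT_READ)
    b.sendall(b"hello")
    payload = drain_ready_sockets(sel)[a]
    sel.close()
    a.close()
    b.close()
    return payload
```

Every recv() call in the inner loop returns almost immediately, either 
with data or with BlockingIOError, which is what lets one thread service 
many sockets.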

The key thing that makes this all work is that, when select() says a 
file descriptor is readable, then reading from that file descriptor is 
extremely fast. In a matter of just a few microseconds, the program 
either gets some data to operate on or it gets -1/EWOULDBLOCK returned. 
And, of course, the same goes for writability.

Now, with files, this breaks down. Consider a file descriptor opened for 
reading on a normal file on a filesystem on a spinning disk. select() 
says it's readable, so the program calls read(), and the whole thing 
blocks for anywhere from a few tens of microseconds up to entire tens of 
milliseconds.

If the data is in the buffer cache, the kernel copies it from kernel 
memory to user memory. This is very fast.

If the data is not in buffer cache, the kernel has to get it off the 
disk. Assuming a mostly-idle disk and a reasonably small read size, 
we're looking at tens of milliseconds here. This is already too slow; 
our hypothetical program could have done a lot in 10 ms, but instead it 
was blocked waiting on a spinning disk. If the disk is busy or failing, 
that number gets even larger.

At the risk of repeating myself: the reason asynchronous IO works with 
sockets and pipes and such is the timing. select() says an fd is ready, 
and so interactions with that fd go really fast until you get 
EWOULDBLOCK, and then you go do something else. With files, select() 
says an fd is ready, but interactions with that fd might be really fast 
or excruciatingly slow or anywhere in between. You have no way of 
knowing how long things will take, and you never get the EWOULDBLOCK 
that would let you go do something else while you wait.
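
The behavior described above can be observed directly. The following is 
a small sketch, assuming a POSIX system such as Linux (select() on plain 
files does not work on Windows); probe_regular_file() is a hypothetical 
helper name. It opens a plain file with O_NONBLOCK, asks select() about 
it, and reads it to EOF.

```python
import os
import select
import tempfile


def probe_regular_file():
    """Show that a plain file is always 'ready' and never yields EWOULDBLOCK."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 1024)
        tmp.flush()
        # O_NONBLOCK is accepted here but does not change how plain files read.
        fd = os.open(tmp.name, os.O_RDONLY | os.O_NONBLOCK)
        try:
            # select() reports a regular file as readable right away.
            readable, _, _ = select.select([fd], [], [], 0)
            select_says_ready = fd in readable
            # This read() waits for as long as the kernel needs; there is no
            # BlockingIOError to tell us to go do something else.
            data = os.read(fd, 4096)
            # At EOF we get an empty result, still not EWOULDBLOCK.
            at_eof = os.read(fd, 4096)
        finally:
            os.close(fd)
    return select_says_ready, len(data), at_eof
```

In this sketch the file was just written, so it is in buffer cache and 
the read is fast. The same code against cold data on a busy spinning 
disk would take the slow path, and nothing in the return values would 
warn the caller in advance.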

Unfortunately, eventlet, asyncio, and anything else single-threaded are 
all going to run into this same problem.
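
To make the stall concrete, here is a rough sketch using asyncio. It is 
an illustration under stated assumptions, not Swift code: slow_read() 
uses time.sleep() as a stand-in for a read() that misses buffer cache, 
and count_ticks() and run() are hypothetical names. It counts how often 
an unrelated coroutine gets to run while the "read" is in progress, first 
with the call made inline on the event loop and then with the call pushed 
to a thread via loop.run_in_executor().

```python
import asyncio
import time


def slow_read():
    """Stand-in for a blocking read() on a slow disk (assumed 200 ms)."""
    time.sleep(0.2)
    return b"data"


async def count_ticks(stop):
    """Unrelated work: count how many times this coroutine gets scheduled."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def run(offload):
    """Return the number of ticks that happened during one slow_read()."""
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # let the ticker start before the read begins
    if offload:
        # Blocking call runs in a worker thread; the loop keeps scheduling.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(None, slow_read)
    else:
        # Blocking call runs on the loop's own thread; everything else waits.
        slow_read()
    stop.set()
    return await ticker
```

Calling asyncio.run(run(False)) shows the ticker barely advancing, while 
asyncio.run(run(True)) shows it ticking throughout. The second form is 
the threadpool approach discussed earlier in the thread, and it carries 
the per-call overhead described there, which is paid even when the read 
would have been served from buffer cache.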
