[openstack-dev] [tc] supporting Go

Ben Meyer ben.meyer at rackspace.com
Wed May 11 19:36:39 UTC 2016

On 05/11/2016 03:23 PM, Samuel Merritt wrote:
> On 5/11/16 7:09 AM, Thomas Goirand wrote:
>> On 05/10/2016 09:56 PM, Samuel Merritt wrote:
>>> On 5/9/16 5:21 PM, Robert Collins wrote:
>>>> On 10 May 2016 at 10:54, John Dickinson <me at not.mn> wrote:
>>>>> On 9 May 2016, at 13:16, Gregory Haynes wrote:
>>>>>> This is a bit of an aside but I am sure others are wondering the
>>>>>> same
>>>>>> thing - Is there some info (specs/etherpad/ML thread/etc) that
>>>>>> has more
>>>>>> details on the bottleneck you're running in to? Given that the only
>>>>>> clients of your service are the public facing DNS servers I am
>>>>>> now even
>>>>>> more surprised that you're hitting a python-inherent bottleneck.
>>>>> In Swift's case, the summary is that it's hard[0] to write a network
>>>>> service in Python that shuffles data between the network and a block
>>>>> device (hard drive) and effectively utilizes all of the hardware
>>>>> available. So far, we've done very well by fork()'ing child
>>>>> processes,
>>>> ...
>>>>> Initial results from a golang reimplementation of the object
>>>>> server in
>>>>> Python are very positive[1]. We're not proposing to rewrite Swift
>>>>> entirely in Golang. Specifically, we're looking at improving object
>>>>> replication time in Swift. This service must discover what data is on
>>>>> a drive, talk to other servers in the cluster about what they have,
>>>>> and coordinate any data sync process that's needed.
>>>>> [0] Hard, not impossible. Of course, given enough time, we can do
>>>>>  anything in a Turing-complete language, right? But we're not talking
>>>>>  about possible, we're talking about efficient tools for the job at
>>>>>  hand.
>>>> ...
>>>> I'm glad you're finding you can get good results in (presumably)
>>>> clean, understandable code.
>>>> Given go's historically poor perfornance with multiple cores
>>>> (https://golang.org/doc/faq#Why_GOMAXPROCS) I'm going to presume the
>>>> major advantage is in the CSP programming model - something that
>>>> Twisted does very well: and frustratingly we've had numerous
>>>> discussions from folk in the Twisted world who see the pain we have
>>>> and want to help, but as a community we've consistently stayed with
>>>> eventlet, which has a threaded programming model - and threaded models
>>>> are poorly suited for the case here.
>>> At its core, the problem is that filesystem IO can take a surprisingly
>>> long time, during which the calling thread/process is blocked, and
>>> there's no good asynchronous alternative.
>>> Some background:
>>> With Eventlet, when your greenthread tries to read from a socket and
>>> the
>>> socket is not readable, then recvfrom() returns -1/EWOULDBLOCK; then,
>>> the Eventlet hub steps in, unschedules your greenthread, finds an
>>> unblocked one, and lets it proceed. It's pretty good at servicing a
>>> bunch of concurrent connections and keeping the CPU busy.
>>> On the other hand, when the socket is readable, then recvfrom() returns
>>> quickly (a few microseconds). The calling process was technically
>>> blocked, but the syscall is so fast that it hardly matters.
>>> Now, when your greenthread tries to read from a file, that read() call
>>> doesn't return until the data is in your process's memory. This can
>>> take
>>> a surprisingly long time. If the data isn't in buffer cache and the
>>> kernel has to go fetch it from a spinning disk, then you're looking
>>> at a
>>> seek time of ~7 ms, and that's assuming there are no other pending
>>> requests for the disk.
>>> There's no EWOULDBLOCK when reading from a plain file, either. If the
>>> file pointer isn't at EOF, then the calling process blocks until the
>>> kernel fetches data for it.
>>> Back to Swift:
>>> The Swift object server basically does two things: it either reads from
>>> a disk and writes to a socket or vice versa. There's a little HTTP
>>> parsing in there, but the vast majority of the work is shuffling bytes
>>> between network and disk. One Swift object server can service many
>>> clients simultaneously.
>>> The problem is those pauses due to read(). If your process is servicing
>>> hundreds of clients reading from and writing to dozens of disks (in,
>>> say, a 48-disk 4U server), then all those little 7 ms waits are pretty
>>> bad for throughput. Now, a lot of the time, the kernel does some
>>> readahead so your read() calls can quickly return data from buffer
>>> cache, but there are still lots of little hitches.
>>> But wait: it gets worse. Sometimes a disk gets slow. Maybe it's got a
>>> lot of pending IO requests, maybe its filesystem is getting close to
>>> full, or maybe the disk hardware is just starting to get flaky. For
>>> whatever reason, IO to this disk starts taking a lot longer than 7
>>> ms on
>>> average; think dozens or hundreds of milliseconds. Now, every time your
>>> process tries to read from this disk, all other work stops for quite a
>>> long time. The net effect is that the object server's throughput
>>> plummets while it spends most of its time blocked on IO from that one
>>> slow disk.
>>> Now, of course there's things we can do. The obvious one is to use a
>>> couple of IO threads per disk and push the blocking syscalls out
>>> there... and, in fact, Swift did that. In commit b491549, the object
>>> server gained a small threadpool for each disk[1] and started doing its
>>> IO there.
>>> This worked pretty well for avoiding the slow-disk problem. Requests
>>> that touched the slow disk would back up, but requests for the other
>>> disks in the server would proceed at a normal pace. Good, right?
>>> The problem was all the threadpool overhead. Remember, a significant
>>> fraction of the time, write() and read() only touch buffer cache, so
>>> the
>>> syscalls are very fast. Adding in the threadpool overhead in Python
>>> slowed those down. Yes, if you were hit with a 7 ms read penalty, the
>>> threadpool saved you, but if you were reading from buffercache then you
>>> just paid a big cost for no gain.
>>> On some object-server nodes where the CPUs were already fully-utilized,
>>> people saw a 25% drop in throughput when using the Python threadpools.
>>> It's not worth that performance loss just to gain protection from slow
>>> disks.
>>> The second thing Swift tried was to run separate object-server
>>> processes
>>> for each disk [2]. This also mitigates slow disks, but it avoids the
>>> threadpool overhead. The downside here is that dense nodes end up with
>>> lots of processes; for example, a 48-disk node with 2 object servers
>>> per
>>> disk will end up with about 96 object-server processes running. While
>>> these processes aren't particularly RAM-heavy, that's still a decent
>>> chunk of memory that could have been holding directories in buffer
>>> cache.
>>> Aside: there's a few other things we looked at but rejected. Using
>>> Linux
>>> AIO (kernel AIO, not POSIX libaio) would let the object server have
>>> many
>>> pending IOs cheaply, but it only works in O_DIRECT mode, so there's no
>>> buffer cache. We also looked at the readv2() syscall to let us perform
>>> buffer-cache-only reads in the main thread and use a blocking read()
>>> syscall in a threadpool, but unfortunately readv2() and preadv2() only
>>> hit Linux in March 2016, so people running such ancient software as
>>> Ubuntu Xenial Xerus [3] can't use it.
>> Didn't you try asyncio from Python 3? Wouldn't it be helpful here?
> Unfortunately, it would not help.
> The core problem is that filesystem syscalls can take a long time and
> that they block the calling thread.
> At a syscall level, Eventlet and asyncio look pretty similar:
> (1) Call select()/epoll()/whatever to wait for something to happen on
> many file descriptors
> (2) For each ready file descriptor, do something. For example, if a
> socket fd is readable, call recvfrom(fd, buf, ...) repeatedly until
> the kernel returns -1/EWOULDBLOCK, then go to the next fd.
> (3) Repeat. (Yes, there's timed events and such, but they don't matter
> for purposes of this discussion.)
> The key thing that makes this all work is that, when select() says a
> file descriptor is readable, then reading from that file descriptor is
> extremely fast. In a matter of just a few microseconds, the program
> either gets some data to operate on or it gets -1/EWOULDBLOCK
> returned. And, of course, the same goes for writability.
> Now, with files, this breaks down. Consider a file descriptor opened
> for reading on a normal file on a filesystem on a spinning disk.
> select() says it's readable, so the program calls read(), and the
> whole thing blocks for anywhere from a few tens of microseconds up to
> entire tens of seconds.

Have you tried putting the file into non-blocking mode?
>From the above description is doesn't sound like you have.

gevent[1] supports that and provides an interface for wrapping python
file objects for doing so, at least on Linux.
This keeps the thread from being blocked and thereby allowing a
different greenlet worker to continue while the greenlet waits on its
I/O to be available.

I don't see any equivalent in eventlet[2] but that would seem like a
more worthwhile contribution to eventlet or even oslo.



[1] http://www.gevent.org/gevent.fileobject.html
[2] http://eventlet.net/doc/

More information about the OpenStack-dev mailing list