[openstack-dev] [tc] supporting Go

Thomas Goirand zigo at debian.org
Wed May 11 14:09:33 UTC 2016

On 05/10/2016 09:56 PM, Samuel Merritt wrote:
> On 5/9/16 5:21 PM, Robert Collins wrote:
>> On 10 May 2016 at 10:54, John Dickinson <me at not.mn> wrote:
>>> On 9 May 2016, at 13:16, Gregory Haynes wrote:
>>>> This is a bit of an aside but I am sure others are wondering the same
>>>> thing - Is there some info (specs/etherpad/ML thread/etc) that has more
>>>> details on the bottleneck you're running in to? Given that the only
>>>> clients of your service are the public facing DNS servers I am now even
>>>> more surprised that you're hitting a python-inherent bottleneck.
>>> In Swift's case, the summary is that it's hard[0] to write a network
>>> service in Python that shuffles data between the network and a block
>>> device (hard drive) and effectively utilizes all of the hardware
>>> available. So far, we've done very well by fork()'ing child processes,
>> ...
>>> Initial results from a reimplementation of the object server in
>>> Go are very positive[1]. We're not proposing to rewrite Swift
>>> entirely in Golang. Specifically, we're looking at improving object
>>> replication time in Swift. This service must discover what data is on
>>> a drive, talk to other servers in the cluster about what they have,
>>> and coordinate any data sync process that's needed.
>>> [0] Hard, not impossible. Of course, given enough time, we can do
>>>  anything in a Turing-complete language, right? But we're not talking
>>>  about possible, we're talking about efficient tools for the job at
>>>  hand.
>> ...
>> I'm glad you're finding you can get good results in (presumably)
>> clean, understandable code.
>> Given go's historically poor performance with multiple cores
>> (https://golang.org/doc/faq#Why_GOMAXPROCS) I'm going to presume the
>> major advantage is in the CSP programming model - something that
>> Twisted does very well: and frustratingly we've had numerous
>> discussions from folk in the Twisted world who see the pain we have
>> and want to help, but as a community we've consistently stayed with
>> eventlet, which has a threaded programming model - and threaded models
>> are poorly suited for the case here.
> At its core, the problem is that filesystem IO can take a surprisingly
> long time, during which the calling thread/process is blocked, and
> there's no good asynchronous alternative.
> Some background:
> With Eventlet, when your greenthread tries to read from a socket and the
> socket is not readable, then recvfrom() returns -1/EWOULDBLOCK; then,
> the Eventlet hub steps in, unschedules your greenthread, finds an
> unblocked one, and lets it proceed. It's pretty good at servicing a
> bunch of concurrent connections and keeping the CPU busy.
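[A minimal stdlib sketch of the behavior described above. This isn't Eventlet itself, just a plain non-blocking socket showing the EWOULDBLOCK signal that the hub keys off of:]

```python
import errno
import socket
import time

# A connected pair of sockets; 'a' is non-blocking, like a socket
# managed by Eventlet's hub.
a, b = socket.socketpair()
a.setblocking(False)

# Nothing has been sent yet, so recv() fails with EWOULDBLOCK/EAGAIN
# instead of blocking -- this is the signal the hub uses to unschedule
# the greenthread and let another one run.
try:
    a.recv(4096)
    would_block = False
except BlockingIOError as e:
    would_block = e.errno in (errno.EAGAIN, errno.EWOULDBLOCK)

# Once the peer has written, the same recv() returns in microseconds.
b.sendall(b"hello")
time.sleep(0.05)  # give the kernel a moment to move the bytes
data = a.recv(4096)

a.close()
b.close()
print(would_block, data)
```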
> On the other hand, when the socket is readable, then recvfrom() returns
> quickly (a few microseconds). The calling process was technically
> blocked, but the syscall is so fast that it hardly matters.
> Now, when your greenthread tries to read from a file, that read() call
> doesn't return until the data is in your process's memory. This can take
> a surprisingly long time. If the data isn't in buffer cache and the
> kernel has to go fetch it from a spinning disk, then you're looking at a
> seek time of ~7 ms, and that's assuming there are no other pending
> requests for the disk.
> There's no EWOULDBLOCK when reading from a plain file, either. If the
> file pointer isn't at EOF, then the calling process blocks until the
> kernel fetches data for it.
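[This is easy to demonstrate: even if you open a regular file with O_NONBLOCK, the flag is silently ignored for reads. A short sketch:]

```python
import os
import tempfile

# Write a small "object" to a real file on disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"swift object data")
    path = f.name

# O_NONBLOCK is accepted on a regular file but has no effect on reads:
# there is no EWOULDBLOCK path, so read() simply blocks until the kernel
# has the data (from buffer cache or, worst case, a disk seek).
fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
data = os.read(fd, 4096)
os.close(fd)
os.unlink(path)
print(data)
```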
> Back to Swift:
> The Swift object server basically does two things: it either reads from
> a disk and writes to a socket or vice versa. There's a little HTTP
> parsing in there, but the vast majority of the work is shuffling bytes
> between network and disk. One Swift object server can service many
> clients simultaneously.
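[Schematically, the GET path is just this loop -- names and chunk size below are illustrative, not Swift's actual code, with a BytesIO and a stub socket standing in so the sketch runs anywhere:]

```python
import io

CHUNK = 64 * 1024  # illustrative chunk size, not Swift's actual setting

def serve_object(disk_file, sock):
    """Schematic GET path: shuffle bytes from disk to the network."""
    while True:
        chunk = disk_file.read(CHUNK)  # this read() can stall ~7 ms on a seek
        if not chunk:
            break
        sock.sendall(chunk)  # socket writes yield cooperatively under Eventlet

# Stand-ins so the sketch is self-contained.
class FakeSock:
    def __init__(self):
        self.sent = b""
    def sendall(self, buf):
        self.sent += buf

sock = FakeSock()
serve_object(io.BytesIO(b"x" * 100_000), sock)
print(len(sock.sent))
```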
> The problem is those pauses due to read(). If your process is servicing
> hundreds of clients reading from and writing to dozens of disks (in,
> say, a 48-disk 4U server), then all those little 7 ms waits are pretty
> bad for throughput. Now, a lot of the time, the kernel does some
> readahead so your read() calls can quickly return data from buffer
> cache, but there are still lots of little hitches.
> But wait: it gets worse. Sometimes a disk gets slow. Maybe it's got a
> lot of pending IO requests, maybe its filesystem is getting close to
> full, or maybe the disk hardware is just starting to get flaky. For
> whatever reason, IO to this disk starts taking a lot longer than 7 ms on
> average; think dozens or hundreds of milliseconds. Now, every time your
> process tries to read from this disk, all other work stops for quite a
> long time. The net effect is that the object server's throughput
> plummets while it spends most of its time blocked on IO from that one
> slow disk.
> Now, of course, there are things we can do. The obvious one is to use a
> couple of IO threads per disk and push the blocking syscalls out
> there... and, in fact, Swift did that. In commit b491549, the object
> server gained a small threadpool for each disk[1] and started doing its
> IO there.
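[The shape of the idea, sketched with concurrent.futures rather than Swift's actual Eventlet tpool machinery -- class and method names here are illustrative, not Swift's API:]

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

class DiskIO:
    """Each disk gets its own small pool; a stalled disk only backs up
    its own pool's queue, not the whole server."""

    def __init__(self, threads_per_disk=4):
        self._threads = threads_per_disk
        self._pools = {}  # device name -> ThreadPoolExecutor

    def _pool(self, device):
        if device not in self._pools:
            self._pools[device] = ThreadPoolExecutor(self._threads)
        return self._pools[device]

    def pread(self, device, fd, size, offset):
        # The blocking syscall runs on the disk's own threads; the
        # calling greenthread is free to service other clients.
        return self._pool(device).submit(os.pread, fd, size, offset).result()

# Demo on a temp file standing in for an object on device "sdb".
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"abcdef")
    path = f.name
fd = os.open(path, os.O_RDONLY)
dio = DiskIO()
result = dio.pread("sdb", fd, 3, 2)  # 3 bytes at offset 2
os.close(fd)
os.unlink(path)
print(result)
```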
> This worked pretty well for avoiding the slow-disk problem. Requests
> that touched the slow disk would back up, but requests for the other
> disks in the server would proceed at a normal pace. Good, right?
> The problem was all the threadpool overhead. Remember, a significant
> fraction of the time, write() and read() only touch buffer cache, so the
> syscalls are very fast. Adding in the threadpool overhead in Python
> slowed those down. Yes, if you were hit with a 7 ms read penalty, the
> threadpool saved you, but if you were reading from buffer cache then you
> just paid a big cost for no gain.
> On some object-server nodes where the CPUs were already fully-utilized,
> people saw a 25% drop in throughput when using the Python threadpools.
> It's not worth that performance loss just to gain protection from slow
> disks.
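[The overhead is easy to see on any machine: a read served from buffer cache costs microseconds, while the Python-level thread handoff costs far more. A rough, illustrative measurement (exact numbers vary):]

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)
    path = f.name

fd = os.open(path, os.O_RDONLY)
pool = ThreadPoolExecutor(1)
N = 2000

t0 = time.perf_counter()
for _ in range(N):
    os.pread(fd, 4096, 0)  # hot path: data is in buffer cache
direct = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(N):
    pool.submit(os.pread, fd, 4096, 0).result()  # same read, via a thread
pooled = time.perf_counter() - t0

print(direct < pooled)  # the handoff dominates when the read is cheap
os.close(fd)
os.unlink(path)
pool.shutdown()
```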
> The second thing Swift tried was to run separate object-server processes
> for each disk [2]. This also mitigates slow disks, but it avoids the
> threadpool overhead. The downside here is that dense nodes end up with
> lots of processes; for example, a 48-disk node with 2 object servers per
> disk will end up with about 96 object-server processes running. While
> these processes aren't particularly RAM-heavy, that's still a decent
> chunk of memory that could have been holding directories in buffer cache.
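[For reference, this deployment style is driven by Swift's servers_per_port option; a hedged object-server.conf fragment, with illustrative values and assuming the ring assigns each disk its own port:]

```ini
[DEFAULT]
# With a ring that gives each disk its own port, Swift forks this many
# object-server processes per port (i.e. per disk), so a slow disk only
# stalls its own processes. 2 servers x 48 disks = 96 processes.
bind_ip = 0.0.0.0
servers_per_port = 2
```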
> Aside: there are a few other things we looked at but rejected. Using Linux
> AIO (kernel AIO, not POSIX libaio) would let the object server have many
> pending IOs cheaply, but it only works in O_DIRECT mode, so there's no
> buffer cache. We also looked at the preadv2() syscall to let us perform
> buffer-cache-only reads in the main thread and fall back to a blocking
> read() syscall in a threadpool, but unfortunately preadv2() and pwritev2()
> only hit Linux in March 2016, so people running such ancient software as
> Ubuntu Xenial Xerus [3] can't use it.

Did you try asyncio from Python 3? Wouldn't it be helpful here?


Thomas Goirand (zigo)
