[openstack-dev] [swift] Optimizing storage for small objects in Swift
Alexandre Lécuyer
alexandre.lecuyer at corp.ovh.com
Mon Jun 19 09:36:15 UTC 2017
Hello Clint,
Thanks for your feedback. I am replying inline below.
On 06/16/2017 10:54 PM, Clint Byrum wrote:
> Excerpts from John Dickinson's message of 2017-06-16 11:35:39 -0700:
>> On 16 Jun 2017, at 10:51, Clint Byrum wrote:
>>
>>> This is great work.
>>>
>>> I'm sure you've already thought of this, but could you explain why
>>> you've chosen not to put the small objects in the k/v store as part of
>>> the value rather than in secondary large files?
>> I don't want to co-opt an answer from Alex, but I do want to point to some of the other background on this LOSF work.
>>
>> https://wiki.openstack.org/wiki/Swift/ideas/small_files
>> https://wiki.openstack.org/wiki/Swift/ideas/small_files/experimentations
>> https://wiki.openstack.org/wiki/Swift/ideas/small_files/implementation
>>
> These are great. Thanks for sharing them, I understand a lot more now.
>
>> Look at the second link for some context to your answer, but the summary is "that means writing a file system, and writing a file system is really hard".
>>
> I'm not sure we were thinking the same thing.
>
> I was more asking, why not put the content of the object into the k/v
> instead of the big_file_id:offset? My thinking was that for smaller
> objects, you would just return the data immediately upon reading the k/v,
> rather than then needing to go find the big file and read the offset.
> However, I'm painfully aware that those directly involved with the problem
> have likely thought of this. However, the experiments don't seem to show
> that this was attempted. Perhaps I'm zooming too far out to see the real
> problem space. You can all tell me to take my spray paint can and stop
> staring at the bike shed if this is just too annoying. Seriously.
>
> Of course, one important thing is, what does one consider "small"? Seems
> like there's a size where the memory footprint of storing it in the
> k/v would be justifiable if reads just returned immediately from k/v
> vs. needing to also go get data from a big file on disk. Perhaps that
> size is too low to really matter. I was hoping that this had been
> considered and there was documentation, but I don't really see it.
Right, we considered this when we started the project: storing
small objects directly in the KV. It would not be too difficult to do,
but we see a few problems:
1) consistency
In the current design, we append data at the end of a "big file". When
the data upload is finished, swift writes the metadata and commits the
file. This triggers a fsync(). Only then do we return. We can rely on
the data being stable on disk, even if there is a power loss. Because
we fallocate() space for the "big files" beforehand, we can also hope to
have mostly sequential disk IO.
(Important as most swift clusters use SATA disks).
Once the object has been committed, we create an entry for it in the KV.
This is done asynchronously, because synchronous writes on the KV kill
performance. If we lose power, we lose the latest KV entries. After the
server is rebooted, we have to scan the end of volumes to create missing
entries in the KV. (I will not discuss this in detail in this email to
keep this short, but we can discuss it in another thread, or I can post
some information on the wiki).
If we put small objects in the KV, we would need to do synchronous
writes to make sure we don't lose data.
Also, currently we can completely reconstruct the KV from the "big
files". That would no longer be possible.
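To illustrate the ordering described above, here is a minimal sketch in
Python. It is not the actual LOSF code: the record layout (HEADER), the
function names, and the use of a plain dict in place of the KV are all
assumptions made for the example, and fallocate() is left out for
brevity. It only shows that the data is fsync()ed before the index is
touched, and that the index can be rebuilt by scanning the volume.

```python
import os
import struct

# Hypothetical record header: key length and data length, 4 bytes each.
HEADER = struct.Struct("<II")


def append_object(volume_path, index, key, data):
    """Append one record to the volume, fsync it, then record its offset."""
    key_bytes = key.encode("utf-8")
    with open(volume_path, "ab") as volume:
        offset = volume.seek(0, os.SEEK_END)
        volume.write(HEADER.pack(len(key_bytes), len(data)))
        volume.write(key_bytes)
        volume.write(data)
        volume.flush()
        os.fsync(volume.fileno())  # data is stable on disk from here on
    # In the design described above this update is asynchronous and may be
    # lost on power failure; a dict stands in for the KV in this sketch.
    index[key] = offset
    return offset


def rebuild_index(volume_path):
    """Recreate the key -> offset index by scanning the volume."""
    index = {}
    with open(volume_path, "rb") as volume:
        while True:
            offset = volume.tell()
            header = volume.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # end of volume, or a truncated trailing record
            key_len, data_len = HEADER.unpack(header)
            key_bytes = volume.read(key_len)
            data = volume.read(data_len)
            if len(key_bytes) < key_len or len(data) < data_len:
                break  # incomplete record, ignore it
            index[key_bytes.decode("utf-8")] = offset
    return index
```

If object data lived only in the KV, there would be no volume to scan,
so rebuild_index() would have nothing to work from.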
2) performance
On our clusters we see about 40% of physical disk IO being caused by
readdir().
We want to serve directory listing requests from memory. So "small"
means "the KV can fit in the page cache".
We estimate that we need the size per object to be below 50 bytes, which
doesn't leave much room for data.
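As a rough illustration of why the per-object size matters, the
arithmetic can be sketched as below. Only the 50 bytes per entry comes
from this email. The object count, the 4 KiB inline payload, and the
16 GiB of page cache are made-up numbers for the example, not
measurements from our clusters.

```python
def kv_index_bytes(object_count, bytes_per_entry):
    """Total KV size if every object costs bytes_per_entry."""
    return object_count * bytes_per_entry


def fits_in_page_cache(object_count, bytes_per_entry, cache_bytes):
    """True if the whole KV could stay resident in the page cache."""
    return kv_index_bytes(object_count, bytes_per_entry) <= cache_bytes


# Hypothetical inputs, chosen only to show the order of magnitude.
OBJECTS = 100_000_000
CACHE = 16 * 2**30  # 16 GiB

index_only = kv_index_bytes(OBJECTS, 50)          # 5_000_000_000 bytes
with_inline = kv_index_bytes(OBJECTS, 50 + 4096)  # 414_600_000_000 bytes

print(fits_in_page_cache(OBJECTS, 50, CACHE))         # True
print(fits_in_page_cache(OBJECTS, 50 + 4096, CACHE))  # False
```

With these assumed numbers, the index alone fits in memory, while the
same index with 4 KiB of inline data per object is far too large.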
LevelDB causes write amplification, as it will regularly copy data to
different files (levels) to keep keys compressed and in sorted order. If
we store object data within the KV, it will be copied around multiple
times as well.
Finally, it is also simpler to have only one path to handle. Beyond
these issues, it would not be difficult to store data in the KV. This is
something we can revisit after more testing and maybe some production
experience.
>
> Also the "writing your own filesystem" option in experiments seemed
> more like a thing to do if you left the k/v stores out entirely.