[openstack-dev] [swift] Optimizing storage for small objects in Swift

Alexandre Lécuyer alexandre.lecuyer at corp.ovh.com
Mon Jun 19 09:36:15 UTC 2017

Hello Clint,

Thanks for your feedback. I am replying inline below.

On 06/16/2017 10:54 PM, Clint Byrum wrote:
> Excerpts from John Dickinson's message of 2017-06-16 11:35:39 -0700:
>> On 16 Jun 2017, at 10:51, Clint Byrum wrote:
>>> This is great work.
>>> I'm sure you've already thought of this, but could you explain why
>>> you've chosen not to put the small objects in the k/v store as part of
>>> the value rather than in secondary large files?
>> I don't want to co-opt an answer from Alex, but I do want to point to some of the other background on this LOSF work.
>> https://wiki.openstack.org/wiki/Swift/ideas/small_files
>> https://wiki.openstack.org/wiki/Swift/ideas/small_files/experimentations
>> https://wiki.openstack.org/wiki/Swift/ideas/small_files/implementation
> These are great. Thanks for sharing them, I understand a lot more now.
>> Look at the second link for some context to your answer, but the summary is "that means writing a file system, and writing a file system is really hard".
> I'm not sure we were thinking the same thing.
> I was more asking, why not put the content of the object into the k/v
> instead of the big_file_id:offset? My thinking was that for smaller
> objects, you would just return the data immediately upon reading the k/v,
> rather than then needing to go find the big file and read the offset.
> However, I'm painfully aware that those directly involved with the problem
> have likely thought of this. However, the experiments don't seem to show
> that this was attempted. Perhaps I'm zooming too far out to see the real
> problem space. You can all tell me to take my spray paint can and stop
> staring at the bike shed if this is just too annoying. Seriously.
> Of course, one important thing is, what does one consider "small"? Seems
> like there's a size where the memory footprint of storing it in the
> k/v would be justifiable if reads just returned immediately from k/v
> vs. needing to also go get data from a big file on disk. Perhaps that
> size is too low to really matter. I was hoping that this had been
> considered and there was documentation, but I don't really see it.
Right, we had considered this when we started the project: storing
small objects directly in the KV. It would not be too difficult to do,
but we see a few problems:

1) consistency
In the current design, we append data at the end of a "big file". When 
the data upload is finished, swift writes the metadata and commits the 
file. This triggers a fsync(). Only then do we return. We can rely on 
the data being stable on disk, even if there is a power loss.  Because 
we fallocate() space for the "big files" beforehand, we can also hope to 
have mostly sequential disk IO.
(This matters because most Swift clusters use SATA disks.)
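
As an illustration only (this is not the actual LOSF code), the write path
described above could be sketched in Python as follows. The 4-byte length
header and the names preallocate() and append_object() are assumptions made
for this example; the point is only the ordering: reserve space, write the
record, fsync(), then return.

```python
import os
import struct

# Hypothetical record header: 4-byte little-endian payload length.
HEADER = struct.Struct("<I")


def preallocate(fd, size):
    """Reserve space up front so later appends are mostly sequential."""
    if hasattr(os, "posix_fallocate"):
        os.posix_fallocate(fd, 0, size)


def append_object(fd, offset, payload):
    """Write one record at `offset`, fsync, and return the next free offset."""
    view = memoryview(HEADER.pack(len(payload)) + payload)
    pos = offset
    while view:
        # pwrite may write fewer bytes than requested, so loop until done.
        written = os.pwrite(fd, view, pos)
        view = view[written:]
        pos += written
    # Only report success once the data is stable on disk.
    os.fsync(fd)
    return pos
```

The caller keeps track of the returned offset, since the preallocated file
size no longer tells us where the written data ends.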

Once the object has been committed, we create an entry for it in the KV.
This is done asynchronously, because synchronous writes on the KV kill
performance. If we lose power, we lose the latest KV entries. After the
server is rebooted, we have to scan the end of the volumes to create the
missing entries in the KV. (I will not discuss this in detail in this
email to keep it short, but we can discuss it in another thread, or I can
post some information on the wiki.)

If we put small objects in the KV, we would need to do synchronous
writes to make sure we don't lose data.
Also, currently we can completely reconstruct the KV from the "big
files". That would no longer be possible.
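
To make the reconstruction point concrete, here is a minimal sketch,
assuming a hypothetical record layout in which each record in a "big file"
carries the object name next to its data. The layout and the names
pack_record() and rebuild_index() are invented for this example and are not
the real on-disk format. Because the name is stored with the data, the KV
(name to offset) can be rebuilt by a linear scan:

```python
import struct

# Hypothetical record: 2-byte name length, 4-byte data length, name, data.
HEADER = struct.Struct("<HI")


def pack_record(name, data):
    """Serialize one object as header + name + data."""
    return HEADER.pack(len(name), len(data)) + name + data


def rebuild_index(volume):
    """Scan a volume image and return {name: (data_offset, data_length)}."""
    index = {}
    pos = 0
    while pos + HEADER.size <= len(volume):
        name_len, data_len = HEADER.unpack_from(volume, pos)
        if name_len == 0:
            break  # reached preallocated space that was never written
        name_start = pos + HEADER.size
        data_start = name_start + name_len
        end = data_start + data_len
        if end > len(volume):
            break  # incomplete record at the tail, ignore it
        index[bytes(volume[name_start:data_start])] = (data_start, data_len)
        pos = end
    return index
```

If the object data lived only inside the KV, there would be no second copy
to scan, so a damaged KV could not be rebuilt this way.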

2) performance
On our clusters we see about 40% of physical disk IO being caused by 
We want to serve directory listing requests from memory. So "small" 
means "the KV can fit in the page cache".
We estimate that we need the size per object to be below 50 bytes, which 
doesn't leave much room for data.
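
A back-of-the-envelope sketch of that constraint, with made-up numbers: the
object count, the cache size and the 4 KiB inline payload below are
illustrative assumptions, not measurements from our clusters. Only the 50
bytes per object figure comes from the estimate above.

```python
def kv_footprint(num_objects, bytes_per_entry):
    """Approximate KV size in bytes, ignoring LevelDB overhead."""
    return num_objects * bytes_per_entry


def fits_in_page_cache(num_objects, bytes_per_entry, cache_bytes):
    """True if the whole KV could stay resident in the page cache."""
    return kv_footprint(num_objects, bytes_per_entry) <= cache_bytes


GIB = 2 ** 30
objects = 100_000_000   # hypothetical number of objects on one server
cache = 8 * GIB         # hypothetical memory available for the page cache

# Index entries only, at 50 bytes per object.
print(fits_in_page_cache(objects, 50, cache))
# Same entries with a hypothetical 4 KiB of object data stored inline.
print(fits_in_page_cache(objects, 50 + 4096, cache))
```

With these assumed numbers the index alone fits, while inlining even small
payloads pushes the KV far beyond what the page cache can hold.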

LevelDB causes write amplification, as it will regularly copy data to 
different files (levels) to keep keys compressed and in sorted order. If 
we store object data within the KV, it will be copied around multiple 
times as well.

Finally, it is also simpler to have only one path to handle. Beyond
these issues, it would not be difficult to store data in the KV. This is
something we can revisit after more testing and maybe some production

> Also the "writing your own filesystem" option in experiments seemed
> more like a thing to do if you left the k/v stores out entirely.
