[openstack-dev] [swift] Optimizing storage for small objects in Swift

Clint Byrum clint at fewbar.com
Mon Jun 19 17:23:27 UTC 2017

Excerpts from Alexandre Lécuyer's message of 2017-06-19 11:36:15 +0200:
> Hello Clint,
> Thanks for your feedback, I am replying inline.
> On 06/16/2017 10:54 PM, Clint Byrum wrote:
> > Excerpts from John Dickinson's message of 2017-06-16 11:35:39 -0700:
> >> On 16 Jun 2017, at 10:51, Clint Byrum wrote:
> >>
> >>> This is great work.
> >>>
> >>> I'm sure you've already thought of this, but could you explain why
> >>> you've chosen not to put the small objects in the k/v store as part of
> >>> the value rather than in secondary large files?
> >> I don't want to co-opt an answer from Alex, but I do want to point to some of the other background on this LOSF work.
> >>
> >> https://wiki.openstack.org/wiki/Swift/ideas/small_files
> >> https://wiki.openstack.org/wiki/Swift/ideas/small_files/experimentations
> >> https://wiki.openstack.org/wiki/Swift/ideas/small_files/implementation
> >>
> > These are great. Thanks for sharing them, I understand a lot more now.
> >
> >> Look at the second link for some context to your answer, but the summary is "that means writing a file system, and writing a file system is really hard".
> >>
> > I'm not sure we were thinking the same thing.
> >
> > I was more asking, why not put the content of the object into the k/v
> > instead of the big_file_id:offset? My thinking was that for smaller
> > objects, you would just return the data immediately upon reading the k/v,
> > rather than then needing to go find the big file and read the offset.
> > However, I'm painfully aware that those directly involved with the problem
> > have likely thought of this. Still, the experiments don't seem to show
> > that this was attempted. Perhaps I'm zooming too far out to see the real
> > problem space. You can all tell me to take my spray paint can and stop
> > staring at the bike shed if this is just too annoying. Seriously.
> >
> > Of course, one important thing is, what does one consider "small"? Seems
> > like there's a size where the memory footprint of storing it in the
> > k/v would be justifiable if reads just returned immediately from k/v
> > vs. needing to also go get data from a big file on disk. Perhaps that
> > size is too low to really matter. I was hoping that this had been
> > considered and there was documentation, but I don't really see it.
> Right, we had considered this when we started the project: storing
> small objects directly in the KV. It would not be too difficult to do,
> but we see a few problems:
> 1) consistency
> In the current design, we append data at the end of a "big file". When
> the data upload is finished, Swift writes the metadata and commits the
> file. This triggers an fsync(). Only then do we return. We can rely on
> the data being stable on disk, even if there is a power loss. Because
> we fallocate() space for the "big files" beforehand, we can also hope to
> have mostly sequential disk IO.
> (This is important, as most Swift clusters use SATA disks.)
> Once the object has been committed, we create an entry for it in the KV.
> This is done asynchronously, because synchronous writes on the KV kill
> performance. If we lose power, we lose the latest KV entries. After the
> server is rebooted, we have to scan the end of volumes to create missing 
> entries in the KV. (I will not discuss this in detail in this email to 
> keep this short, but we can discuss it in another thread, or I can post 
> some information on the wiki).
> If we put small objects in the KV, we would need to do synchronous
> writes to make sure we don't lose data.
> Also, currently we can completely reconstruct the KV from the "big
> files". That would no longer be possible.
> 2) performance
> On our clusters we see about 40% of physical disk IO being caused by 
> readdir().
> We want to serve directory listing requests from memory. So "small" 
> means "the KV can fit in the page cache".
> We estimate that we need the size per object to be below 50 bytes, which 
> doesn't leave much room for data.
> LevelDB causes write amplification, as it will regularly copy data to 
> different files (levels) to keep keys compressed and in sorted order. If 
> we store object data within the KV, it will be copied around multiple 
> times as well.
> Finally, it is also simpler to have only one path to handle. Beyond
> these issues, it would not be difficult to store data in the KV. This is
> something we can revisit after more testing and maybe some production
> experience.
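
For readers following along, the write path described under "1) consistency"
can be sketched roughly as below. This is only a toy illustration under
stated assumptions, not the LOSF code: the record layout (HEADER), the
function names, and the plain dict standing in for the KV are all
hypothetical. It also rescans the whole volume, whereas the design above only
scans the end of volumes after a crash, and it leaves out fallocate().

```python
import os
import struct

# Hypothetical record header: key length (2 bytes), data length (4 bytes).
HEADER = struct.Struct("<HI")


def append_object(volume_path, key, data):
    """Append one record to the "big file", fsync it, return the data offset."""
    key_bytes = key.encode("utf-8")
    with open(volume_path, "ab") as volume:
        record_start = volume.seek(0, os.SEEK_END)
        volume.write(HEADER.pack(len(key_bytes), len(data)))
        volume.write(key_bytes)
        volume.write(data)
        volume.flush()
        os.fsync(volume.fileno())  # data is stable on disk before we return
    return record_start + HEADER.size + len(key_bytes)


def read_object(volume_path, offset, length):
    """Read an object's data given the (offset, length) stored in the index."""
    with open(volume_path, "rb") as volume:
        volume.seek(offset)
        return volume.read(length)


def rebuild_index(volume_path):
    """Recreate key -> (offset, length) entries by scanning the volume."""
    index = {}
    with open(volume_path, "rb") as volume:
        while True:
            header = volume.read(HEADER.size)
            if len(header) < HEADER.size:
                break  # end of volume, or an incomplete trailing header
            key_len, data_len = HEADER.unpack(header)
            key = volume.read(key_len).decode("utf-8")
            data_offset = volume.tell()
            volume.seek(data_len, os.SEEK_CUR)  # skip the object data itself
            index[key] = (data_offset, data_len)
    return index
```

The index update can be deferred (a dict assignment here, an asynchronous KV
write in the design above) because every entry can be recovered from the
volume alone. If the object data lived only in the KV, that would stop being
true, which is the second half of the consistency argument.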

Really great explanation. Thanks for sharing. I hope we can all learn
from the thorough approach you've taken to this problem. Good luck!
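
On the "the KV can fit in the page cache" point, the sizing argument is
simple arithmetic, sketched below. The page cache size, object count, and
entry sizes are made-up numbers for illustration, not measurements from any
cluster; only the shape of the calculation comes from the explanation above.

```python
def max_bytes_per_entry(page_cache_bytes, object_count):
    """Largest average KV entry size that still fits the whole KV in cache."""
    if object_count <= 0:
        raise ValueError("object_count must be positive")
    return page_cache_bytes / object_count


def kv_fits_in_cache(page_cache_bytes, object_count, index_bytes, inline_bytes=0):
    """True if index entries plus any inlined object data fit in the cache."""
    return object_count * (index_bytes + inline_bytes) <= page_cache_bytes


# Hypothetical node: 8 GiB of page cache available, 200 million objects.
PAGE_CACHE_BYTES = 8 * 2**30
OBJECT_COUNT = 200_000_000

budget = max_bytes_per_entry(PAGE_CACHE_BYTES, OBJECT_COUNT)
index_only = kv_fits_in_cache(PAGE_CACHE_BYTES, OBJECT_COUNT, index_bytes=40)
with_inline = kv_fits_in_cache(
    PAGE_CACHE_BYTES, OBJECT_COUNT, index_bytes=40, inline_bytes=1024
)
```

With these made-up numbers the budget comes out at a few tens of bytes per
object, so an index-only entry fits while even 1 KiB of inlined data per
object does not.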

> >
> > Also the "writing your own filesystem" option in experiments seemed
> > more like a thing to do if you left the k/v stores out entirely.
