[openstack-dev] [swift] Optimizing storage for small objects in Swift

Clint Byrum clint at fewbar.com
Fri Jun 16 17:51:43 UTC 2017


This is great work.

I'm sure you've already thought of this, but could you explain why
you've chosen to put the small objects in secondary large files rather
than in the k/v store as part of the value?

Excerpts from Alexandre Lécuyer's message of 2017-06-16 15:54:08 +0200:
> Swift stores objects on a regular filesystem (XFS is recommended), one file per object. While this works fine for medium or big objects, when you have lots of small objects you can run into issues: because of the high inode count on the object servers, the inodes cannot stay in cache, which implies a lot of memory usage and extra IO operations to fetch inodes from disk.
> 
> In the past few months, we’ve been working on implementing a new storage backend in Swift. It is highly inspired by haystack[1]. In a few words, objects are stored in big files, and a key/value store provides the information needed to locate an object (object hash -> big_file_id:offset). As the mapping in the K/V consumes less memory than an inode, it is possible to keep all entries in memory, saving many IOs when locating an object. It also allows some performance improvements by limiting XFS metadata updates (e.g. almost no inode updates, as we write objects using fdatasync() instead of fsync()).
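> 
> To make this concrete, here is a rough sketch of the write path (not our actual code; it assumes the plyvel LevelDB binding, and the file paths, names and value layout are only illustrative):
> 
>     import os
>     import struct
> 
>     import plyvel
> 
>     # One K/V per disk in this sketch; it maps object hash -> (big_file_id, offset).
>     kv = plyvel.DB('/srv/node/sda/objects.kv', create_if_missing=True)
> 
>     def put_object(big_file_fd, big_file_id, obj_hash, data):
>         # Append the object to the current "big file".
>         offset = os.lseek(big_file_fd, 0, os.SEEK_END)
>         os.write(big_file_fd, data)
>         # fdatasync() flushes the data but skips non-essential inode
>         # metadata such as mtime, unlike fsync().
>         os.fdatasync(big_file_fd)
>         # Record where the object lives (a fixed 16-byte value here).
>         kv.put(bytes.fromhex(obj_hash), struct.pack('>QQ', big_file_id, offset))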
> 
> One of the questions raised during discussions about this design is: do we want one K/V store per device, or one K/V store per Swift partition (i.e. multiple K/Vs per device)? The concern is about the failure domain: if the single K/V gets corrupted, the whole device must be reconstructed. Memory usage is a major point in making a decision, so we ran some benchmarks.
> 
> The key-value store is implemented over LevelDB.
> Given a single disk with 20 million files (each could be either one object replica or one fragment, if using EC), I have tested three cases:
>    - a single KV for the whole disk
>    - one KV per partition, with 100 partitions per disk
>    - one KV per partition, with 1000 partitions per disk
> 
> Single KV for the whole disk:
>    - DB size: 750 MB
>    - bytes per object: 38
> 
> One KV per partition, assuming:
>    - 100 partitions on the disk (=> 100 KVs)
>    - a 16-bit part power (=> all keys in a given KV share the same 16-bit prefix)
> 
>    - 7916 KB per KV, total DB size: 773 MB
>    - bytes per object: 41
> 
> One KV per partition, assuming:
>    - 1000 partitions on the disk (=> 1000 KVs)
>    - a 16-bit part power (=> all keys in a given KV share the same 16-bit prefix; see the sketch below)
> 
>    - 1388 KB per KV, total DB size: 1355 MB
>    - bytes per object: 71
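> 
> (The "same 16-bit prefix" remark comes from the way Swift derives the partition from the object hash; roughly, and with illustrative names:)
> 
>     import struct
> 
>     PART_POWER = 16
>     PART_SHIFT = 32 - PART_POWER
> 
>     def partition_of(obj_hash_hex):
>         # The partition is (roughly) the top PART_POWER bits of the md5-based
>         # object hash, so all keys stored in a given partition's KV start
>         # with the same 16-bit prefix.
>         digest = bytes.fromhex(obj_hash_hex)
>         return struct.unpack_from('>I', digest)[0] >> PART_SHIFT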
>    
> 
> A typical server we use for Swift clusters has 36 drives, which gives us:
> - Single KV: 26 GB
> - Split KV, 100 partitions: 28 GB (+7%)
> - Split KV, 1000 partitions: 48 GB (+85%)
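> 
> (These per-server figures are simply the per-disk DB sizes scaled to 36 drives, e.g. 36 x 750 MB is roughly 26 GB for the single-KV case.)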
> 
> So, splitting seems reasonable if you don't have too many partitions.
> 
> Same test, with 10 million files instead of 20 million:
> 
> - Single KV: 13 GB
> - Split KV, 100 partitions: 18 GB (+38%)
> - Split KV, 1000 partitions: 24 GB (+85%)
> 
> 
> Finally, if we run a full compaction on the DB after the test, we get the
> same memory usage in all cases: about 32 bytes per object.
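> 
> ("Full compaction" here means compacting the whole key range, for example with the plyvel binding:)
> 
>     import plyvel
> 
>     db = plyvel.DB('/srv/node/sda/objects.kv')
>     db.compact_range()   # no start/stop given: compact the entire key range
>     db.close()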
> 
> We have not run enough tests to know what would happen in production. LevelDB
> does trigger compaction automatically on parts of the DB, but with continuous change
> we probably would not reach the smallest possible size.
> 
> 
> Beyond the size issue, there are other things to consider:
> File descriptor limits: LevelDB seems to keep at least 4 file descriptors open during operation.
> 
> Having one KV per partition also means we would have to move entries between KVs when the part power changes (if we want to support that).
> 
> A compromise may be to split KVs on a small prefix of the object's hash, independent of Swift's configuration.
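> 
> A rough sketch of that compromise (the number of KVs and the prefix length are arbitrary here, just for illustration):
> 
>     NUM_KVS = 64  # fixed, independent of the ring's part power
> 
>     def kv_index(obj_hash_hex):
>         # Pick a KV from the first byte of the object hash; changing the
>         # part power would then not require moving entries between KVs.
>         return int(obj_hash_hex[:2], 16) % NUM_KVS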
> 
> As you can see, we're still thinking about this. Any ideas are welcome!
> We will keep you updated with more "real world" testing. Among other tests, we plan to check how resilient the DB is in case of a power loss.
> 


