[openstack-dev] Observations re swift-container usage of SQLite

Taras Glek taras at glek.net
Wed Sep 3 00:05:12 UTC 2014


Hi,
I have done some SQLite footgun elimination at Mozilla and was curious
whether swift ran into similar issues.
From blog posts like
http://blog.maginatics.com/2014/05/13/multi-container-sharding-making-openstack-swift-swifter/
 and http://engineering.spilgames.com/openstack-swift-lots-small-files/ it
seemed worth looking into.

*Good things*
* torgomatic pointed out on IRC that inserts are now batched via an
intermediate file that isn't fsync()ed
(https://github.com/openstack/swift/commit/85362fdf4e7e70765ba08cee288437a763ea5475).
That should help with the use cases described by the above blog posts. I
hope the rest of my observations are still of some use.
* There are few indexes involved. This is good because indexes in
single-file databases are very risky for perf.

I set up devstack on my laptop to observe swift performance and poke at the
resulting db. I don't have a proper benchmarking environment, so I have not
been able to check whether any of my observations hold up.

*Container .db handle LRU*
It seems that container DBs are opened once per read/write operation.
Having container-server keep an LRU list of db handles might help workloads
with hot containers.
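
A minimal sketch of what such a handle LRU could look like, using only the
Python standard library. The class name ConnectionLRU and the capacity
default are made up for illustration and are not part of swift. A real
container-server would also have to deal with threading/eventlet and with
invalidating handles when a .db file is replaced or deleted, which this
sketch ignores.

```python
import sqlite3
from collections import OrderedDict


class ConnectionLRU:
    """Keep up to `capacity` open SQLite connections (hypothetical helper)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self._conns = OrderedDict()  # db path -> sqlite3.Connection

    def get(self, path):
        # Reuse an open handle if there is one, otherwise open the db.
        conn = self._conns.pop(path, None)
        if conn is None:
            conn = sqlite3.connect(path)
        self._conns[path] = conn  # (re)insert as most recently used
        # Close and drop the least recently used handles over capacity.
        while len(self._conns) > self.capacity:
            _, oldest = self._conns.popitem(last=False)
            oldest.close()
        return conn

    def close_all(self):
        while self._conns:
            _, conn = self._conns.popitem()
            conn.close()
```

A hot container then pays the open cost once: repeated cache.get(path)
calls return the same connection until it falls off the end of the list.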

*Speeding up LIST*
* Lack of index for LIST is good, but means LIST will effectively read
the whole file.
* A 1024-byte page size is used. Moving to bigger page sizes reduces the
number of syscalls
** Firefox moving from 1K to 32K pages cut our DB IO by 1.2-2x
http://taras.glek.net/blog/2013/06/28/new-performance-people/
* Doing posix_fadvise(POSIX_FADV_WILLNEED) on the db file prior to opening
it with SQLite should help the OS read the db file in at maximum throughput.
This causes Linux to issue disk IO in 2MB chunks vs 128K with default
readahead settings. SQLite should really do this itself :(
* Appends end up fragmenting the db file. Should use SQLITE_FCNTL_CHUNK_SIZE
(http://www.sqlite.org/c3ref/c_fcntl_chunk_size.html#sqlitefcntlchunksize)
to grow the DB with less fragmentation OR copy (with fallocate) the sqlite
file over every time it doubles in size (eg during weekly compaction)
** Fragmentation means db scans are non-sequential on disk
** XFS is particularly susceptible to fragmentation. Can use filefrag on
.db files to monitor fragmentation
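
The page size and readahead suggestions above can be sketched in Python
roughly as follows. The helper names prefetch and set_page_size are invented
for this example, and whether either change actually helps swift is
untested. SQLITE_FCNTL_CHUNK_SIZE is left out because it is set through
sqlite3_file_control() in the C API, which the standard sqlite3 module does
not expose as far as I know.

```python
import os
import sqlite3


def prefetch(path):
    """Hint the kernel to read the whole file ahead of time.

    Returns False on platforms without posix_fadvise.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and len=0 cover the file from start to end.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
    return True


def set_page_size(path, page_size=32768):
    """Rebuild an existing db with a new page size; return the page size in use."""
    # isolation_level=None keeps the connection in autocommit mode,
    # since VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        conn.execute("PRAGMA page_size = %d" % page_size)
        # On an existing db the new page size only takes effect after VACUUM
        # (and not at all while the db is in WAL mode).
        conn.execute("VACUUM")
        return conn.execute("PRAGMA page_size").fetchone()[0]
    finally:
        conn.close()
```

prefetch(path) would be called right before sqlite3.connect(path) on the
LIST path; set_page_size(path) is a one-off rewrite of the file, so it would
fit in with whatever periodic compaction already happens.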

*Write amplification*
* Write amplification is bad because it causes table scans to be slower
than necessary (eg reading less data is always better for cache locality;
torgomatic says container dbs can get into gigabytes)
* swift uses timestamps in decimal seconds form, eg 1409350185.26144, stored
as a string. I'm guessing these are mainly used for HTTP headers, yet HTTP
uses whole seconds, which would normally only take up 4 bytes
* CREATE INDEX ix_object_deleted_name ON object (deleted, name) might be a
problem for delete-heavy workloads
** SQLite copies column entries used in indexes. Here the index almost
doubles the amount of space used by deleted entries
** Indexes in general are risky in sqlite, as they end up dispersed with
table data until a VACUUM. This causes table scan operations (eg during
LIST) to be suboptimal. This could also mean that operations that rely on
the index are no better IO-wise than a whole table scan.
* Deleted state is recorded twice: in the content type and in the deleted
field. This might not be a big deal.
* Ideally you'd be using a database that can be (lz4?) compressed at a
whole-file level. I'm not aware of a good off-the-shelf solution here. Some
column store might be a decent replacement for SQLite
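
A rough way to see the size effect of the timestamp encoding and of the
(deleted, name) index is to build small throwaway databases and compare file
sizes. The three-column table below is a simplified stand-in and not swift's
real schema, db_bytes is a made-up helper, and dropping the fractional
seconds is only an option if nothing depends on sub-second ordering, which
is an assumption here.

```python
import os
import sqlite3
import tempfile


def db_bytes(rows, created_at_type, with_index=False):
    """Write (name, created_at, deleted) rows to a fresh db; return its size."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sizes.db")
        conn = sqlite3.connect(path)
        conn.execute(
            "CREATE TABLE object (name TEXT, created_at %s, deleted INTEGER)"
            % created_at_type
        )
        conn.executemany("INSERT INTO object VALUES (?, ?, ?)", rows)
        if with_index:
            # Same index definition as quoted in the list above.
            conn.execute(
                "CREATE INDEX ix_object_deleted_name ON object (deleted, name)"
            )
        conn.commit()
        conn.close()
        return os.path.getsize(path)


names = ["obj%06d" % i for i in range(20000)]
as_text = [(n, "1409350185.26144", 0) for n in names]  # 16-byte string
as_int = [(n, 1409350185, 0) for n in names]  # fits a 4-byte integer

print("text timestamps:   ", db_bytes(as_text, "TEXT"))
print("integer timestamps:", db_bytes(as_int, "INTEGER"))
print("integer + index:   ", db_bytes(as_int, "INTEGER", with_index=True))
```

The absolute numbers depend on page size and SQLite version, so only the
relative differences between the three lines are worth looking at.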

Hope some of these observations are useful. If not, sorry for the noise.
I'm pretty impressed at swift-container's minimalist SQLite usage; I did
not see many footguns here.

Taras