[Openstack] [Swift]Support reading from archives
    Vyacheslav Rafalskiy 
    rafalskiy at gmail.com
       
    Wed Feb 19 21:43:19 UTC 2014
    
    
  
Hi all,
This is an attempt to get a discussion going on the following patch, which
introduces support for reading from archives:
https://review.openstack.org/#q,topic:bp/read-from-archives,n,z
Some comments are already reflected in the patch (thanks to Christian
Schwede and Michael Barton); see also the Discussion section below.
Motivation
----------
Currently Swift is not well suited to storing billions of small files. This
is a consequence of the fact that every object in Swift is a file on the
underlying file system (not counting the replicas). Every file requires its
metadata to be loaded into memory before it can be processed. Metadata is
normally cached by the file system, but when the total number of files is
too large and access is fairly random, caching no longer works and
performance quickly degrades. Swift's container and tenant catalogs, held
in SQLite databases, don't offer stellar performance either once the number
of items in them goes into the millions.
An alternative for this use case could be a database such as HBase or
Cassandra, which knows how to deal with BLOBs. Databases have their own
ways to aggregate data in large files and then find it when necessary.
However, using a database as storage has its own problems, one of which is
added complexity.
The above patch offers a way around this limitation of Swift for one
specific but important use case:
 1. one needs to store many small(ish) files, say 1-100 KB, which cause
performance degradation when stored separately
 2. these files don't change (too often), as in a data warehouse
 3. random access to the files is necessary
Solution
--------
The suggested solution is to aggregate the small files in archives, such as
zip or tar, of reasonable size. The archives can only be written as a
whole. They can, of course, be read as a whole with Swift's existing GET
command, like this (pseudocode):
GET /tenant/container/big_file.zip
The patch modifies the behavior of the command if additional parameters are
present. For example,
GET /tenant/container/big_file.zip?as=zip&list_content
will result in a text/plain response with a list of the files in the zip,
and
GET /tenant/container/big_file.zip?as=zip&get_content=content_file1.png,content_file2.bin
will return a multipart response with the requested files as binary
attachments.
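To make the two operations concrete, here is a rough sketch of what
list_content and get_content amount to for a zip archive held in memory. It
uses only Python's standard zipfile module and is not the code from the
patch; the function names are assumptions for illustration, and wrapping the
result in a multipart response is left out.

```python
import io
import zipfile


def list_content(archive_bytes):
    """Return the member names, one per line, as a text/plain body."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return "\n".join(zf.namelist())


def get_content(archive_bytes, names):
    """Return {member name: raw bytes} for the requested members only."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {name: zf.read(name) for name in names}


def make_demo_archive():
    """Build a tiny in-memory zip so the functions above can be tried out."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content_file1.png", b"\x89PNG fake")
        zf.writestr("content_file2.bin", b"\x00\x01\x02")
    return buf.getvalue()
```

Only the requested members are read, which is the point of the feature: the
client does not have to download the whole archive to get one small file.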
The additional GET functionality must be activated in the config file;
otherwise there is no change in Swift's behavior.
The total size of the attachments is limited to prevent an "explosion"
attack (a zip bomb) when decompressing files.
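One way such a limit could be enforced, sketched with the standard zipfile
module: add up the uncompressed sizes of the requested members before
reading any of them. The limit value, the exception and the function name
are hypothetical and not taken from the patch. Note that this check trusts
the sizes declared in the archive's own directory.

```python
import io
import zipfile

# Hypothetical limit; the real value would come from the config file.
MAX_TOTAL_BYTES = 1024 * 1024


class ContentTooLarge(Exception):
    """Raised when the requested members would exceed the size limit."""


def check_total_size(archive_bytes, names, limit=MAX_TOTAL_BYTES):
    """Return the total uncompressed size of the members, or raise."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # ZipInfo.file_size is the uncompressed size of the member.
        total = sum(zf.getinfo(name).file_size for name in names)
    if total > limit:
        raise ContentTooLarge(
            "requested %d bytes, limit is %d" % (total, limit))
    return total
```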
Discussion
----------
Some concerns were raised:
1. Decompression can put a significant additional load on the object server
True.
To mitigate on the client side: store the files in the archive without
compression. You can pre-compress them before storing.
If this is a concern to the service provider: do not activate the feature.
2. The response should be streamed rather than provided as a whole
I don't think so.
If you follow the use case, the total size of the archive should be
"reasonable", meaning not too small. However, if your archives are larger
than a couple of megabytes, you are doing it wrong. A get_content request
would normally include only a small portion of the archive, so no streaming
is necessary.
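The client-side mitigation from concern 1 (store rather than compress) can
be sketched as follows. This is a minimal example using Python's zipfile
module with ZIP_STORED, so the object server has nothing to decompress on
read; the helper name pack_stored is an assumption for illustration.

```python
import io
import zipfile


def pack_stored(files):
    """Pack {name: bytes} into a zip archive without compression."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in files.items():
            # Members are stored verbatim; pre-compress data beforehand
            # if saving space matters.
            zf.writestr(name, data)
    return buf.getvalue()
```

The resulting bytes would then be uploaded as a single object with the
usual PUT.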
TODO
----
Tests
Thanks,
Vyacheslav