[openstack-dev] Feedback about Swift API - Especially about Large Objects

Clay Gerrard clay.gerrard at gmail.com
Fri Oct 9 21:24:26 UTC 2015


A lot of these deficiencies are drastically improved with static large
objects (SLOs) - and they are non-trivial (impossible?) to address with DLOs
because of their dynamic nature.  It's unfortunate, but DLOs don't really
serve your use case very well - and you should find a way to transition to
SLOs [1].
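
For anyone making that transition: an SLO manifest is an explicit JSON list
of segments, not a prefix match. Here is a minimal sketch of building one,
assuming the segments are already uploaded. The helper name, the container
and the segment names are made up for illustration; the "path", "etag" and
"size_bytes" keys are the ones the SLO middleware expects.

```python
import hashlib
import json


def build_slo_manifest(container, segments):
    """Return the JSON body for an SLO manifest.

    `segments` is a list of (object_name, data_bytes) pairs. A real client
    would reuse the ETag and size reported when each segment was uploaded
    instead of hashing the data again.
    """
    entries = []
    for name, data in segments:
        entries.append({
            "path": "/%s/%s" % (container, name),
            "etag": hashlib.md5(data).hexdigest(),
            "size_bytes": len(data),
        })
    return json.dumps(entries)


# Hypothetical container and segment names, for illustration only.
body = build_slo_manifest("large_files_segments",
                          [("5G/000001", b"first part"),
                           ("5G/000002", b"second part")])
print(body)
# The body would then be sent with:
#   PUT /v1/<account>/<container>/<object>?multipart-manifest=put
```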

We talked about improving the checksumming behavior in SLOs for the
general naive sync case back at the hack-a-thon before the Vancouver summit
- but it's tricky (MD5 => CRC) - and would probably require an API version
bump.

All we've been able to get done so far is improve the native client
handling [2] - but if you're using SLOs you may find a similar solution
quite manageable.

Thanks for the feedback.

-Clay

1.
http://docs-draft.openstack.org/91/219991/7/check/gate-swift-docs/75fb84c//doc/build/html/overview_large_objects.html#module-swift.common.middleware.slo
2.
https://github.com/openstack/python-swiftclient/commit/ff0b3b02f07de341fa9eb81156ac2a0565d85cd4

On Friday, October 9, 2015, Pierre SOUCHAY <pierre.souchay at cloudwatt.com>
wrote:

> Hi Swift Developers,
>
> We have been using Swift as an IaaS provider for more than two years now,
> but this mail is about feedback on the API side. I think it would be great
> to include some of these ideas in future revisions of the API.
>
> I’ve been developing a few Swift clients: in HTML (in the Cloudwatt
> Dashboard) with CORS, in Java with a Swing GUI
> (https://github.com/pierresouchay/swiftbrowser) and in Go for Swift to
> filesystem sync (https://github.com/pierresouchay/swiftsync/), so I now
> have a few ideas about how to improve the API a bit.
>
> The API is quite straightforward and intuitive to use, and writing a
> client is not that difficult, but unfortunately, the Large Object support
> is not easy at all to deal with.
>
> The biggest issue is that there is no way to know whether a file is a
> large object when performing listings using the JSON format, since, AFAIK,
> a large object appears as an object with 0 bytes (so its size in bytes is
> 0), and it also has the hash of a zero-byte file.
>
> For instance, the listing entry for such an object is:
>  {"hash": "d41d8cd98f00b204e9800998ecf8427e", "last_modified":
> "2015-06-04T10:23:57.618760", "bytes": 0, "name": "5G",
> "content_type": "octet/stream"}
>
> which is exactly the hash of a 0-byte file:
> $ echo -n | md5
> d41d8cd98f00b204e9800998ecf8427e
>
> OK, now let's try HEAD:
> $ curl -vv -XHEAD -H X-Auth-Token:$TOKEN \
>   'https://storage.fr1.cloudwatt.com/v1/AUTH_61b8fe6dfd0a4ce69f6622ea74444e0f/large_files/5G'
> < HTTP/1.1 200 OK
> < Date: Fri, 09 Oct 2015 19:43:09 GMT
> < Content-Length: 5000000000
> < Accept-Ranges: bytes
> < X-Object-Manifest: large_files/5G/.part-5000000000-
> < Last-Modified: Thu, 04 Jun 2015 10:16:33 GMT
> < Etag: "479517ec4767ca08ed0547dca003d116"
> < X-Timestamp: 1433413437.61876
> < Content-Type: octet/stream
> < X-Trans-Id: txba36522b0b7743d683a5d-00561818cd
>
> WTF? While all regular files have the same value for ETag and hash, this
> is not the case for large files…
>
> Furthermore, the ETag is not the MD5 of the whole file, but the hash of
> the hashes of all the parts referenced by the manifest (as described
> somewhere deep in the documentation).
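
To make the "hash of hashes" concrete, here is a minimal sketch. It assumes
the manifest ETag is the MD5 of the concatenated hex ETags of the segments,
taken in listing order; the function names are made up for illustration.

```python
import hashlib


def dlo_etag(segment_etags):
    """MD5 of the concatenated hex ETags of the segments, in order."""
    md5 = hashlib.md5()
    for etag in segment_etags:
        # ETag headers are usually quoted; strip the quotes before hashing.
        md5.update(etag.strip('"').encode("ascii"))
    return md5.hexdigest()


def local_dlo_etag(data, segment_size):
    """ETag expected for `data` if it were split into fixed-size segments."""
    etags = [hashlib.md5(data[i:i + segment_size]).hexdigest()
             for i in range(0, len(data), segment_size)]
    return dlo_etag(etags)


data = b"x" * 10
print("md5 of whole file :", hashlib.md5(data).hexdigest())
print("manifest-style tag:", local_dlo_etag(data, 4))
```

The two values differ, and the second one depends on the segment size, so a
client can only reproduce it if it knows how the object was split.
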
>
> Why is this a problem?
> -------------------------------
>
> Imagine a « naive » client using the API which performs some kind of sync.
>
> The client downloads each file and, when it syncs, compares the local MD5
> to the MD5 in the listing… of course, that hash is the hash of a zero-byte
> file… so it downloads the file again… and again… and again. Unfortunately
> for our naive client, this is exactly the kind of file we don’t want to
> download twice… since the file is probably huge (after all, it has been
> split for a reason, no?)
>
> I think this is really a design flaw, since you need to know everything
> about the Swift API and its extensions to get proper behavior. The minimum
> would be for the listing to at least return the same value as the ETag
> header.
>
> OK, let’s continue…
>
> We are not so naive… our Swift sync client knows that 0-byte files need
> more work.
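
That extra check can be sketched as follows. This is only an illustration
of the decision a sync client has to make per listing entry; the function
name and the return values are made up.

```python
EMPTY_MD5 = "d41d8cd98f00b204e9800998ecf8427e"  # MD5 of zero bytes


def sync_action(entry, local_md5):
    """Decide what to do with one JSON listing entry.

    Returns "head" when the listing cannot be trusted (the entry may be a
    large object manifest or a real empty file), "skip" when the local copy
    matches, and "download" otherwise. `local_md5` is the hex MD5 of the
    local copy, or None if there is no local copy.
    """
    if entry["bytes"] == 0 and entry["hash"] == EMPTY_MD5:
        return "head"
    if local_md5 == entry["hash"]:
        return "skip"
    return "download"


# The listing entry quoted earlier in this mail (trimmed).
entry = {"hash": "d41d8cd98f00b204e9800998ecf8427e", "bytes": 0, "name": "5G"}
print(sync_action(entry, None))  # → head
```
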
>
> * First issue: we have to know whether the file is a « real » 0-byte
> file or not. You may think most people do not create 0-byte files after
> all… this is wrong. Actually, I have seen two Object Storage middlewares
> using many 0-byte files (for instance to store metadata or to set up some
> kind of directory-like structure). So, in this case, we need to perform a
> HEAD request for each 0-byte file. If you have 1000 files like this, you
> have to perform 1000 HEAD requests to finally learn that none of them is a
> large file. Not very efficient. Your Swift sync client took 1 second to
> sync 20G of data with the naive approach; now, you need 5 minutes…
> Reporting the hash of 0 bytes is not a good idea at all.
>
> * Second issue: since the hash is the hash of all parts (I have an idea
> about why this decision was made, probably for performance reasons), your
> client cannot work on files directly, since the hash of the local file is
> not the hash of the Swift aggregated file (which is the hash of all the
> hashes of the parts). So, it means you cannot work on existing data; you
> have to either:
>  - split all the files in the same way as the manifest, compute the MD5 of
> each part, then compute the MD5 of the hashes and compare it to the MD5 on
> the server… (OK… doable, but I gave up on such a system)
>  - have a local database in your client (when you download, store the REAL
> hash of the file and record that you have to compare it to the hash
> returned by the server)
>  - perform some kind of crappy heuristics (size + grab the starting bytes
> of each part or something like that…)
>
> * Third issue:
>  - If you don’t want to store the parts of your large objects locally, you
> have to wait for all your HEAD requests to finish, since that is the only
> way to find out which files are referenced by the manifest headers.
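
Once the HEAD responses are in, finding the parts can be sketched like
this. It assumes the X-Object-Manifest value has the form
"<container>/<prefix>" (URL-encoded), as in the HEAD output above; the
function names and the sample listing names are made up for illustration.

```python
from urllib.parse import unquote


def parse_object_manifest(value):
    """Split an X-Object-Manifest value into (container, prefix)."""
    container, _, prefix = unquote(value).partition("/")
    return container, prefix


def segment_names(listing, prefix):
    """Names in a container listing that are parts of the large object."""
    return [e["name"] for e in listing if e["name"].startswith(prefix)]


container, prefix = parse_object_manifest("large_files/5G/.part-5000000000-")
print((container, prefix))  # → ('large_files', '5G/.part-5000000000-')

# Hypothetical listing of the "large_files" container.
listing = [{"name": "5G"},
           {"name": "5G/.part-5000000000-00000000"},
           {"name": "notes.txt"}]
print(segment_names(listing, prefix))
```
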
>
> To summarize, I think the current API really needs some refinements to
> the listings, since a competent developer may trust the bytes value and the
> hash value and create an algorithm that does not behave nicely. So, the API
> looks easy but is in fact much more complicated than expected.
>
> A few ideas to improve it:
>
> In listings, if an object is a large object:
>  - either put the real MD5 of the file if it is technically doable… or
> remove it (so naive programs will work nicely)… same thing for bytes.
>  - add an optional field in the JSON to tell that the object is in fact a
> large object. A nice value for such a field would be the object-manifest
> header value. So a client could know whether the file is a large file or
> simply a zero-byte object, and also know which objects are in fact parts
> of a larger one (without waiting for thousands of HEAD requests to
> finish).
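
A sketch of what a client could do with such a field. The "object_manifest"
key is purely hypothetical: it is the proposal above, not something Swift
listings return, and the function name is made up as well.

```python
def classify(entry):
    """Classify a listing entry without issuing a HEAD request.

    Relies on a hypothetical "object_manifest" key carrying the value of
    the X-Object-Manifest header for large objects.
    """
    if entry.get("object_manifest"):
        return "large-object"
    if entry["bytes"] == 0:
        return "empty"
    return "regular"


proposed = {"name": "5G", "bytes": 0,
            "object_manifest": "large_files/5G/.part-5000000000-"}
print(classify(proposed))                         # → large-object
print(classify({"name": "marker", "bytes": 0}))   # → empty
```
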
>
> Finally, to help people create interfaces quickly, add an option to
> enable CORS for all containers of an account. At our cloud provider, we
> added a REST call in another web service, with CORS enabled, that ensures
> a container has CORS set up. So, browsing Swift with HTML5 interfaces is
> easy. Doing so would, I think, greatly increase Swift usage (by not
> needing any specific software to browse Swift).
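
For reference, CORS is currently configured per container through
container metadata. A minimal sketch of the headers such a helper service
might send in a POST to each container follows; the function name and the
example origin are made up, and how the request is authenticated and sent
is left out.

```python
def container_cors_headers(origins, max_age=None):
    """Build the metadata headers that enable CORS on one container.

    `origins` is a list of allowed origins; Swift expects them as a
    space-separated list in a single header.
    """
    headers = {
        "X-Container-Meta-Access-Control-Allow-Origin": " ".join(origins),
    }
    if max_age is not None:
        headers["X-Container-Meta-Access-Control-Max-Age"] = str(max_age)
    return headers


# Hypothetical dashboard origin, for illustration only.
print(container_cors_headers(["https://dashboard.example.com"], max_age=3600))
```
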
>
> Best Regards
>
>
> --
> Pierre Souchay <pierre.souchay at cloudwatt.com>
> Software Architect @ CloudWatt
>
> Adresse : ETIK 892, Rue Yves Kermen 92100 Boulogne-Billancourt
> N° Standard : +33 1 84 01 04 04
> N° Fax : +33 1 84 01 04 05
>