[Openstack] Enabling data deduplication on Swift

Maru Newby mnewby at internap.com
Sun Mar 11 02:59:55 UTC 2012


Ah, right.  I'm sorry to say I hadn't TFA yet and wasn't thinking of block-level deduplication.  But though block-level compares might be possible, the distributed nature of swift would likely make this quite complicated.  It might make sense for KVM to compare memory pages, but I imagine that locality of data (all on the same host) figured heavily in that implementation decision.

I am, however, confused by your assertion that rsync 'relies on comparison to ensure that the match is real, and not just a hash collision'.  

From my reading, rsync computes an MD5 hash and a weaker (and easier to compute) 'rolling checksum' on the recipient data, and then the sender also computes the computationally cheap 'checksum' and will only bother computing the MD5 hash when the checksums match.  So rsync is optimizing hash generation, but is still relying on hash collision rather than 'comparison'.  Or am I missing something?

Regardless, I enjoyed reading Paulo's post and learning about the role deduplication might play in balancing storage and retrieval costs in swift.

Cheers,


Maru


On 2012-03-10, at 1:13 PM, andi abes wrote:

> Maybe a happy path exists, between efficiency and correctness ;) I
> think the Rsync is probably a good comparison to the use case at hand
> (it identifies identical blocks between the source and target, and
> only sends deltas of the wire).
> It combines a quick has to identify candidates that might be
> duplicates, but relies on comparison to ensure that the match is real,
> and not just a hash collision.
> 
> See the source of all knowledge:
> http://en.wikipedia.org/wiki/Rsync#Algorithm
> 
> 
> 
> 
> 
> 
> On Sat, Mar 10, 2012 at 1:15 PM, Maru Newby <mnewby at internap.com> wrote:
>> Hi Joe,
>> 
>> There's one huge difference between page deduplication and object
>> deduplication:  Page size is small and predictable, whereas object size is
>> not.  Given this, full compares would not be a good way to implement
>> performant object deduplication in swift.
>> 
>> Thanks,
>> 
>> 
>> Maru
>> 
>> 
>> On 2012-03-10, at 9:57 AM, Joe Gordon wrote:
>> 
>> Paulo, Caitlin,
>> 
>> 
>> Can SHA-1 collisions be generated?  If so can you point me to the article?
>> 
>> Also why compare hashes in the first place?  Linux 'Kenel Samepage Merging',
>> which does page deduplication for KVM, does a full compare to be safe [1].
>>  Even if collisions can't be generated, what are the odds of a collision
>> (for SHA-1 and SHA-256) happening by chance when using Swift at scale?
>> 
>> 
>> best,
>> Joe Gordon
>> 
>> 
>> 
>> 
>> [1] http://www.linux-kvm.com/sites/default/files/KvmForum2008_KSM.pdf
>> 
>> 
>> On Fri, Mar 9, 2012 at 4:44 PM, Caitlin Bestler
>> <Caitlin.Bestler at nexenta.com> wrote:
>>> 
>>> Paulo,
>>> 
>>> 
>>> 
>>> I believe you’ll find that we’re thinking along the same lines. Please
>>> review my proposal at http://etherpad.openstack.org/P9MMYSWE6U
>>> 
>>> 
>>> 
>>> One quick observation is that SHA-1 is totally inadequate for
>>> fingerprinting objects in a public object store. An attacker could easily
>>> 
>>> predict the fingerprint of content likely to be posted, generate alternate
>>> content that had the same SHA-1 fingerprint and pre-empt
>>> 
>>> the signature. For example: an ISO of an open source OS distribution. If I
>>> get my false content with the same fingerprint into the
>>> 
>>> repository first then everyone who downloads that ISO will get my altered
>>> copy.
>>> 
>>> 
>>> 
>>> SHA-256 is really needed to make this type of attack infeasible.
>>> 
>>> 
>>> 
>>> I also think that distributed deduplication works very well with object
>>> versioning. Your comments on the proposal cited above
>>> 
>>> would be great to hear.
>>> 
>>> 
>>> 
>>> From: openstack-bounces+caitlin.bestler=nexenta.com at lists.launchpad.net
>>> [mailto:openstack-bounces+caitlin.bestler=nexenta.com at lists.launchpad.net]
>>> On Behalf Of Paulo Ricardo Motta Gomes
>>> Sent: Thursday, March 08, 2012 1:19 PM
>>> To: openstack at lists.launchpad.net
>>> 
>>> 
>>> Subject: [Openstack] Enabling data deduplication on Swift
>>> 
>>> 
>>> 
>>> Hello everyone,
>>> 
>>> 
>>> 
>>> I'm a student of the European Master in Distributed Computing (EMDC)
>>> currently working on my master thesis on distributed content-addressable
>>> storage/deduplication.
>>> 
>>> 
>>> 
>>> I'm happy to announce I will be contributing the outcome of my thesis work
>>> to OpenStack by enabling both object-level and block-level deduplication
>>> functionality on Swift
>>> (https://answers.launchpad.net/swift/+question/156862).
>>> 
>>> 
>>> 
>>> I have written a detailed blog post where I describe the initial
>>> architecture of my
>>> solution: http://paulormg.com/2012/03/05/enabling-deduplication-in-a-distributed-object-storage/
>>> 
>>> 
>>> 
>>> Feedback from the OpenStack/Swift community would be very appreciated.
>>> 
>>> 
>>> 
>>> Cheers,
>>> 
>>> 
>>> 
>>> Paulo
>>> 
>>> 
>>> 
>>> --
>>> European Master in Distributed Computing - www.kth.se/emdc
>>> Royal Institute of Technology - KTH
>>> 
>>> Instituto Superior Técnico - IST
>>> 
>>> http://paulormg.com
>>> 
>>> 
>>> _______________________________________________
>>> Mailing list: https://launchpad.net/~openstack
>>> Post to     : openstack at lists.launchpad.net
>>> Unsubscribe : https://launchpad.net/~openstack
>>> More help   : https://help.launchpad.net/ListHelp
>>> 
>> 
>> _______________________________________________
>> Mailing list: https://launchpad.net/~openstack
>> Post to     : openstack at lists.launchpad.net
>> Unsubscribe : https://launchpad.net/~openstack
>> More help   : https://help.launchpad.net/ListHelp
>> 
>> 
>> 
>> _______________________________________________
>> Mailing list: https://launchpad.net/~openstack
>> Post to     : openstack at lists.launchpad.net
>> Unsubscribe : https://launchpad.net/~openstack
>> More help   : https://help.launchpad.net/ListHelp
>> 





More information about the Openstack mailing list