[Openstack] Enabling data deduplication on Swift

andi abes andi.abes at gmail.com
Sat Mar 10 21:13:07 UTC 2012


Maybe a happy path exists between efficiency and correctness ;) I
think rsync is probably a good comparison to the use case at hand
(it identifies identical blocks between the source and target, and
only sends deltas over the wire).
It combines a quick hash to identify candidates that might be
duplicates with a second, more expensive check (in rsync's case a
stronger checksum) to ensure that the match is real, and not just a
collision of the quick hash.
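
To make the two-step idea concrete, here is a minimal sketch of how it
might look for whole-object deduplication. Everything in it is
hypothetical: DedupStore, put, get and the fingerprint parameter are
illustration names, not Swift APIs, and the confirming step is assumed
to be a full byte comparison (as in the KSM approach quoted below)
rather than rsync's second checksum.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, int]  # (fingerprint, slot within that fingerprint's bucket)


def sha256_hex(data: bytes) -> str:
    """Default fingerprint; any hash function could be swapped in."""
    return hashlib.sha256(data).hexdigest()


class DedupStore:
    """Toy in-memory store (hypothetical, not a Swift API).

    The fingerprint only nominates candidates; a byte-for-byte
    comparison decides whether two objects are really identical.
    """

    def __init__(self, fingerprint: Callable[[bytes], str] = sha256_hex) -> None:
        self._fingerprint = fingerprint
        self._buckets: Dict[str, List[bytes]] = {}

    def put(self, data: bytes) -> Tuple[Key, bool]:
        """Store data and return (key, was_duplicate)."""
        digest = self._fingerprint(data)
        bucket = self._buckets.setdefault(digest, [])
        for slot, existing in enumerate(bucket):
            if existing == data:  # full compare, only on fingerprint match
                return (digest, slot), True
        # Either a new fingerprint or a genuine collision: keep a new copy.
        bucket.append(data)
        return (digest, len(bucket) - 1), False

    def get(self, key: Key) -> bytes:
        digest, slot = key
        return self._buckets[digest][slot]
```

The full compare is only paid when fingerprints already match, so for
unique objects the cost is one hash. For large duplicate objects the
compare is still a full read of both copies, which is the cost Maru
points out below.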

See the source of all knowledge:
http://en.wikipedia.org/wiki/Rsync#Algorithm
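
On the question quoted below about the odds of a chance collision, a
rough estimate comes from the birthday bound: with n objects and a
b-bit hash, the probability of any accidental collision is at most
n(n-1)/2 divided by 2^b. The sketch below just evaluates that bound.
The object count of 10^12 is an assumption picked for illustration,
not a Swift figure, and the bound assumes uniformly distributed hash
outputs, so it says nothing about deliberately constructed collisions.

```python
def collision_probability(num_objects: int, hash_bits: int) -> float:
    """Birthday (union) bound on the chance of any accidental collision.

    Accurate as an approximation only while the result is much less than 1.
    """
    pairs = num_objects * (num_objects - 1) // 2
    return pairs / 2 ** hash_bits


if __name__ == "__main__":
    n = 10 ** 12  # assumed object count, for illustration only
    for name, bits in (("SHA-1", 160), ("SHA-256", 256)):
        print(f"{name}: {collision_probability(n, bits):.2e}")
```

Under those assumptions the bound is on the order of 1e-25 for SHA-1
and 1e-54 for SHA-256, so accidental collisions are not the practical
concern; intentional ones are a separate matter.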






On Sat, Mar 10, 2012 at 1:15 PM, Maru Newby <mnewby at internap.com> wrote:
> Hi Joe,
>
> There's one huge difference between page deduplication and object
> deduplication:  Page size is small and predictable, whereas object size is
> not.  Given this, full compares would not be a good way to implement
> performant object deduplication in swift.
>
> Thanks,
>
>
> Maru
>
>
> On 2012-03-10, at 9:57 AM, Joe Gordon wrote:
>
> Paulo, Caitlin,
>
>
> Can SHA-1 collisions be generated?  If so can you point me to the article?
>
> Also why compare hashes in the first place?  Linux 'Kernel Samepage Merging',
> which does page deduplication for KVM, does a full compare to be safe [1].
> Even if collisions can't be generated, what are the odds of a collision
> (for SHA-1 and SHA-256) happening by chance when using Swift at scale?
>
>
> best,
> Joe Gordon
>
>
>
>
> [1] http://www.linux-kvm.com/sites/default/files/KvmForum2008_KSM.pdf
>
>
> On Fri, Mar 9, 2012 at 4:44 PM, Caitlin Bestler
> <Caitlin.Bestler at nexenta.com> wrote:
>>
>> Paulo,
>>
>>
>>
>> I believe you’ll find that we’re thinking along the same lines. Please
>> review my proposal at http://etherpad.openstack.org/P9MMYSWE6U
>>
>>
>>
>> One quick observation is that SHA-1 is totally inadequate for
>> fingerprinting objects in a public object store. An attacker could easily
>> predict the fingerprint of content likely to be posted, generate alternate
>> content that had the same SHA-1 fingerprint and pre-empt
>> the signature. For example: an ISO of an open source OS distribution. If I
>> get my false content with the same fingerprint into the
>> repository first then everyone who downloads that ISO will get my altered
>> copy.
>>
>>
>>
>> SHA-256 is really needed to make this type of attack infeasible.
>>
>>
>>
>> I also think that distributed deduplication works very well with object
>> versioning. Your comments on the proposal cited above
>> would be great to hear.
>>
>>
>>
>> From: openstack-bounces+caitlin.bestler=nexenta.com at lists.launchpad.net
>> [mailto:openstack-bounces+caitlin.bestler=nexenta.com at lists.launchpad.net]
>> On Behalf Of Paulo Ricardo Motta Gomes
>> Sent: Thursday, March 08, 2012 1:19 PM
>> To: openstack at lists.launchpad.net
>>
>>
>> Subject: [Openstack] Enabling data deduplication on Swift
>>
>>
>>
>> Hello everyone,
>>
>>
>>
>> I'm a student of the European Master in Distributed Computing (EMDC)
>> currently working on my master thesis on distributed content-addressable
>> storage/deduplication.
>>
>>
>>
>> I'm happy to announce I will be contributing the outcome of my thesis work
>> to OpenStack by enabling both object-level and block-level deduplication
>> functionality on Swift
>> (https://answers.launchpad.net/swift/+question/156862).
>>
>>
>>
>> I have written a detailed blog post where I describe the initial
>> architecture of my
>> solution: http://paulormg.com/2012/03/05/enabling-deduplication-in-a-distributed-object-storage/
>>
>>
>>
>> Feedback from the OpenStack/Swift community would be very appreciated.
>>
>>
>>
>> Cheers,
>>
>>
>>
>> Paulo
>>
>>
>>
>> --
>> European Master in Distributed Computing - www.kth.se/emdc
>> Royal Institute of Technology - KTH
>>
>> Instituto Superior Técnico - IST
>>
>> http://paulormg.com
>>
>>
>> _______________________________________________
>> Mailing list: https://launchpad.net/~openstack
>> Post to     : openstack at lists.launchpad.net
>> Unsubscribe : https://launchpad.net/~openstack
>> More help   : https://help.launchpad.net/ListHelp
>>



