<div>Hi Sam,</div><div> </div><div>Your mail prompted a few thoughts. EC is better suited to a centralized storage solution than to a distributed one. In a distributed system it will put a heavy load on the internal network.</div><div> </div>
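<div>To put a rough number on that network load, here is a minimal sketch. It assumes an MDS code such as Reed-Solomon with naive repair, where rebuilding one lost block means reading k surviving blocks of the same size. The function names are made up for illustration and are not part of Swift.</div>

```python
def repair_read_bytes_replication(lost_bytes):
    """Bytes read over the network to re-create lost replica data.

    Assumption: one surviving replica is copied once to the new location.
    """
    return lost_bytes


def repair_read_bytes_ec(lost_bytes, k):
    """Bytes read over the network to rebuild lost erasure-coded blocks.

    Assumption: naive repair of a k-data-block MDS code, where each lost
    block is recomputed from k surviving blocks of the same size.
    """
    return lost_bytes * k


if __name__ == "__main__":
    lost = 10**12  # hypothetical: 1 TB of data lost on a failed drive
    print("replication:", repair_read_bytes_replication(lost))
    print("10:2 EC:    ", repair_read_bytes_ec(lost, k=10))
```

<div>Under these assumptions, a failed drive in a 10:2 setup pulls about ten times as many bytes across the internal network as re-replicating the same amount of lost data.</div>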

<div>Actually, from a network-performance point of view, keeping 3 copies is itself something of a compromise. There should be a way to combine the two approaches and tune the parameters for different scenarios. Each combination, however, would give different reliability and performance.</div>
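<div>As a rough illustration of that trade-off, the sketch below compares 3 replicas with 10:2 coding on raw storage per byte of data, the number of losses tolerated, and the chance that a GET can be served. It assumes each object server is up independently with probability p_up, which real clusters only approximate. All names are illustrative.</div>

```python
from math import comb


def storage_overhead(total_pieces, pieces_needed):
    """Raw bytes stored per byte of user data."""
    return total_pieces / pieces_needed


def losses_tolerated(total_pieces, pieces_needed):
    """How many pieces can be lost before the object is unreadable."""
    return total_pieces - pieces_needed


def read_availability(total_pieces, pieces_needed, p_up):
    """Probability that at least `pieces_needed` of `total_pieces` servers are up.

    Assumption: servers fail independently, each up with probability p_up.
    """
    return sum(
        comb(total_pieces, i) * p_up**i * (1 - p_up) ** (total_pieces - i)
        for i in range(pieces_needed, total_pieces + 1)
    )


if __name__ == "__main__":
    # (total pieces, pieces needed to read): 3 replicas vs. 10 data + 2 parity
    schemes = {"3 replicas": (3, 1), "10:2 EC": (12, 10)}
    for name, (total, needed) in schemes.items():
        print(
            name,
            "overhead:", storage_overhead(total, needed),
            "losses tolerated:", losses_tolerated(total, needed),
            "read availability:", read_availability(total, needed, p_up=0.99),
        )
```

<div>Both schemes tolerate two losses, but 10:2 coding stores 1.2 bytes per byte of data instead of 3, while needing 10 of 12 servers reachable instead of 1 of 3.</div>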

<div> </div><div>Regards,</div><div>Howard<br></div><div class="gmail_quote">On Thu, Oct 18, 2012 at 7:30 AM, Eugene Kirpichov <span dir="ltr"><<a href="mailto:ekirpichov@gmail.com" target="_blank">ekirpichov@gmail.com</a>></span> wrote:<br>

<blockquote style="margin:0px 0px 0px 0.8ex;padding-left:1ex;border-left-color:rgb(204,204,204);border-left-width:1px;border-left-style:solid" class="gmail_quote">Hi Sam,<br>

<br>

My five cents.<br>

<br>

Using Fountain codes, which are also a class of EC, one can make all<br>

the blocks equivalent in role (no separation into data and parity<br>

blocks).<br>

<a href="http://en.wikipedia.org/wiki/Fountain_code" target="_blank">http://en.wikipedia.org/wiki/Fountain_code</a><br>

<br>

They resolve a few of the issues that you raised, however they may<br>

raise others - e.g. it's more difficult to determine how many blocks<br>

you need to fetch to reconstruct the data.<br>
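<br>
To illustrate that last point, here is a minimal sketch of a random linear fountain code over GF(2),<br>
the simplest member of the family (not LT or Raptor codes). Each encoded block is the XOR of a random<br>
subset of the k source blocks, and the receiver can decode once the blocks it holds span all k sources.<br>
The function name is hypothetical.<br>

```python
import random


def symbols_needed(k, rng):
    """Count random GF(2) combinations drawn until all k source blocks are decodable.

    Each encoded block is modelled only by its coefficient vector, stored as an
    int bitmask: bit i set means source block i was XORed into the block.
    """
    basis = {}  # leading-bit position -> reduced coefficient vector
    drawn = 0
    while len(basis) < k:
        vec = rng.getrandbits(k)  # a random subset of the k source blocks
        drawn += 1
        # Gaussian elimination over GF(2): reduce against the current basis.
        while vec:
            lead = vec.bit_length() - 1
            if lead not in basis:
                basis[lead] = vec  # linearly independent, so it is useful
                break
            vec ^= basis[lead]  # cancel the leading bit and keep reducing
        # If vec reached 0, the block was redundant and added nothing.
    return drawn


if __name__ == "__main__":
    rng = random.Random(2012)
    k = 10
    trials = [symbols_needed(k, rng) for _ in range(1000)]
    print("min:", min(trials), "max:", max(trials),
          "mean:", sum(trials) / len(trials))
```

The count is always at least k and varies from run to run, so a reader cannot know in advance<br>
exactly how many blocks to fetch.<br>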

<div class="HOEnZb"><div class="h5"><br>

On Wed, Oct 17, 2012 at 4:24 PM, Samuel Merritt <<a href="mailto:sam@swiftstack.com">sam@swiftstack.com</a>> wrote:<br>

> On 10/15/12 5:36 PM, Duan, Jiangang wrote:<br>

>><br>

>> Some of our customers are interested in erasure coding rather than triple replication to<br>

>> save disk space.<br>

>> We propose a BP "Light weight Erasure code framework for swift", which can<br>

>> be found here <a href="https://blueprints.launchpad.net/swift/+spec/swift-ec" target="_blank">https://blueprints.launchpad.net/swift/+spec/swift-ec</a><br>

>> The general idea is to have a daemon on the storage nodes do an offline scan<br>

>> and select cold objects that are big enough for EC.<br>

>><br>

>> We will be glad to hear any feedback on this.<br>

><br>

><br>

> Here, in no particular order, are some thoughts I have.<br>

><br>

> - Object blocks (both data blocks and parity blocks) will need to be marked<br>

> somehow so that 3 replicas of each block aren't kept. This is a pretty<br>

> fundamental change to Swift; up until now, all objects are treated the same.<br>

> It's essentially introducing the notion of tiered storage into Swift.<br>

><br>

> - Who's responsible for ensuring the presence of all the blocks? That is,<br>

> assume you have an object that's been split into ten data blocks (D1, D2,<br>

> ..., D10) and 2 parity blocks (P1, P2). The drive with D7 on it dies. Which<br>

> replicator(s) is(are) responsible for rebuilding D7 and storing it on a<br>

> handoff node?<br>

><br>

> If you have the replicators on each block's machine checking for failures,<br>

> then you'll wind up with more people checking each replica. Here, it would<br>

> be 11 replicators ensuring that each block is present. Compare that to the<br>

> full-replication case, where there are 2 replicators checking on it. That's<br>

> going to result in more traffic on the internal network.<br>

><br>

> - There will need to be throttles on the transformation daemons (replica -><br>

> EC and vice versa), as that's very IO intensive. If a big bunch of data is<br>

> uploaded at one time and then not accessed (think large backups), then that<br>

> could be a ticking time bomb for my cluster performance. After those objects<br>

> become "cold", the transformation daemons will thrash my disks and network<br>

> turning them into EC-type objects.<br>

><br>

> - Does this open up a Swift cluster to a DoS attack? If my objects are<br>

> stored w/EC, then can someone go through and request a few bytes from each<br>

> object in my cluster a few times and cause all my objects to get "hot"?<br>

> Under the proposed scheme, this would turn my objects from EC-storage to<br>

> replica-storage, filling up my disks and killing my cluster. To mitigate<br>

> that, I'd have to keep enough disk around to hold 3 replicas of everything,<br>

> and at that point, I may as well just keep the 3 replicas.<br>

><br>

> - Another thought for a resource-consumption attack: can someone slowly walk<br>

> my objects and make a large fraction (say, 5%) of them hot each day? That<br>

> seems like it would make the transformation daemons run at maximum capacity<br>

> all the time trying to keep up.<br>

><br>

> - Retrieval of EC-stored objects becomes more failure-prone. With<br>

> replica-stored objects, 1 out of 3 object servers has to be available for a<br>

> GET request to work. With EC-stored objects and a 10:2 coding, 10 out of 12<br>

> object servers have to be available. That makes network partitions much<br>

> worse for data availability.<br>

><br>

> - EC-storage is at odds with geographic replication. Of course, Swift<br>

> supports neither one today. However, with geographic replication, one wants<br>

> to have a local replica of each object in each geographic region, which<br>

> results in more copies for lower latency. With EC-storage, less data is<br>

> stored. When they're combined, the result is a whole lot of traffic across<br>

> slow, expensive WAN links.<br>

><br>

> - Recombining EC-stored object chunks is going to chew up a ton more CPU on<br>

> either the object or proxy servers, depending on which one does it. If the<br>

> proxy, then it'll add more to an already CPU-heavy workload. If the object<br>

> server, then it'll make using big storage boxes less practical (like one of<br>

> the 48-drives-in-4U servers one can buy).<br>

><br>

> - Can one change the EC-coding level? That is, if I'm using 10:2 coding (so<br>

> each object turns into 10 data blocks and 2 parity blocks), can I change<br>

> that later? Will that have massive performance impacts on my cluster as more<br>

> data blocks are computed?<br>

><br>

> It may be that this is like changing the replica count, and the answer is<br>

> "yes, but your cluster will thrash for a long time after you do it".<br>

><br>

> - Where's the original checksum stored? Clearly, each block will have its<br>

> own checksum for the auditors to use. However, if a client issues a request<br>

> like "HEAD /a/c/o", that'll contain the checksum of the original file. Does<br>

> that live somewhere, or will the proxy have to read all the bytes and<br>

> determine the checksum?<br>

><br>

> - I wonder what effect this will have on internal-network traffic. With a<br>

> replica-stored object, the proxy opens one connection to an object server,<br>

> sends a request, gets a response, and streams the bytes out to the client.<br>

><br>

> With an EC-stored object, the proxy has to open connections to, say, 10<br>

> different object servers. Further, if one of the data blocks is unavailable<br>

> (say data block 5), then the proxy has to go ahead and re-request all the<br>

> data blocks plus a parity block so that it can fill in the gaps. That may be<br>

> a significant increase in traffic on Swift's internal network. Further, by<br>

> using such a large number of connections, it considerably increases the<br>

> probability of a connection failure, which would mean more client requests<br>

> would fail with truncated downloads.<br>

><br>

><br>

> Those are all the thoughts I have right now that are coherent enough to put<br>

> into text. Clearly, adding erasure coding (or any other form of tiered<br>

> storage) to Swift is not something undertaken lightly.<br>

><br>

> Hope this helps.<br>

><br>

><br>

> _______________________________________________<br>

> Mailing list: <a href="https://launchpad.net/~openstack" target="_blank">https://launchpad.net/~openstack</a><br>

> Post to     : <a href="mailto:openstack@lists.launchpad.net">openstack@lists.launchpad.net</a><br>

> Unsubscribe : <a href="https://launchpad.net/~openstack" target="_blank">https://launchpad.net/~openstack</a><br>

> More help   : <a href="https://help.launchpad.net/ListHelp" target="_blank">https://help.launchpad.net/ListHelp</a><br>

<br>

<br>

<br>

</div></div><span class="HOEnZb"><font color="#888888">--<br>

Eugene Kirpichov<br>

<a href="http://www.linkedin.com/in/eugenekirpichov" target="_blank">http://www.linkedin.com/in/eugenekirpichov</a><br>


</font></span><div class="HOEnZb"><div class="h5"><br>


</div></div></blockquote></div><br>