[Openstack] ask for comments - Light weight Erasure code framework for swift

Samuel Merritt sam at swiftstack.com
Wed Oct 17 23:24:04 UTC 2012


On 10/15/12 5:36 PM, Duan, Jiangang wrote:
> Some of our customers are more interested in erasure coding than tri-replication as a way to save disk space.
> We propose a BP, "Light weight Erasure code framework for swift", which can be found here: https://blueprints.launchpad.net/swift/+spec/swift-ec
> The general idea is to have a daemon on each storage node do an offline scan and select objects big enough to be worth erasure coding.
>
> We would be glad to hear any feedback on this.

Here, in no particular order, are some thoughts I have.

- Object blocks (both data blocks and parity blocks) will need to be 
marked somehow so that 3 replicas of each block aren't kept. This is a 
pretty fundamental change to Swift; up until now, all objects have been 
treated the same. It's essentially introducing the notion of tiered 
storage into Swift.

- Who's responsible for ensuring the presence of all the blocks? That 
is, assume you have an object that's been split into 10 data blocks 
(D1, D2, ..., D10) and 2 parity blocks (P1, P2). The drive with D7 on it 
dies. Which replicator(s) are responsible for rebuilding D7 and storing 
it on a handoff node?

If you have the replicators on each block's machine checking for 
failures, then you'll wind up with more processes checking each stored 
piece. Here, it would be 11 replicators ensuring that each block is 
present, compared to the full-replication case, where only 2 replicators 
check on each copy. That's going to result in more traffic on the 
internal network.
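
To put rough numbers on that scaling, here's a quick back-of-the-envelope 
sketch (plain Python; the replica count and coding parameters are just 
the example values from above, not anything Swift defines):

    # With full replication, each of the r copies is checked by the
    # other r - 1 replicators. With k:m erasure coding, each of the
    # k + m blocks is checked by the other k + m - 1 replicators.

    def checkers_per_copy(replica_count):
        return replica_count - 1

    def checkers_per_block(data_blocks, parity_blocks):
        return data_blocks + parity_blocks - 1

    print(checkers_per_copy(3))        # 2 replicators watching each replica
    print(checkers_per_block(10, 2))   # 11 replicators watching each block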

- There will need to be throttles on the transformation daemons (replica 
-> EC and vice versa), as that work is very I/O-intensive. If a big 
batch of data is uploaded at one time and then not accessed (think large 
backups), that's a ticking time bomb for my cluster's performance: after 
those objects become "cold", the transformation daemons will thrash my 
disks and network turning them into EC-stored objects.
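
The throttle itself could be as simple as sleeping between objects to 
cap the daemon's read rate. A minimal sketch, assuming a hypothetical 
iterator of objects with a size attribute (none of these names are 
existing Swift interfaces):

    import time

    def throttled(objects, max_bytes_per_sec=50 * 2**20):
        # Yield objects no faster than max_bytes_per_sec so the
        # replica -> EC transformation can't saturate the disks.
        start = time.time()
        processed = 0
        for obj in objects:
            yield obj
            processed += obj.size
            min_elapsed = processed / float(max_bytes_per_sec)
            sleep_for = min_elapsed - (time.time() - start)
            if sleep_for > 0:
                time.sleep(sleep_for)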

- Does this open up a Swift cluster to a DoS attack? If my objects are 
stored w/EC, then can someone go through and request a few bytes from 
each object in my cluster a few times and cause all my objects to get 
"hot"? Under the proposed scheme, this would turn my objects from 
EC-storage to replica-storage, filling up my disks and killing my 
cluster. To mitigate that, I'd have to keep enough disk around to hold 3 
replicas of everything, and at that point, I may as well just keep the 3 
replicas.

- Another thought for a resource-consumption attack: can someone slowly 
walk my objects and make a large fraction (say, 5%) of them hot each 
day? That seems like it would make the transformation daemons run at 
maximum capacity all the time trying to keep up.
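
As a rough illustration of the load that implies (all of these numbers 
are made up):

    # Hypothetical cluster: 1 PB of EC-stored data, 5% of it made "hot"
    # per day by a slow walk of the objects.
    cluster_bytes = 10**15
    hot_fraction = 0.05
    seconds_per_day = 86400

    # Bytes/second the transformation daemons must re-replicate just to
    # keep up -- before counting the replica -> EC conversion once those
    # objects go cold again.
    print(cluster_bytes * hot_fraction / seconds_per_day)  # ~5.8e8, i.e. ~580 MB/s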

- Retrieval of EC-stored objects becomes more failure-prone. With 
replica-stored objects, 1 out of 3 object servers has to be available 
for a GET request to work. With EC-stored objects and a 10:2 coding, 10 
out of 12 object servers have to be available. That makes network 
partitions much worse for data availability.
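
To get a feel for the numbers: if each object server is independently 
reachable with probability p, a replicated GET needs at least 1 of 3 
servers up, while a 10:2 EC GET needs at least 10 of 12. A quick sketch 
(the independence assumption and the 99% figure are mine):

    from math import comb  # Python 3.8+

    def at_least(n, k, p):
        # Probability that at least k of n independent servers, each up
        # with probability p, are available.
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(k, n + 1))

    p = 0.99
    print(at_least(3, 1, p))    # replicated: ~0.999999
    print(at_least(12, 10, p))  # 10:2 EC:    ~0.9998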

- EC-storage is at odds with geographic replication. Of course, Swift 
supports neither one today. However, with geographic replication, one 
wants to have a local replica of each object in each geographic 
region, which results in more copies for lower latency. With EC-storage, 
less data is stored. When they're combined, the result is a whole lot of 
traffic across slow, expensive WAN links.

- Recombining EC-stored object chunks is going to chew up a ton more CPU 
on either the object or proxy servers, depending on which one does it. 
If the proxy, then it'll add more to an already CPU-heavy workload. If 
the object server, then it'll make using big storage boxes less 
practical (like one of the 48-drives-in-4U servers one can buy).

- Can one change the EC-coding level? That is, if I'm using 10:2 coding 
(so each object turns into 10 data blocks and 2 parity blocks), can I 
change that later? Will that have massive performance impacts on my 
cluster as more data blocks are computed?

It may be that this is like changing the replica count, and the answer 
is "yes, but your cluster will thrash for a long time after you do it".

- Where's the original checksum stored? Clearly, each block will have 
its own checksum for the auditors to use. However, if a client issues a 
request like "HEAD /a/c/o", the response has to contain the checksum 
(ETag) of the original object. Does that live somewhere, or will the 
proxy have to read all the bytes and recompute the checksum?
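
If it doesn't live anywhere, then answering that HEAD means streaming 
every data block through the hash. A sketch of what the proxy would end 
up doing (the block iterator here is hypothetical; Swift's ETags are MD5):

    import hashlib

    def whole_object_etag(data_blocks_in_order):
        # Recompute the original object's MD5 by streaming all of the
        # data blocks through the hash -- i.e. read every byte of the
        # object just to answer a HEAD request.
        md5 = hashlib.md5()
        for chunk in data_blocks_in_order:
            md5.update(chunk)
        return md5.hexdigest()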

- I wonder what effect this will have on internal-network traffic. With 
a replica-stored object, the proxy opens one connection to an object 
server, sends a request, gets a response, and streams the bytes out to 
the client.

With an EC-stored object, the proxy has to open connections to, say, 10 
different object servers. Further, if one of the data blocks is 
unavailable (say data block 5), then the proxy has to go ahead and 
re-request all the data blocks plus a parity block so that it can fill 
in the gaps. That may be a significant increase in traffic on Swift's 
internal network. Further, by using such a large number of connections, 
it considerably increases the probability of a connection failure, which 
would mean more client requests would fail with truncated downloads.
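
To put a number on the connection-count point: if each proxy-to-object-
server connection independently fails mid-stream with some small 
probability q, the chance that the whole download gets through drops as 
the fan-out grows (the 0.1% figure is just an illustration, and this 
ignores any mid-stream recovery the proxy might attempt):

    def download_survives(connections, q=0.001):
        # Probability that none of the simultaneous backend connections
        # fails mid-stream, assuming independent failures.
        return (1 - q) ** connections

    print(download_survives(1))   # replicated GET: ~0.999
    print(download_survives(10))  # 10:2 EC GET:    ~0.990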


Those are all the thoughts I have right now that are coherent enough to 
put into text. Clearly, adding erasure coding (or any other form of 
tiered storage) to Swift is not something undertaken lightly.

Hope this helps.



