[Openstack] [Swift] Deciding on EC fragment config

John Dickinson me at not.mn
Wed Apr 4 20:25:38 UTC 2018


The answer always starts with "it depends...". It depends on your hardware, where it's physically located, the durability you need, your access patterns, and so on.

There have been whole PhD dissertations on the right way to calculate durability. Two parity segments aren't exactly equivalent to three replicas, because in the EC case you also have to figure out the chance of failing to retrieve enough of the remaining fragments to satisfy a read request[1].
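
To get a feel for why, here's a toy back-of-envelope model in Python. It's purely illustrative: it assumes independent fragment failures with a made-up failure probability and ignores repair times, correlated failures, and failure domains, so don't read real durability numbers out of it.

    # Toy model: data is unrecoverable if every replica fails (replication)
    # or if more than m of the k+m fragments fail (EC).
    from math import comb

    def p_loss_replica(copies, p):
        # lose data only if all copies fail
        return p ** copies

    def p_loss_ec(k, m, p):
        # lose data if fewer than k of the k+m fragments survive
        n = k + m
        return sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(m + 1, n + 1))

    p = 0.01  # made-up per-device failure probability
    print(p_loss_replica(3, p))  # 3 replicas
    print(p_loss_ec(2, 2, p))    # 2+2
    print(p_loss_ec(10, 2, p))   # 10+2: same parity count, different risk
    print(p_loss_ec(10, 4, p))   # 10+4

Even in this crude model, 2+2 and 10+2 come out quite different from each other and from three replicas, which is the point of [1] below.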

In your case, using 3 or 4 parity fragments will probably get you better durability and availability than a 3x replica system and still use less overall drive space[2]. My company's product has three "canned" EC policy settings to make it simpler for customers to choose. We've got 4+3, 8+4, and 15+4 settings, and we steer people to one of them based on how many servers are in their cluster.
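
To illustrate the space side of that claim, here's a bit of simple arithmetic (nothing Swift-specific; I'm treating 3x replication as k=1, m=2 so the same formula applies to both):

    # Bytes on disk per byte stored, and how many lost fragments/copies
    # each layout can tolerate before the data becomes unreadable.
    schemes = {"3x replica": (1, 2), "4+3": (4, 3),
               "8+4": (8, 4), "15+4": (15, 4)}
    for name, (k, m) in schemes.items():
        overhead = (k + m) / k
        print(f"{name:10s}  {overhead:.2f}x on disk, survives losing {m}")

Even the heaviest of those canned settings, 4+3, uses a lot less disk than 3x replicas while tolerating one more loss.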

Note that there's nothing special about the m=4 examples in Swift's docs, at least in the sense of recommending 4 parity fragments as better than 3 or 5 (or any other number).

In your case, you'll want to take into account how many drives you can lose and how many servers you can lose. Suppose you have a 10+4 scheme and two servers with 12 drives each. You'll be able to lose 4 drives, yes, but if either server goes down you won't be able to access your data, because each server will hold 7 fragments (on seven disks). However, if you had 6 servers with 4 drives each, for the same total of 24 drives, you could still lose four drives, but you could also lose up to two servers and still be able to read your data[3].
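
Here's a rough sketch of that placement arithmetic. It just spreads the fragments across servers as evenly as possible, which is approximately what the ring aims for; real placement also involves regions, zones, and device weights, and the helper name is just mine.

    # Worst case: how many whole servers can you lose and still have at
    # least k fragments left to serve a read?
    def worst_case_server_losses(k, m, n_servers):
        n_frags = k + m
        base, extra = divmod(n_frags, n_servers)   # spread as evenly as possible
        per_server = sorted([base + 1] * extra + [base] * (n_servers - extra),
                            reverse=True)
        remaining, lost = n_frags, 0
        for frags in per_server:                   # lose the fullest servers first
            if remaining - frags < k:
                break
            remaining -= frags
            lost += 1
        return lost

    print(worst_case_server_losses(10, 4, 2))  # 2 servers x 12 drives -> 0
    print(worst_case_server_losses(10, 4, 6))  # 6 servers x 4 drives  -> 1

Note that this gives the guaranteed (worst-case) number: with 6 servers you can always survive one server loss, and surviving two depends on which two you lose, as footnote [3] explains.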

Another consideration is how much overhead you want to have. Increasing the number of data fragments lowers the storage overhead, while increasing the number of parity fragments improves your durability and availability (up to the limits of your physical hardware failure domains).

Finally, and probably most simply, you'll want to take into account the increased CPU and network cost of a particular EC scheme. A 3x replica write needs 3 network connections, and a read needs 1. For an EC policy, a write needs k+m connections and a read needs k. If you're using something really large like an 18+3 scheme, you're looking at 7x the network connections of a 3x replica policy on writes. The increased socket management and packet shuffling can add a significant burden to your proxy servers[4]. Good news on the CPU, though: the EC algorithms are old and well tuned, especially when using libraries like jerasure or ISA-L, and CPUs are really fast. Erasure code policies do not add significant overhead from the encode/decode steps.
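
To put numbers on the connection counts (simple arithmetic, ignoring handoffs, retries, and reconstruction traffic):

    # Proxy-to-object-server connections per client request.
    def ec_connections(k, m):
        return {"write": k + m, "read": k}

    replica = {"write": 3, "read": 1}
    ec = ec_connections(18, 3)
    print(ec)                              # {'write': 21, 'read': 18}
    print(ec["write"] / replica["write"])  # 7.0 -- the 7x write overhead above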

So, in summary: it's complicated, there isn't a "right" answer, and it depends a lot on everything else about your cluster. But you've got this! You'll do great; keep asking questions.

I hope all this helps.

--John



[1] At a high level, it's fairly intuitive that a 2+2 scheme is very different from a 10+2 scheme, even though they both have 2 parity segments and can survive the loss of any two segments.
[2] "probably", because it depends a lot on your specific situation.
[3] The fragments are distributed across the servers, so 14 fragments across 6 servers means that some servers have 2 fragments and some have 3. If you're "lucky", the two failed servers would each have 2 fragments, and you'd still be able to read your data.
[4] Similarly, the EC reconstructor needs to do much more work than the replicator when it discovers a missing fragment.






On 4 Apr 2018, at 2:12, Mark Kirkwood wrote:

> ...hearing crickets - come on guys, I know you have some thoughts about this :-) !
>
>
> On 29/03/18 13:08, Mark Kirkwood wrote:
>> Hi,
>>
>> We are looking at implementing EC Policies with similar durability to 3x
>> replication. Now naively this corresponds to m=2 (using notation from
>> previous thread). However we could take the opportunity to 'do better'
>> and use m=3 or 4. I note that m=4 seems to be used in some of the Swift
>> documentation. I'd love to get some guidance about how to decide on the
>> 'right amount' of parity!
>>
>> Cheers
>>
>> Mark
>>