<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Good day,<br>
<br>
<h1>Object ring's durability breach.</h1>
<h2>Introduction</h2>
I asked a friend whether he would rather find $100 on the street
once, or $1 one hundred times in a year. He answered that $100 would
be better.<br>
Then I asked whether he would rather lose $100 once, or $1 one
hundred times in a year. His answer was the opposite: losing $1
frequently is preferable.<br>
A similar trade-off appears in Swift rings when it comes to
durability.<br>
<br>
Here is the output for a sample ring with a partition power of 6
(2⁶ = 64 partitions), 3 replicas and 3 zones.<br>
<font color="#999999"><small># swift-ring-builder object.builder
create 6 3 1<br>
swift-ring-builder object.builder add z0-127.0.0.1:6001/srv1:1_
100<br>
swift-ring-builder object.builder add z0-127.0.0.1:6002/srv2:1_
100<br>
swift-ring-builder object.builder add z0-127.0.0.1:6003/srv3:1_
100<br>
swift-ring-builder object.builder add z0-127.0.0.1:6004/srv4:1_
100<br>
swift-ring-builder object.builder add z1-127.0.0.1:6005/srv5:1_
100<br>
swift-ring-builder object.builder add z1-127.0.0.1:6006/srv6:1_
100<br>
swift-ring-builder object.builder add z1-127.0.0.1:6007/srv7:1_
100<br>
swift-ring-builder object.builder add z1-127.0.0.1:6008/srv8:1_
100<br>
swift-ring-builder object.builder add z2-127.0.0.1:6009/srv9:1_
100<br>
swift-ring-builder object.builder add
z2-127.0.0.1:60010/srv10:1_ 100<br>
swift-ring-builder object.builder add
z2-127.0.0.1:60011/srv11:1_ 100<br>
swift-ring-builder object.builder add
z2-127.0.0.1:60012/srv12:1_ 100<br>
swift-ring-builder object.builder rebalance<br>
<br>
root@myhost:/etc/swift# swift-ring-builder object.builder<br>
object.builder, build version 12<br>
64 partitions, 3 replicas, 3 zones, 12 devices, 0.00 balance<br>
The minimum number of hours before a partition can be reassigned
is 1<br>
Devices: id zone ip address port name weight
partitions mirror part. balance mirror meta<br>
0 0 127.0.0.1 6001 srv1
100.00 16 0 0.00 1 <br>
1 0 127.0.0.1 6002 srv2
100.00 16 0 0.00 1 <br>
2 0 127.0.0.1 6003 srv3
100.00 16 0 0.00 1 <br>
3 0 127.0.0.1 6004 srv4
100.00 16 0 0.00 1 <br>
4 1 127.0.0.1 6005 srv5
100.00 16 0 0.00 1 <br>
5 1 127.0.0.1 6006 srv6
100.00 16 0 0.00 1 <br>
6 1 127.0.0.1 6007 srv7
100.00 16 0 0.00 1 <br>
7 1 127.0.0.1 6008 srv8
100.00 16 0 0.00 1 <br>
8 2 127.0.0.1 6009 srv9
100.00 16 0 0.00 1 <br>
9 2 127.0.0.1 60010 srv10
100.00 16 0 0.00 1 <br>
10 2 127.0.0.1 60011 srv11
100.00 16 0 0.00 1 <br>
11 2 127.0.0.1 60012 srv12
100.00 16 0 0.00 1 </small></font><br>
Zone 0:<br>
Server 1: [0, 7, 11, 15, 17, 23, 25, 28, 33, 37, 42, 47, 49, 54, 59,
60]<br>
Server 2: [1, 5, 8, 13, 18, 20, 26, 29, 35, 36, 40, 46, 48, 55, 57,
63]<br>
Server 3: [2, 4, 10, 12, 16, 22, 24, 31, 34, 38, 43, 45, 51, 52, 58,
61]<br>
Server 4: [3, 6, 9, 14, 19, 21, 27, 30, 32, 39, 41, 44, 50, 53, 56,
62]<br>
Zone 1:<br>
Server 5: [0, 6, 9, 12, 16, 20, 25, 28, 32, 38, 43, 45, 48, 54, 58,
61]<br>
Server 6: [1, 4, 8, 13, 18, 22, 26, 31, 34, 36, 42, 46, 50, 55, 59,
60]<br>
Server 7: [2, 7, 10, 14, 17, 23, 27, 29, 33, 39, 41, 44, 51, 52, 56,
63]<br>
Server 8: [3, 5, 11, 15, 19, 21, 24, 30, 35, 37, 40, 47, 49, 53, 57,
62]<br>
Zone 2:<br>
Server 9: [0, 6, 9, 13, 19, 22, 25, 30, 33, 36, 40, 46, 51, 53, 57,
62]<br>
Server 10: [1, 4, 10, 15, 17, 21, 27, 29, 32, 39, 43, 47, 50, 55,
58, 61]<br>
Server 11: [2, 5, 11, 14, 18, 23, 26, 31, 34, 37, 41, 44, 48, 54,
59, 63]<br>
Server 12: [3, 7, 8, 12, 16, 20, 24, 28, 35, 38, 42, 45, 49, 52, 56,
60]<br>
<br>
Each zone spreads the partitions over its servers in an effectively
random order, independently of the other zones, and this has a great
impact on durability. When data loss does happen, the lost piece is
small compared to the overall data (most likely 1 partition).<br>
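<br>
One way to see the effect is to count how many distinct groups of
three servers (one per zone) hold every replica of at least one
partition, because losing all servers of such a group loses data. The
following is a minimal Python sketch that uses the partition lists
from the ring above; RING and replica_groups are illustrative names,
not part of Swift.<br>
<pre>
```python
# Partition lists copied from the ring above: zone -> four servers -> partitions.
RING = {
    0: [
        [0, 7, 11, 15, 17, 23, 25, 28, 33, 37, 42, 47, 49, 54, 59, 60],
        [1, 5, 8, 13, 18, 20, 26, 29, 35, 36, 40, 46, 48, 55, 57, 63],
        [2, 4, 10, 12, 16, 22, 24, 31, 34, 38, 43, 45, 51, 52, 58, 61],
        [3, 6, 9, 14, 19, 21, 27, 30, 32, 39, 41, 44, 50, 53, 56, 62],
    ],
    1: [
        [0, 6, 9, 12, 16, 20, 25, 28, 32, 38, 43, 45, 48, 54, 58, 61],
        [1, 4, 8, 13, 18, 22, 26, 31, 34, 36, 42, 46, 50, 55, 59, 60],
        [2, 7, 10, 14, 17, 23, 27, 29, 33, 39, 41, 44, 51, 52, 56, 63],
        [3, 5, 11, 15, 19, 21, 24, 30, 35, 37, 40, 47, 49, 53, 57, 62],
    ],
    2: [
        [0, 6, 9, 13, 19, 22, 25, 30, 33, 36, 40, 46, 51, 53, 57, 62],
        [1, 4, 10, 15, 17, 21, 27, 29, 32, 39, 43, 47, 50, 55, 58, 61],
        [2, 5, 11, 14, 18, 23, 26, 31, 34, 37, 41, 44, 48, 54, 59, 63],
        [3, 7, 8, 12, 16, 20, 24, 28, 35, 38, 42, 45, 49, 52, 56, 60],
    ],
}


def replica_groups(ring):
    """Map each server group (one server index per zone) to its partitions."""
    location = {}  # partition -> server index in each zone, ordered by zone
    for zone in sorted(ring):
        for server, partitions in enumerate(ring[zone]):
            for part in partitions:
                location.setdefault(part, []).append(server)
    groups = {}
    for part, servers in location.items():
        groups.setdefault(tuple(servers), []).append(part)
    return groups


groups = replica_groups(RING)
print(len(groups), "of 64 possible server groups hold all replicas of a partition")
```
</pre>
For comparison, a layout in which every zone reuses the lists of Zone
0 yields only 4 such groups, one per server position.<br>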
<br>
A better ring layout looks like this:<br>
Zone 0:<br>
Server 1: [0, 7, 11, 15, 17, 23, 25, 28, 33, 37, 42, 47, 49, 54, 59,
60]<br>
Server 2: [1, 5, 8, 13, 18, 20, 26, 29, 35, 36, 40, 46, 48, 55, 57,
63]<br>
Server 3: [2, 4, 10, 12, 16, 22, 24, 31, 34, 38, 43, 45, 51, 52, 58,
61]<br>
Server 4: [3, 6, 9, 14, 19, 21, 27, 30, 32, 39, 41, 44, 50, 53, 56,
62]<br>
Zone 1:<br>
Server 5: [0, 7, 11, 15, 17, 23, 25, 28, 33, 37, 42, 47, 49, 54, 59,
60]<br>
Server 6: [1, 5, 8, 13, 18, 20, 26, 29, 35, 36, 40, 46, 48, 55, 57,
63]<br>
Server 7: [2, 4, 10, 12, 16, 22, 24, 31, 34, 38, 43, 45, 51, 52, 58,
61]<br>
Server 8: [3, 6, 9, 14, 19, 21, 27, 30, 32, 39, 41, 44, 50, 53, 56,
62]<br>
Zone 2:<br>
Server 9: [0, 7, 11, 15, 17, 23, 25, 28, 33, 37, 42, 47, 49, 54, 59,
60]<br>
Server 10: [1, 5, 8, 13, 18, 20, 26, 29, 35, 36, 40, 46, 48, 55, 57,
63]<br>
Server 11: [2, 4, 10, 12, 16, 22, 24, 31, 34, 38, 43, 45, 51, 52,
58, 61]<br>
Server 12: [3, 6, 9, 14, 19, 21, 27, 30, 32, 39, 41, 44, 50, 53, 56,
62]<br>
<br>
All partitions are aligned across servers: each server holds exactly
the same partitions as its counterparts in the other zones. In this
case the probability of losing data is much lower, but the impact is
bigger: we lose all 16 partitions.<br>
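<br>
As a rough sketch of how such an aligned assignment could be
produced, every zone can simply reuse one partition-to-server mapping.
This is not how swift-ring-builder works; the function name and the
modulo placement are assumptions made for illustration only.<br>
<pre>
```python
def aligned_layout(partitions, zones, servers_per_zone):
    """Place partition p on the server with the same index in every zone."""
    # Assumption: a simple modulo rule decides the server index of a partition.
    one_zone = [
        [p for p in range(partitions) if p % servers_per_zone == server]
        for server in range(servers_per_zone)
    ]
    # Every zone gets its own copy of the same server-to-partitions mapping.
    return {zone: [list(parts) for parts in one_zone] for zone in range(zones)}


layout = aligned_layout(partitions=64, zones=3, servers_per_zone=4)
for zone, servers in layout.items():
    print("Zone", zone, [len(parts) for parts in servers])
```
</pre>
With 64 partitions, 3 zones and 4 servers per zone, each server
receives 16 partitions, and server N of every zone holds the same
16.<br>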
<br>
Aligned partitions give considerably better durability than random
ones, as the next section demonstrates.<br>
<br>
In fact, RingBuilder._initial_balance() already assigns partitions
in an aligned way, but RingBuilder._reassign_parts() later destroys
that alignment.<br>
<br>
<h2>Demonstration of the difference between random and aligned
partitions.</h2>
Here is an example with 12 partitions, 3 zones and 4 servers per
zone (for simplicity).<br>
<br>
<img alt="12partitions"
src="cid:part1.06080901.00070504@nexenta.com" height="409"
width="690"><br>
<br>
Each partition is represented by a different color. The 12 servers
(disks) are spread across 3 zones. The left part of the picture shows
aligned partitions; the right part shows randomly assigned
partitions.<br>
<br>
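Suppose one disk fails in each zone. There are 4 * 4 * 4 = 64 equally
likely combinations of failed disks, and the result differs between
the left and right cases.<br>
On the left side (aligned partitions) data is lost only when the
three failed disks are counterparts of each other. Any particular
group of three disks fails with probability ¼ * ¼ * ¼ = 1/64, and
there are 4 aligned groups, so the probability of loss is 4/64.<br>
On the right side (randomly assigned partitions) each of the 12
partitions generally sits on its own group of three disks, so the
probability of loss is 12/64 (12 * ¼ * ¼ * ¼).<br>
The aligned example has the greater impact (always 3 lost
partitions). The random example will generally lose 1 partition, but
it will happen 3 times more often.<br>
<b>Overall, the left example (aligned) loses data 3 times less often
than the right example (random).<br>
</b> <br>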
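The counting above can be checked by brute force. The Python sketch
below is not taken from Swift. The two layouts are hypothetical
stand-ins for the picture: the RANDOM placement in particular is an
assumption, chosen so that no two partitions share the same three
disks. The name failure_stats is an illustrative helper. The code
walks through every way to lose one disk per zone and reports how
many of those combinations lose data and how many partitions are lost
in total.<br>
<pre>
```python
from itertools import product

# One entry per zone; each zone lists the partitions held by its four disks.
ALIGNED = [
    [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}, {9, 10, 11}],
] * 3
# Hypothetical random placement: every partition has its own disk triple.
RANDOM = [
    [{0, 1, 2}, {3, 4, 5}, {6, 7, 8}, {9, 10, 11}],
    [{0, 4, 8}, {1, 5, 9}, {2, 6, 10}, {3, 7, 11}],
    [{0, 7, 10}, {1, 4, 11}, {2, 5, 8}, {3, 6, 9}],
]


def failure_stats(layout):
    """Return (fatal combinations, all combinations, partitions lost in total)."""
    fatal = total = lost_sum = 0
    for failed in product(*layout):        # one failed disk per zone
        lost = set.intersection(*failed)   # partitions with no replica left
        total += 1
        fatal += bool(lost)
        lost_sum += len(lost)
    return fatal, total, lost_sum


for name, layout in (("aligned", ALIGNED), ("random", RANDOM)):
    fatal, total, lost_sum = failure_stats(layout)
    print(name, "fatal combinations:", fatal, "of", total,
          "| partitions lost over all combinations:", lost_sum)
```
</pre>
Comparing the first and the third value separates how often a failure
combination loses data from how much data is lost over all
combinations.<br>
<br>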
<h2>Summary.</h2>
The rebalancing algorithm must take the alignment of partitions
among servers into account. Without such an effort the durability of
the cloud degrades significantly and is not comparable to a regular
mirror with 3 copies.<br>
While zones aim to improve the availability of data, a new technique
may address the durability problem. It may need new terminology, such
as "alignment".<br>
<br>
<br>
Anatoly Legkodymov.<br>
</body>
</html>