John Dickinson me at not.mn
Wed Aug 20 04:40:11 UTC 2014

Not quite. Let's walk through an example:

I have a small ring:

$ swift-ring-builder ./object.builder
./object.builder, build version 4
64 partitions, 3.000000 replicas, 1 regions, 2 zones, 4 devices, 0.00 balance
The minimum number of hours before a partition can be reassigned is 0
Devices:    id  region  zone      ip address  port  replication ip  replication port      name weight partitions balance meta
             0       1     1                  6010                              6010        d1   1.00         48    0.00
             1       1     1                  6020                              6020        d2   1.00         48    0.00
             2       1     2                  6030                              6030        d3   1.00         48    0.00
             3       1     2                  6040                              6040        d4   1.00         48    0.00

4 devices, 3 replicas, part power of 6. The part power of 6 means that I have 64 possible partitions. The part power is simply the number of prefix bits of the result of a call to md5(). Hash something, and take the first 6 bits and that's the partition it's in. Because of the way md5 works, you get a nice splaying across the 64 partitions.

Now, with 64 partitions and 3 replicas, I have 192 total partition replicas to place on the four devices. Since all devices are weighted evenly ("1.00" in the example), I end up with an even placement and 48 partitions assigned to each drive (2**6*3/4=48). Now you've got a balanced ring and each partition (ie of the 64 partitions) is placed on 3 drives. For more details on that, see the earlier referenced video.
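
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the ideal share per device; the function name and its arguments are made up for this example and are not part of Swift.

```python
def replicas_per_device(part_power, replicas, weights):
    """Ideal number of partition replicas per device, proportional to weight.

    Hypothetical helper for illustration only; not a Swift API.
    """
    total = 2 ** part_power * replicas  # e.g. 64 partitions * 3 replicas = 192
    total_weight = sum(weights)
    return [total * w / total_weight for w in weights]


# Four devices, all weighted 1.00, as in the ring above.
shares = replicas_per_device(part_power=6, replicas=3, weights=[1.0] * 4)
print(shares)  # [48.0, 48.0, 48.0, 48.0]
```

With uneven weights the same calculation gives each drive a share proportional to its weight.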

Suppose I have a Swift account called APP_awesome. (Remember that Swift's accounts are storage areas, not necessarily 1:1 with user identities.) In that account, I want to put things, so I create a container called "things". Now I have a place to put all of my awesome things. The first awesome thing I want to put is backup.tgz. Where will it go in the cluster?

$ swift-get-nodes object.ring.gz APP_awesome/things/backup.tgz

Account  	APP_awesome
Container	things
Object   	backup.tgz

Partition	51
Hash     	cc4e888bfad168f782897e32a892c4ef

Server:Port Device d1
Server:Port Device d4
Server:Port Device d2
Server:Port Device d3	 [Handoff]

How did `swift-get-nodes` find the partition? First, it took the full object path ("/APP_awesome/things/backup.tgz", with a leading slash), then added in the secret prefix and suffix from swift.conf (basically just salts to prevent attackers from filling up one partition), and then hashed that with md5. The resulting hash, in hex, is "cc4e888bfad168f782897e32a892c4ef". The first 4 bytes of the raw digest (a 16 byte string) are unpacked as a big-endian unsigned int, which is then right-shifted by 26 (ie 32-6) so we get the first 6 bits. The resulting number is the partition. In this case, "51".

>>> import struct
>>> from hashlib import md5
>>> key = md5(prefix + '/APP_awesome/things/backup.tgz' + suffix).digest()
>>> struct.unpack_from('>I', key)[0] >> 26
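
Since the prefix and suffix are secret, per-cluster values, here is a self-contained sketch that starts from the hex digest printed by swift-get-nodes instead of re-hashing the name. The function name is just for illustration (it is not a Swift API), and the code is written for Python 3.

```python
import struct

PART_POWER = 6
PART_SHIFT = 32 - PART_POWER  # 26 bits to throw away


def partition_from_digest(hex_digest, part_shift=PART_SHIFT):
    """Return the partition for an md5 hex digest (illustrative helper)."""
    raw = bytes.fromhex(hex_digest)  # the 16 byte raw digest
    # '>I' reads the first 4 bytes as one big-endian unsigned 32-bit int.
    return struct.unpack_from('>I', raw)[0] >> part_shift


part = partition_from_digest('cc4e888bfad168f782897e32a892c4ef')
print(part)  # 51
```

The first byte of the digest is 0xcc (binary 11001100), and its top 6 bits are 110011, which is 51.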

The resulting partition is the index in an array (serialized in object.ring.gz). The value at that index is the 3 nodes (ie drives) that are responsible for storing data at that partition. Find the IP, port, and mount point (name) for those drives, and you're ready to read or write the data.
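
As a rough mental model of that lookup, here is a toy version in Python. It is a simplified sketch: the table layout (one row per replica, one entry per partition) is my simplification of the ring, and the placement used to fill the table is a made-up round robin that ignores zones, NOT Swift's real placement algorithm. The device names and ports come from the ring above; everything else is hypothetical.

```python
from collections import Counter

# Devices from the example ring (names and ports as listed above).
devices = [
    {'id': 0, 'zone': 1, 'port': 6010, 'device': 'd1'},
    {'id': 1, 'zone': 1, 'port': 6020, 'device': 'd2'},
    {'id': 2, 'zone': 2, 'port': 6030, 'device': 'd3'},
    {'id': 3, 'zone': 2, 'port': 6040, 'device': 'd4'},
]
PARTITIONS = 64
REPLICAS = 3

# One row per replica, one device id per partition.
# ASSUMPTION: round-robin placement, purely for illustration.
replica2part2dev = [
    [(part + replica) % len(devices) for part in range(PARTITIONS)]
    for replica in range(REPLICAS)
]


def get_nodes(part):
    """Return the devices holding each replica of a partition."""
    return [devices[row[part]] for row in replica2part2dev]


nodes = get_nodes(51)
print([(n['device'], n['port']) for n in nodes])

# Every device ends up with the same number of partition replicas.
counts = Counter(dev_id for row in replica2part2dev for dev_id in row)
print(counts)
```

The point is only that the lookup is an array index followed by a device table lookup; the devices this toy table returns for partition 51 are not meant to match the real swift-get-nodes output.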

As a clarification point, the "Handoff" node listed above is where the data will go if one of the primary drives fails. There is only one handoff because there are only 4 total drives in this ring.


A few things to note here. First, the size of the object has nothing to do with the resulting location. The more objects you store, the more evenly your drives will fill up (because md5 has good, even splaying). Second, the cost of doing all this computation is basically the cost of hashing the object name, and we know that (1) the name is bounded in length and (2) md5 is fast (enough). Therefore, ring lookups are cheap, and we don't have to read all the object data into memory before finding where it lives in the cluster.
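
If you want to see the even spread for yourself, a quick sketch like the following hashes a batch of made-up object names and counts how many land in each of the 64 partitions. The object names are hypothetical, and no prefix/suffix salt is used here (a real cluster adds them from swift.conf), so the partition numbers will not match a real ring.

```python
from collections import Counter
from hashlib import md5
import struct

PART_SHIFT = 32 - 6  # part power of 6


def partition_for(name):
    """Partition for a name, without the swift.conf salts (illustration)."""
    digest = md5(name.encode('utf-8')).digest()
    return struct.unpack_from('>I', digest)[0] >> PART_SHIFT


# 64000 hypothetical object names; only the name is hashed, never the data.
counts = Counter(
    partition_for('/APP_awesome/things/object-%d' % i) for i in range(64000)
)
print(len(counts), min(counts.values()), max(counts.values()))
```

With 64000 names you would expect roughly 1000 per partition, with the smallest and largest counts fairly close to that.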

Now, let's move back to your original numbers. You have 50 drives. The weight doesn't matter at this point, but the best-practice guideline is to set the weight to the number of GB on the drive (eg 3TB == 3000).

You want about 100 partitions per drive, so we need to find a part power that gives us that.

Find the smallest x such that (2**x * 3) / 50 > 100.

2**x > 100 * 50 / 3 = 1666.67
math.log(1666.67, 2) = 10.703

Therefore use a part power of 11. That gives 2**11 = 2048 partitions, or about 123 partition replicas per drive (2048 * 3 / 50). And once you have the 50 devices added to the ring, Swift will go through the math above to find the proper placement of each object.
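
The same search can be written as a short loop, which avoids rounding a floating point log by hand. This is only a sketch; the function name and arguments are made up for this example.

```python
def part_power_for(drives, replicas, parts_per_drive):
    """Smallest x such that (2**x * replicas) / drives > parts_per_drive.

    Hypothetical helper for illustration only; not a Swift API.
    """
    power = 0
    while (2 ** power) * replicas / drives <= parts_per_drive:
        power += 1
    return power


power = part_power_for(drives=50, replicas=3, parts_per_drive=100)
per_drive = 2 ** power * 3 / 50
print(power, per_drive)
```

Keep in mind that the part power is fixed when the ring is created, so it is worth sizing it for the number of drives you expect to have, not just the number you start with.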

Let me know if you have further questions.


On Aug 19, 2014, at 7:15 PM, Brent Troge <brenttroge2016 at gmail.com> wrote:

> Yeah, I have watched that multiple times over the weekend, and it has helped very much.
> So with respect to my example numbers, I am guessing that each partition will land on every '41538374868278621028243970633760768' of the md5 space.
> 2^(128 - 13)
> or 
> 2^(128)/8192
> Thanks! 
> On Tue, Aug 19, 2014 at 8:00 PM, John Dickinson <me at not.mn> wrote:
> https://swiftstack.com/blog/2012/11/21/how-the-ring-works-in-openstack-swift/ is something that should be able to give you a pretty complete overview of how the ring works in Swift and how data placement works.
> Let me know if you have more questions after you watch that video.
> --John
> On Aug 19, 2014, at 5:34 PM, Brent Troge <brenttroge2016 at gmail.com> wrote:
> >
> > Excuse this question and for lack of basic understanding. I dropped from school at 8th grade, so everything is basically self taught. Here goes.
> >
> > I am trying to figure out where each offset/partition is placed on the ring.
> >
> >
> > So If I have 50 drives with a weight of 100 each I come up with the below part power
> >
> > part power = log2(50 * 100) = 13
> >
> > Using that I then come up with the amount of partitions.
> >
> > partitions = 2^13 =  8192
> >
> > Now here is where my ignorance comes into play. How do I use these datapoints to determine where each offset is on the ring?
> >
> > I then guess that for each offset they will have a fixed range of values that map to that partition.
> >
> > So for example, for offset 1, all object URL md5 hashes that have a decimal value of 0 through 100 will go here(i just made up the range 0 through 100, i have no idea what the range would be with respect to my given part-power, drive, etc).
