[Openstack] [Swift] Object replication...

John Dickinson me at not.mn
Tue Dec 13 17:16:10 UTC 2016



On 13 Dec 2016, at 0:21, Shyam Prasad N wrote:

> Hi,
>
> I have an openstack swift cluster with 2 nodes, and a replication count of
> 2.
> So, theoretically, during a PUT request, both replicas are updated
> synchronously, and only then will the request return a success. Please
> correct me if I'm wrong on this.
>
> I have a script that periodically does a PUT to a small object with some
> random data, and then immediately GETs the object. On some occasions, I'm
> getting older data during the GET.
>
> Is my expectation above correct? Or is there some other setting needed to
> make the replication synchronous?


This is an interesting case of both Swift and your expectations being correct. But wait! How can that be when they seem to be at odds? Therein lies the fun* of Distributed Systems. Yay.

(*actually, not that much fun)

Ok, here's how it works. I'm assuming you have more than one hard drive on each of your two servers. When Swift gets the PUT request, the proxy will determine where the object data is supposed to be in the cluster. It does this via hashing and ring lookups (this is deterministic, but the details of that process aren't important here). The proxy will look for <replica count> places to put the data. In your case, this is 2. Because of the way the ring works, it will look for one drive on each of your two servers first. It will not put the data on two drives on one server. So in the Happy Path, the client makes a PUT request, the proxy sends the data to both replicas, and after both have been fsync'd, the client gets a 201 Created response. [1]
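To make that placement step a bit more concrete, here's a rough sketch of a deterministic hash-based lookup over a made-up two-server layout. This is not Swift's actual ring code (the real ring uses partitions, weights, and zones), and the server/device names are just placeholders:

    import hashlib

    # Hypothetical cluster layout: two servers with two data drives each.
    DEVICES = [
        ("server1", "sdb"), ("server1", "sdc"),
        ("server2", "sdb"), ("server2", "sdc"),
    ]
    REPLICA_COUNT = 2

    def primary_devices(account, container, obj):
        """Deterministically pick REPLICA_COUNT drives, preferring distinct servers."""
        key = "/".join([account, container, obj]).encode("utf-8")
        start = int(hashlib.md5(key).hexdigest(), 16) % len(DEVICES)
        chosen, servers_used = [], set()
        # Walk the device list from the hashed starting point, one server at a
        # time, so both copies never land on the same box.
        for offset in range(len(DEVICES)):
            server, device = DEVICES[(start + offset) % len(DEVICES)]
            if server not in servers_used:
                chosen.append((server, device))
                servers_used.add(server)
            if len(chosen) == REPLICA_COUNT:
                break
        return chosen

    # The same object name always maps to the same two drives.
    print(primary_devices("AUTH_test", "photos", "cat.jpg"))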

This is all well and good, and the best part is that Swift can guarantee read-your-creates: when you create a new object, you can immediately read it back. However, what you describe is slightly different. You're overwriting an existing object, and sometimes you're getting back the older version of the object on a subsequent read. This is normal and expected. Read on for why.

The above process is the Happy Path for when there are no failures in the system. A failure could be a hardware failure, but it could also be some part of the system being overloaded. Spinning drives have very real physical limits to the amount of data they can read and write per unit time. An overloaded hard drive can cause a read or write request to time out, thus becoming a "failure" in the cluster.

So when you overwrite an object in Swift, the exact same process happens: the proxy finds the right locations, sends the data to all those locations, and returns a success if a quorum successfully fsync'd the data to disk.

However, what happens if there's a failure?

When the proxy determines the correct location for the object, it chooses what we call "primary" nodes. These are the canonical locations where the data is supposed to be right now. All the other drives in the cluster are called "handoff" nodes. For a given object, some nodes (<replica count> of them, to be exact) are primary nodes, and all the rest in the cluster are handoffs. For another object, a different set of nodes will be primary, and again all the rest are handoffs. This works the same regardless of how many replicas you're using or how many drives you have in the cluster.
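If you want to see the primary/handoff split for your own cluster, the ring exposes it directly. Something like the following should work from a node that has the built object ring (the /etc/swift path and the account/container/object names here are just assumptions):

    from swift.common.ring import Ring

    # Assumes a built object ring at /etc/swift/object.ring.gz.
    ring = Ring("/etc/swift", ring_name="object")
    partition, primaries = ring.get_nodes("AUTH_test", "photos", "cat.jpg")
    handoffs = ring.get_more_nodes(partition)  # generator of handoff nodes, in order

    print([(node["ip"], node["device"]) for node in primaries])
    print([(node["ip"], node["device"]) for node in list(handoffs)[:2]])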

So when there's a failure in the cluster and a write request comes in, what happens? Again, the proxy finds the primary nodes for the object and tries to connect to them. However, if one (or more) can't be reached, the proxy will start trying to connect to handoff nodes. Once the proxy has <replica count> successful connections, it sends the data to those storage nodes, and the client gets a successful response code as long as at least a quorum of them fsync'd the data. Note that in your case with two replicas, if a primary node is extra busy (e.g. serving other requests) or actually failing (drives do that; pretty often, in fact), the proxy will choose a handoff location to write the data. This means that even when the cluster has issues, your writes are still durably stored.[2]
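Here's a toy model of that fallback logic, again not the real proxy code; connect() and write_and_fsync() are stand-ins the caller supplies, where returning None from connect() plays the role of a timeout:

    def proxy_put(obj_data, primaries, handoffs, quorum, connect, write_and_fsync):
        """Return 201 if a quorum of storage nodes durably stored obj_data."""
        handoff_iter = iter(handoffs)
        connections = []
        for node in primaries:                 # one connection attempt per replica
            conn = connect(node)               # None stands for a timeout/failure
            while conn is None:                # fall back to handoff locations
                try:
                    conn = connect(next(handoff_iter))
                except StopIteration:
                    break
            if conn is not None:
                connections.append(conn)
        stored = sum(1 for conn in connections if write_and_fsync(conn, obj_data))
        return 201 if stored >= quorum else 503

    # Toy run: one primary is "busy", so the proxy writes to a handoff instead.
    connect = lambda node: None if node == "busy-primary" else node
    write_and_fsync = lambda conn, data: True
    print(proxy_put(b"data", ["busy-primary", "server2/sdb"], ["server1/sdc"],
                    quorum=1, connect=connect, write_and_fsync=write_and_fsync))  # 201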

The read request path is very similar: the primary nodes are found, one is selected at random, and if the data is there, it's returned. If the data isn't there, the next primary is tried, and so on.
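In the same toy style, the read path looks roughly like this (stored is a made-up map of node to whatever copy it holds):

    import random

    def proxy_get(primaries, stored):
        """Return the first copy found, trying primaries in random order."""
        nodes = list(primaries)
        random.shuffle(nodes)              # one primary is selected at random
        for node in nodes:
            if node in stored:             # the data is there: return it
                return stored[node]
        return None                        # no primary had it (a 404, roughly)

    # Toy run: only one of the two primaries currently holds the object.
    print(proxy_get(["server1/sdb", "server2/sdc"], {"server2/sdc": b"object bytes"}))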

Ok, we're finally able to get down to answering your question.

Let's assume you have a busy drive in the cluster. You (over)write your object, the proxy looks up the primary nodes, sees that one is busy (i.e. gets a timeout), chooses a handoff location, writes the data, and you get a 201 response. Since this is an overwrite, you've got an old version of the object on one primary, a new version on another primary, and a new version on a handoff node. Now you do the immediate GET. The proxy finds the primary nodes, randomly chooses one, and oh no! it chose the one with the old data. Since there's data there, that version of the object gets returned to the client, and you see the older version of the data. Eventually, the background replication daemon will ensure that the old version on the primary is updated and the new version on the handoff is removed.
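You can replay that scenario in a few lines; the node names and version labels below are made up, but the random choice among primaries is the whole point:

    import random

    # One primary still has the old version; the other primary and a handoff
    # have the new one. The proxy reads from a randomly chosen primary and
    # doesn't consult handoffs while a primary can answer.
    primaries = {"server1/sdb": "old-version", "server2/sdc": "new-version"}
    handoffs = {"server1/sdc": "new-version"}  # ignored by the happy-path GET

    reads = [primaries[random.choice(list(primaries))] for _ in range(10)]
    print(reads)  # a mix of old and new until replication reconciles the copies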

The way all of this works is a deliberate design choice in Swift. It means that when there are failures in the cluster, Swift will continue to provide durability and availability for your data, despite sometimes providing a stale view of what's there. The technical term for this is "eventual consistency". The other model you can have in distributed systems is called "strong consistency". With strong consistency, you get durability but not availability (i.e. you'd get an error instead of a success in the above scenario). However, you also won't get stale data. Neither is particularly better than the other; it really comes down to the use case. In general, eventually consistent systems can scale larger than strongly consistent systems, and eventually consistent systems can more easily provide geographic dispersion of your data (which Swift can do). Basically, the difference comes down to "what happens when there are failures in the system?".

I hope that helps. Please ask if you have other questions.

[1] To generalize this to *any* number of replicas, the proxy concurrently writes to <replica count> places and returns a success after a quorum have successfully been written. It's important to note that all <replica count> replicas are written at the time of the request; it's *not* a write-once-and-replicate-later process. A quorum is determined by the replica count: it's half plus one for odd replica counts and half for even replica counts. So in your case with 2 replicas, a quorum is 1. If you had 3 replicas, the quorum would be 2.
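Written out as code, that quorum rule is just the following (a simplified formulation; the exact function in your Swift version may be expressed differently):

    def quorum_size(replicas):
        # half plus one for odd replica counts, half for even counts
        return replicas // 2 + 1 if replicas % 2 else replicas // 2

    assert quorum_size(2) == 1   # the two-replica case in this thread
    assert quorum_size(3) == 2
    assert quorum_size(4) == 2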

[2] Thinking about this more, you'll see that this leads to a really cool scaling property of Swift: the system gets better (wrt performance, durability, and availability) as it gets bigger.


--John



>
> -- 
> -Shyam