[Openstack] [Swift] Object replication...

Shyam Prasad N nspmangalore at gmail.com
Wed Dec 14 05:28:29 UTC 2016


Thanks, John, for that excellent description. (Perhaps this should make it
into one of the FAQ pages.) :)

I didn't know that Swift was an eventually consistent object storage
system. With the (replica_count/2)+1 synchronous PUT model, I always
thought Swift was strictly consistent.

So, going by what you're saying, even with a replica count of 3 the storage
system will still be eventually consistent, not strictly consistent. I can
only improve my chances of getting consistent data by increasing the disk
count (so as to even out the load), but I cannot be absolutely certain.

Will I be able to somehow achieve a strict consistency model on a Swift
cluster, then? Will reducing the replica count to 1 help? Will that ensure
that every overwrite updates the object's mapped location? Or is there
still a chance that the write goes to a handoff when the mapped location is
busy, and a subsequent GET is served from an older version?

Please note that I'm okay with sacrificing high availability if that
ensures the data is strictly consistent.

Regards,
Shyam


On Dec 13, 2016 22:46, "John Dickinson" <me at not.mn> wrote:

On 13 Dec 2016, at 0:21, Shyam Prasad N wrote:

Hi,

I have an OpenStack Swift cluster with 2 nodes and a replication count of
2. So, theoretically, during a PUT request both replicas are updated
synchronously, and only then will the request return a success. Please
correct me if I'm wrong on this.

I have a script that periodically PUTs a small object with some random
data and then immediately GETs it back. On some occasions, I'm getting
older data back from the GET.
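
Roughly, the script does the following; the auth URL, credentials, and
container name here are just placeholders for my real ones:

    # Rough sketch of the test script. The auth URL, credentials, and
    # container name are placeholders; substitute your own.
    import os
    import time

    from swiftclient import client as swift_client

    conn = swift_client.Connection(
        authurl='http://proxy.example.com:8080/auth/v1.0',   # placeholder
        user='test:tester',                                   # placeholder
        key='testing')                                        # placeholder

    container = 'consistency-test'
    conn.put_container(container)

    while True:
        payload = os.urandom(64)              # small random object body
        conn.put_object(container, 'probe', contents=payload)
        headers, body = conn.get_object(container, 'probe')
        if body != payload:
            print('stale read: GET returned an older version')
        time.sleep(1)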

Is my expectation above correct? Or is there some other setting needed to
make the replication synchronous?

This is an interesting case of both Swift and your expectations being
correct. But wait! How can that be when they seem to be at odds? Therein
lies the fun* of Distributed Systems. Yay.

(*actually, not that much fun)

Ok, here's how it works. I'm assuming you have more than one hard drive on
each of your two servers. When Swift gets the PUT request, the proxy will
determine where the object data is supposed to be in the cluster. It does
this via hashing and ring lookups (this is deterministic, but the details
of that process aren't important here). The proxy will look for <replica
count> places to put the data. In your case, this is 2. Because of the way
the ring works, it will look for one drive on each of your two servers
first. It will not put the data on two drives on one server. So in the
Happy Path, the client makes a PUT request, the proxy sends the data to
both replicas, and after both have been fsync'd, the client gets a 201
Created response. [1]
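
To give a flavor of that hashing-and-ring-lookup step, here's a heavily
simplified illustration. It is not the real ring code (the real ring also
mixes a per-cluster hash prefix/suffix into the hash and accounts for
zones, regions, and device weights); the partition power and device list
below are made up.

    # Heavily simplified illustration of "hashing and ring lookups".
    # Not the real ring code: the real ring also mixes in a per-cluster
    # hash prefix/suffix and accounts for zones, regions, and weights.
    from hashlib import md5
    import struct

    PART_POWER = 10                     # 2**10 partitions (made up)
    PART_SHIFT = 32 - PART_POWER
    REPLICAS = 2

    # 8 made-up drives spread across two servers (10.0.0.1 and 10.0.0.2)
    devices = [{'id': i,
                'ip': '10.0.0.%d' % (i % 2 + 1),
                'device': 'sdb%d' % i} for i in range(8)]

    # replica -> partition -> device id (this is what a ring file encodes)
    replica2part2dev = [[(part + r) % len(devices)
                         for part in range(2 ** PART_POWER)]
                        for r in range(REPLICAS)]

    def get_nodes(account, container, obj):
        path = '/%s/%s/%s' % (account, container, obj)
        # hash the object path; the top bits become the partition number
        part = struct.unpack_from(
            '>I', md5(path.encode()).digest())[0] >> PART_SHIFT
        return part, [devices[replica2part2dev[r][part]]
                      for r in range(REPLICAS)]

    # Deterministic: the same name always maps to the same partition and
    # drives, and here the two replicas land on different servers.
    print(get_nodes('AUTH_test', 'photos', 'cat.jpg'))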

This is well and good, and the greatest part about it is that Swift can
guarantee read-your-creates. That is, when you create a new object, you are
immediately able to read it. However, what you describe is slightly
different. You're overwriting an existing object, and sometimes you're
getting back the older version of the object on a subsequent read. This is
normal and expected. Read on for why.

The above process is the Happy Path for when there are no failures in the
system. A failure could be a hardware failure, but it could also be some
part of the system being overloaded. Spinning drives have very real
physical limits to the amount of data they can read and write per unit
time. An overloaded hard drive can cause a read or write request to time
out, thus becoming a "failure" in the cluster.

So when you overwrite an object in Swift, the exact same process happens:
the proxy finds the right locations, sends the data to all those locations,
and returns a success if a quorum successfully fsync'd the data to disk.

However, what happens if there's a failure?

When the proxy determines the correct location for the object, it chooses
what we call "primary" nodes. These are the canonical locations where the
data is supposed to be right now. All the other drives in the cluster are
called "handoff" nodes. For a given object, some nodes (<replica count> of
them, to be exact) are primary nodes, and all the rest in the cluster are
handoffs. For another object, a different set of nodes will be primary, and
all the rest in the cluster are handoffs. This is the same regardless of
how many replicas you're using or how many drives you have in the cluster.
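
If you're curious, you can see the primary/handoff split for any object
name directly from the ring files on one of your nodes (this assumes the
usual /etc/swift/object.ring.gz location, and the account/container/object
names are just examples; the swift-get-nodes command-line tool does the
same thing):

    # List the primary and handoff nodes for one object name.
    # Assumes the ring file is at the usual /etc/swift/object.ring.gz;
    # run this on a node that has the ring files installed.
    from itertools import islice

    from swift.common.ring import Ring

    ring = Ring('/etc/swift/object.ring.gz')
    partition, primaries = ring.get_nodes('AUTH_test', 'photos', 'cat.jpg')

    print('partition:', partition)
    for node in primaries:            # exactly <replica count> primaries
        print('primary:', node['ip'], node['device'])

    # Every other eligible drive in the cluster is a potential handoff,
    # in a deterministic order; just show the first few.
    for node in islice(ring.get_more_nodes(partition), 4):
        print('handoff:', node['ip'], node['device'])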

So when there's a failure in the cluster and a write request comes in, what
happens? Again, the proxy finds the primary nodes for the object and it
tries to connect to them. However, if one (or more) can't be connected to,
then the proxy will start trying to connect to handoff nodes. After the
proxy gets <replica count> successful connections, it sends the data to
those storage nodes, the data is fsync'd, and the client gets a successful
response code (assuming at least a quorum were able to be fsync'd). Note
that in your case with two replicas, if the primary nodes were extra busy
(e.g. serving other requests) or actually failing (drives do that, pretty
often, in fact), then the proxy will choose a handoff location to write the
data. This means that even when the cluster has issues, your data is still
durably written.[2]
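
In pseudo-Python, that write path looks roughly like this. It's a sketch,
not the actual proxy-server code: try_connect and send_and_fsync are
hypothetical helpers, and the real proxy streams the body and does all of
this concurrently.

    # Sketch of the proxy write path described above (not the real code).
    # try_connect() and send_and_fsync() are hypothetical helpers standing
    # in for the proxy's connection handling and the object servers' fsync.
    def handle_put(ring, account, container, obj, body, replica_count, quorum):
        partition, primaries = ring.get_nodes(account, container, obj)
        handoffs = ring.get_more_nodes(partition)

        # Try primaries first; fall back to handoffs for any node that is
        # down or too busy (i.e. the connection attempt times out).
        connections = []
        for node in list(primaries) + list(handoffs):
            if len(connections) == replica_count:
                break
            conn = try_connect(node)       # returns None on error/timeout
            if conn is not None:
                connections.append(conn)

        # Send the data to all chosen nodes; each object server fsyncs
        # the object to disk before acknowledging.
        successes = sum(1 for conn in connections
                        if send_and_fsync(conn, body))
        return 201 if successes >= quorum else 503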

The read request path is very similar: the primary nodes are looked up, one
is selected at random, and if the data is there, it's returned. If the data
isn't there, the next primary is tried, and so on.
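
Here's the same sketch style for the read path (try_read is again a
hypothetical helper, not a real Swift function):

    # Sketch of the read path: pick primaries in random order and return
    # the first copy found. try_read() is a hypothetical helper that
    # returns None if the node is down or has no copy of the object.
    import random

    def handle_get(ring, account, container, obj):
        partition, primaries = ring.get_nodes(account, container, obj)
        nodes = list(primaries)
        random.shuffle(nodes)
        for node in nodes:
            response = try_read(node, partition, account, container, obj)
            if response is not None:
                return response
        return 404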

Ok, we're finally able to get down to answering your question.

Let's assume you have a busy drive in the cluster. You (over)write your
object, the proxy looks up the primary nodes, sees that one is busy (i.e.
gets a timeout), chooses a handoff location, writes the data, and you get a
201 response. Since this is an overwrite, you've got an old version of the
object on one primary, a new version on another primary, and a new version
on a handoff node. Now you do the immediate GET. The proxy finds the
primary nodes, randomly chooses one, and oh no! it chose the one with the
old data. Since there's data there, that version of the object gets
returned to the client, and you see the older version of the data.
Eventually, the background replication daemon will ensure that the old
version on the primary is updated and the new version on the handoff is
removed.
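
To put a rough number on it: with 2 replicas and one primary holding the
old version, a random read from the primaries returns the stale copy about
half the time until replication repairs it (about a third of the time with
3 replicas). A purely illustrative simulation of just that random choice:

    # Purely illustrative: after the overwrite scenario above, one primary
    # still holds the old version and the rest hold the new one. A GET
    # picks one primary at random.
    import random

    def stale_fraction(replica_count, trials=100000):
        primaries = ['old'] + ['new'] * (replica_count - 1)
        stale = sum(random.choice(primaries) == 'old' for _ in range(trials))
        return stale / float(trials)

    print(stale_fraction(2))   # roughly 0.5
    print(stale_fraction(3))   # roughly 0.33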

The way all of this works is a deliberate design choice in Swift. It means
that when there are failures in the cluster, Swift will continue to provide
durability and availability for your data, despite sometimes providing a
stale view of what's there. The technical term for this is "eventual
consistency". The other model you can have in distributed systems is called
"strong consistency". In strong consistency, you get durability but not
availability (i.e. you'd get an error instead of a success in the above
scenario). However, you also won't get stale data. Neither is particularly
better than the other; it really comes down to the use case. In general,
eventually consistent systems can scale larger than strongly consistent
systems, and eventually consistent systems can more easily provide
geographic dispersion of your data (which Swift can do). Basically, the
difference comes down to "what happens when there are failures in the
system?".

I hope that helps. Please ask if you have other questions.

[1] To generalize this to *any* number of replicas, the proxy concurrently
writes to <replica count> places and returns a success after a quorum have
successfully been written. It's important to note that the <replica count>
number of replicas will be written at the time of the request; it's *not* a
write-once-and-replicate-later process. A quorum is determined by the
replica count: it's half, rounded up (half plus one for odd replica counts
and exactly half for even ones). So in your case with 2 replicas, a quorum
is 1. If you had 3 replicas, the quorum would be 2.
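
In code form, the quorum rule from this footnote is simply (a sketch
matching the description here, not the exact function from the Swift
source):

    # Quorum as described in footnote [1]: half the replica count, rounded
    # up (half plus one for odd replica counts, exactly half for even ones).
    def quorum(replica_count):
        return (replica_count + 1) // 2

    assert quorum(2) == 1
    assert quorum(3) == 2
    assert quorum(4) == 2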

[2] Thinking about this more, you'll see that this leads to a really cool
scaling property of Swift: the system gets better (wrt performance,
durability, and availability) as it gets bigger.

--John

-- 
-Shyam