[openstack-dev] [swift] eventual consistency and lost updates

John Dickinson me at not.mn
Fri Jun 27 17:06:09 UTC 2014

Great questions. I'll answer inline.

On Jun 27, 2014, at 2:54 AM, Eoghan Glynn <eglynn at redhat.com> wrote:

> Hi Swiftsters!
> A basic question about swift eventual- versus strong-consistency.
> The context is potentially using swift as a store for metric data.
> Say datapoints {p1, p2, ..., p_i} are stored in a swift object.
> Presumably these data are replicated across the object ring in
> an eventually consistent way.

Correct, but let me provide some more depth for those watching at home.

When an object is written into Swift, multiple replicas are durably written across different failure domains before a success response is returned to the client. For a three-replica cluster, three writes are attempted, and success is returned only once two or three durable writes (i.e. flushed all the way to disk) have succeeded.
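
To make the success rule concrete, here is a minimal Python sketch. The function names are hypothetical, not Swift's internals, and generalizing "two of three" to a simple majority for other replica counts is my assumption for illustration.

```python
def quorum_size(replica_count):
    """Smallest number of durable writes that forms a majority (assumed rule)."""
    return replica_count // 2 + 1


def write_succeeds(durable_write_results):
    """durable_write_results holds one bool per attempted replica write."""
    successes = sum(1 for ok in durable_write_results if ok)
    return successes >= quorum_size(len(durable_write_results))


# Three-replica cluster: two durable writes are enough, one is not.
print(write_succeeds([True, True, False]))
print(write_succeeds([True, False, False]))
```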

Swift chooses those three replica locations (i.e. drives) based on the ring. However, if there is a failure condition in the cluster, one or more of those three locations may not be available. In that case, Swift deterministically chooses other drives in the cluster until it finds three that are available for writing. Then the write happens and success or failure is returned to the client depending on how many writes were successful.
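
The "deterministically chooses other drives" step can be sketched as follows. This is a toy stand-in, not the real ring: the hash-based ordering and the names candidate_drives and choose_write_targets are assumptions made only to show the idea that every node computes the same fallback order.

```python
import hashlib


def candidate_drives(object_name, all_drives):
    """Order drives deterministically for an object name (toy stand-in for the ring)."""
    def sort_key(drive):
        return hashlib.sha256(f"{object_name}/{drive}".encode()).hexdigest()
    return sorted(all_drives, key=sort_key)


def choose_write_targets(object_name, all_drives, available, replicas=3):
    """Walk the deterministic order, skipping drives that are unavailable."""
    targets = []
    for drive in candidate_drives(object_name, all_drives):
        if drive in available:
            targets.append(drive)
        if len(targets) == replicas:
            break
    return targets


drives = ["d1", "d2", "d3", "d4", "d5"]
order = candidate_drives("objectA", drives)
# With everything healthy, the first three candidates are used.
print(choose_write_targets("objectA", drives, set(drives)) == order[:3])
```

If the third candidate is unavailable, the same call returns the first two candidates plus the fourth, and any node asking the same question gets the same answer.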

Consider the following example:

Time T0:

PUT objectA (content hash H1), and it gets written to drives 1, 2, and 3.

Time T1:

The server that drive 3 is plugged in to fails.

Time T2:

PUT objectA (content hash H2), and now it gets written to drives 1, 2, and 4.

Time T3:

Access to the server that drive 3 is plugged in to is restored.

At this point we have the following distribution of data:

drive1: content H2
drive2: content H2
drive3: content H1
drive4: content H2

Time T4:

GET for objectA -> Swift will (by default) choose one of drives 1, 2, and 3 at random and return that data to the client.

You can see how it's possible for Swift to return the old copy of objectA (a 1/3 chance).
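
For anyone who wants to replay the timeline, here is a small Python sketch. The dict-based store and the put helper are hypothetical stand-ins, not Swift code; the drive names and hashes come from the example above.

```python
from fractions import Fraction

store = {}  # drive name -> content hash currently on that drive


def put(drives, content_hash):
    """Write the same content to every drive in the list."""
    for drive in drives:
        store[drive] = content_hash


put(["drive1", "drive2", "drive3"], "H1")  # T0
put(["drive1", "drive2", "drive4"], "H2")  # T2, drive3 is unreachable

primaries = ["drive1", "drive2", "drive3"]  # a GET picks one of these at random
stale = [d for d in primaries if store[d] != "H2"]
p_stale = Fraction(len(stale), len(primaries))
print(stale, p_stale)  # ['drive3'] 1/3
```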

Swift's replication process is continuously running in the background on Swift servers. On each of the servers that drives 1-4 are respectively plugged in to, when objectA is found locally, it will query drives 1-3 to ensure that the right data is in the right place.

Replication will ensure that drive4's objectA with content H2 is removed (once it's known that the right data is on each of drives 1-3), and replication will also ensure that drive3's objectA with content H1 is replaced with objectA with content H2.

The conflict resolution here is last write wins.
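
As a rough illustration of last write wins, the end state after replication could be modeled like this. The reconcile function and the numeric timestamps are hypothetical; the real replicator works quite differently in detail, so treat this as a sketch of the outcome only.

```python
def reconcile(copies, primaries):
    """Return the state after replication settles.

    copies maps drive -> (timestamp, content_hash). The copy with the newest
    timestamp wins and ends up on every primary drive; copies on non-primary
    (handoff) drives are dropped.
    """
    newest = max(copies.values(), key=lambda copy: copy[0])
    return {drive: newest for drive in primaries}


copies = {
    "drive1": (2.0, "H2"),
    "drive2": (2.0, "H2"),
    "drive3": (0.0, "H1"),  # stale copy written at T0
    "drive4": (2.0, "H2"),  # handoff copy written at T2
}
settled = reconcile(copies, ["drive1", "drive2", "drive3"])
print(settled)
```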

(Note that the above example is only for the failure scenario where a server has failed, is busy, or is otherwise incapable of responding to requests. Other failure scenarios like a drive failure--which is more common than server failure--have slightly different, but similar, behaviors. This example is provided as a simple one for explanation purposes.)

Back in January 2013, I gave a full talk on this and other failure scenarios in Swift. The recording is at https://www.youtube.com/watch?v=mKZ7kDDPSIU

> Say I want to read back this blob and update it with a further
> datapoint {p_(i+1)}.
> But eventual consistency tells me that I may end up folding my
> new datapoint into an older version of the object:
>  {p1, p2, ..., p_(i-1)}
> instead of the expected:
>  {p1, p2, ..., p_i}
> i.e. the classic lost update problem.
> So my basic questions are:
> * is read-then-update an acknowledged anti-pattern for swift?

Yes. You cannot guarantee that nothing happens between the read and the write, i.e. you can't perform two Swift API calls in an atomic transaction.
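
A tiny Python sketch of why this bites, using an in-memory dict as a hypothetical stand-in for the object store. Two clients each do a read followed by a write, and the second write silently discards the first client's datapoint.

```python
store = {"metrics": ["p1", "p2"]}


def read(name):
    """Return a copy of the stored datapoints."""
    return list(store[name])


def write(name, datapoints):
    """Overwrite the whole object, as a PUT would."""
    store[name] = list(datapoints)


# Both clients read before either one writes.
seen_by_a = read("metrics")
seen_by_b = read("metrics")

write("metrics", seen_by_a + ["p3"])  # client A appends p3
write("metrics", seen_by_b + ["p4"])  # client B appends p4 to the old version

print(store["metrics"])  # ['p1', 'p2', 'p4'], p3 has been lost
```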

> * if so, what are the recommended strategies for managing non-
>   static data in swift?
>   - e.g. write latest to a fresh object each time, do an async
>     delete on the old
>   - store a checksum elsewhere and detect the stale-read case

Both of these are quite acceptable, and it depends on what you are able to do on the client side. Yes, the checksum for the content of the object is stored with and returned with the object (in the ETag header).
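
The stale-read check might look something like the sketch below. The names checked_read, StaleRead and the fetch callables are hypothetical, and I'm assuming the ETag of an ordinary object is the MD5 hex digest of its body. The expected value would be recorded somewhere you can read consistently.

```python
import hashlib


class StaleRead(Exception):
    """Raised when the returned ETag does not match the recorded one."""


def etag_for(body):
    """MD5 hex digest of the body (assumed to match the object's ETag)."""
    return hashlib.md5(body).hexdigest()


def checked_read(fetch, expected_etag):
    """fetch() returns (body, etag), standing in for a GET on the object."""
    body, etag = fetch()
    if etag != expected_etag:
        raise StaleRead(f"expected {expected_etag}, got {etag}")
    return body


new_body = b"p1,p2,p3"
old_body = b"p1,p2"
expected = etag_for(new_body)  # recorded at write time, outside the object store


def fresh_replica():
    return new_body, etag_for(new_body)


def stale_replica():
    return old_body, etag_for(old_body)


print(checked_read(fresh_replica, expected))
```

On a StaleRead the client can retry the GET or back off, rather than folding a new datapoint into old data.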

As an extra thing to check out, Netflix is able to work around the eventual consistency in S3 by using a consistent DB for tracking what should be where. Their project to do this is called S3mper, and it's quite possible to use the same strategy for Swift.
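
To give a feel for the general idea (this is not S3mper's actual design, just a sketch under my own assumptions), you could combine "write a fresh object each time" with a consistent index that records which object is current. Both dicts below are hypothetical stand-ins: object_store for Swift, index for the consistent DB.

```python
import uuid

object_store = {}  # stand-in for Swift: object name -> datapoints
index = {}         # stand-in for a consistent DB: logical name -> current object name


def write_version(logical_name, datapoints):
    """Store a fresh, uniquely named object, then point the index at it.

    Returns the previous object name (or None) so the caller can delete it
    asynchronously.
    """
    object_name = f"{logical_name}/{uuid.uuid4().hex}"
    object_store[object_name] = list(datapoints)
    previous = index.get(logical_name)
    index[logical_name] = object_name
    return previous


def read_latest(logical_name):
    """Look up the current object name in the index, then fetch that object."""
    return list(object_store[index[logical_name]])


write_version("metrics", ["p1"])
old_name = write_version("metrics", ["p1", "p2"])
print(read_latest("metrics"))  # ['p1', 'p2']
```

Because each object name is written exactly once, a reader never sees an older version under the same name. Concurrent writers would still need something like a compare-and-swap on the index to avoid losing updates.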

> * are the object metadata (ETag, X-Timestamp specifically)
>   actually replicated in the same eventually-consistent way as
>   the object content?
>   (the PUT code[1] suggests data & metadata are stored together,
>    but just wanted to be sure I'm reading that correctly)

The metadata for the object is stored with the object. Metadata and object content are _not_ replicated separately. They are always kept together.

> Thanks,
> Eoghan
> [1] https://github.com/openstack/swift/blob/master/swift/obj/server.py#L441-455
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
