<br><br><div class="gmail_quote">On Thu, Sep 16, 2010 at 2:26 AM, FUJITA Tomonori <span dir="ltr"><<a href="mailto:fujita.tomonori@gmail.com">fujita.tomonori@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

2010/9/14 Gregory Holt <<a href="mailto:gholt@rackspace.com">gholt@rackspace.com</a>>:<br>

<div class="im"><br>

>> Read-your-writes consistency works nicely for the iSCSI service. We<br>

>> could live with weaker consistency models though.<br>

><br>

> Swift has the small possibility you'd read older data, even just after writing newer data with the same HTTP Keep-Alive connection.<br>

><br>

> Example scenario: PUT obj(v1) goes to the three replica nodes desired (1-2-3), no problem on read; then PUT obj(v2) times out on the first replica node (x-2-3) but succeeds with two of the three saving the data, but a read that succeeds on node 1 will return obj(v1).<br>


><br>

> We have discussed making read hit all known replicas and return the greatest version, but we have to test the impact of that at scale first.<br>

<br>

</div>Can we support that optionally?  e.g. selecting a consistency model<br>

per container or object?</blockquote><div><br></div><div>In theory, but that could get complicated.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="im">

> Even with greatest version support, there always is a chance that only one node could be reached on read, and that node might have older data.<br>

<br>

</div>Yeah, in such a case (nodes having the latest data are down, etc), it<br>

would be better if a client gets an I/O error explicitly. But I don't<br>

think that it's easy<br>

to guarantee that (we did for Sheepdog storage system). Getting old<br>

data is kinda silent data corruption, which could happen even with<br>

real disk.<br>

I think that we could live with that if the possibility is small (as<br>

you know, some file systems can handle such failure).</blockquote><div><br></div><div>This is the heart of the CAP theorem. In the event of a partition (failure), no distributed system can guarantee both that it will respond and that the response carries the latest data. Choosing 'availability' means there is some chance that the data returned is stale/inconsistent.</div>
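<div><br></div><div>To make the scenario from earlier in the thread concrete, here is a rough in-memory sketch. It is not Swift code: Replica, quorum_put, read_first and read_greatest are hypothetical names made up for illustration, and a "timeout" is simulated with a flag. It shows the x-2-3 write, the stale read from node 1, the "return the greatest version" read, and the case where only the stale node can be reached.</div>

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory stand-in for one storage node (not Swift's API)."""
    name: str
    up: bool = True
    store: dict = field(default_factory=dict)  # key -> (version, data)

    def put(self, key, version, data):
        if not self.up:
            raise TimeoutError(f"{self.name} timed out")
        self.store[key] = (version, data)

    def get(self, key):
        if not self.up:
            raise TimeoutError(f"{self.name} timed out")
        return self.store[key]


def quorum_put(replicas, key, version, data):
    """Try every replica; report success if a majority stored the data."""
    acks = 0
    for replica in replicas:
        try:
            replica.put(key, version, data)
            acks += 1
        except TimeoutError:
            pass
    return acks > len(replicas) // 2


def read_first(replicas, key):
    """Return whatever the first responding replica has."""
    for replica in replicas:
        try:
            return replica.get(key)
        except (TimeoutError, KeyError):
            continue
    raise OSError("no replica reachable")


def read_greatest(replicas, key):
    """Ask every reachable replica and return the highest version seen."""
    answers = []
    for replica in replicas:
        try:
            answers.append(replica.get(key))
        except (TimeoutError, KeyError):
            continue
    if not answers:
        raise OSError("no replica reachable")
    return max(answers, key=lambda answer: answer[0])


nodes = [Replica("node1"), Replica("node2"), Replica("node3")]
quorum_put(nodes, "obj", 1, "v1")          # lands on 1-2-3
nodes[0].up = False
quorum_put(nodes, "obj", 2, "v2")          # lands on x-2-3, still a majority
nodes[0].up = True

stale = read_first(nodes, "obj")           # node1 answers first, with v1
fresh = read_greatest(nodes, "obj")        # v2 wins across all replicas

nodes[1].up = nodes[2].up = False
still_stale = read_greatest(nodes, "obj")  # only node1 reachable, so v1 again
```

<div>The last line is the point: reading every replica narrows the window for stale data, but it cannot close it when the only reachable node missed the write.</div>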

<div><br></div><div>When consistent data is a higher priority than available data, an eventually consistent store like Swift is probably not the best choice.</div><div><br></div><div>The key is understanding what the choices are and making an appropriate choice based on system requirements.</div>
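<div><br></div><div>For contrast, this is roughly what choosing consistency over availability looks like on the read path. It is only a sketch under stated assumptions: strict_read and StaleRiskError are hypothetical names, writes are assumed to succeed only when a majority of replicas store them, and nothing here reflects how Swift behaves. If fewer than a majority of replicas answer, the read fails explicitly instead of risking old data.</div>

```python
class StaleRiskError(OSError):
    """Hypothetical error: too few replicas answered to rule out stale data."""


def strict_read(responses, total_replicas):
    """Return the newest (version, data) pair, or refuse to answer.

    responses holds one (version, data) pair per replica that answered.
    Assumes writes only succeed on a majority, so any majority of readers
    overlaps the latest successful write by at least one replica.
    """
    majority = total_replicas // 2 + 1
    if len(responses) < majority:
        raise StaleRiskError(
            f"only {len(responses)} of {total_replicas} replicas answered"
        )
    return max(responses, key=lambda response: response[0])


# Two of three replicas answered: safe to return the newest version.
newest = strict_read([(1, "v1"), (2, "v2")], 3)

# One of three answered: fail loudly rather than return possibly old data.
try:
    strict_read([(1, "v1")], 3)
    refused = False
except StaleRiskError:
    refused = True
```

<div>The cost is the one described above: during a partition this read returns an error where an availability-first store would still return something.</div>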

<div><br></div></div>