<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Tue, May 24, 2016 at 11:59 AM, Shrinand Javadekar <span dir="ltr"><<a href="mailto:shrinand@maginatics.com" target="_blank">shrinand@maginatics.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

I found the object written into the second handoff node.<br></blockquote><div> </div><div>Are you running only a single replica!?  Was the object data *only* on the second handoff?!  If the original PUT request did not return success, it's much more likely that you would see unspecified behavior on the read path.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

1. So when the replicator catches up, it will move the object back to<br>

the correct location. Is that right?<br></blockquote><div><br></div><div>The read path will find the object on any primary or any handoff location.  The replicator *will* copy the data files to the primary and delete them from the handoff once the primary is successfully in sync.  But GETs for the object will be able to find it during that entire process.  Having data written to a handoff location does not mean it is inaccessible - quite the opposite: stable handoff ordering is the mechanism that keeps data accessible during failure of primary storage devices.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

2. Is there a way to disable handoffs?<br></blockquote><div><br></div><div>No, nor would you want to.  A failure or unavailability (provably indistinguishable in distributed systems) of a primary storage device should not prevent an AP system from accepting the write (the "available" part), and it must then service reads for that data from the non-primary locations where the data was written (the "partition tolerant" part).  This is an old, stable and well understood behavior of the Swift architecture - it's fascinating that you're delving into issues related to this process - it's possible you're exercising the system in a very unusual way (single replica?) or under *extreme* duress (an artificial benchmarking workload saturating the public facing network without accounting for write amplification on the cluster facing network, or request rates totally out of line with the iops available on the storage devices in the cluster?).</div><div><br></div><div>I hope you're able to dig into the transaction logs and provide more information, as we can currently only make conjectures about what you might have observed.</div><div><br></div><div>The real question to be answered from the transaction logs is what requests were made, and what responses were logged on the storage nodes, when the user facing request returned 404.  Can you tell whether the node(s) that had the data were for some reason unable to respond when you observed the 404?  Find a txn_id on the request that returned 404 after a 201* and capture all log lines from all nodes matching that txn_id.  GL!</div><div><br></div><div>-Clay</div><div><br></div><div>* bonus points if you can find the txn_id for the 201 and capture those log lines as well</div></div></div></div>
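<div><br></div><div>One way to capture the matching log lines, as a minimal sketch rather than a Swift tool: it assumes you have already copied each node's log file somewhere readable and that the txn_id appears verbatim in every relevant line. The function name, the node names and the shape of the input mapping are hypothetical.</div>

```python
def lines_for_txn(txn_id, logs_by_node):
    """Return (node, line) pairs for every log line that mentions txn_id.

    logs_by_node is a hypothetical mapping of node name to an iterable of
    log lines, for example an open file handle for that node's log file.
    """
    matches = []
    # Sort by node name so the output is stable from run to run.
    for node, lines in sorted(logs_by_node.items()):
        for line in lines:
            # Plain substring match; assumes the txn_id is logged verbatim.
            if txn_id in line:
                matches.append((node, line.rstrip("\n")))
    return matches
```

<div>Passing a dict of open file handles, one per proxy and storage node, returns every line for that transaction grouped by node. Running it once for the 404 txn_id and once for the 201 txn_id gives the two sets of lines asked for above.</div>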