[Openstack] [Swift] Unexplained 404s

Clay Gerrard clay.gerrard at gmail.com
Tue May 24 20:22:14 UTC 2016


On Tue, May 24, 2016 at 11:59 AM, Shrinand Javadekar <
shrinand at maginatics.com> wrote:

>
> I found the object written into the second handoff node.
>

Are you running only a single replica!?  Was the object data *only* on the
second handoff?!  If the original PUT request did not return success, you're
much more likely to see unspecified behavior on the read path.


>
> 1. So when the replicator catches up, it will move the object back to
> the correct location. Is that right?
>

The read path will find the object on any primary or any handoff location.
The replicator *will* copy the data files to the primary and delete them from
the handoff once they're successfully in sync.  But GETs for the object will
be able to find the object during that entire process.  Having data written
to a handoff location does not mean it is inaccessible - quite the opposite
- stable handoff ordering is the mechanism that enables data to be
accessible during failure of primary storage devices.
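
To make the idea concrete, here is a toy sketch of a read that walks the
primaries first and then the handoffs in a fixed order.  It is only an
illustration under made-up assumptions, not Swift's actual proxy code: the
node names, the dict-based "stores", and both function names are
hypothetical.

```python
# Toy model of a read path that tries primaries first, then handoffs.
# NOT Swift's implementation; node names and the "stores" dict are made up.

def read_order(primaries, handoffs):
    """Return nodes in the (stable) order a read would try them."""
    return list(primaries) + list(handoffs)


def get_object(name, primaries, handoffs, stores, down=()):
    """Return (node, data) from the first reachable node holding `name`.

    stores maps node -> {object name: data}; `down` lists unreachable nodes.
    Returns (None, None) when no reachable node has the object (a 404).
    """
    for node in read_order(primaries, handoffs):
        if node in down:
            continue  # unreachable node: move on to the next candidate
        data = stores.get(node, {}).get(name)
        if data is not None:
            return node, data
    return None, None


primaries = ["p1", "p2", "p3"]
handoffs = ["h1", "h2"]
# The write landed only on the second handoff, as in the report.
stores = {"h2": {"obj": b"payload"}}

found = get_object("obj", primaries, handoffs, stores)
missing = get_object("obj", primaries, handoffs, stores, down=("h2",))
```

In this toy model the object is still found on "h2" even though no primary
has it; the read only misses when the one node holding the data cannot
respond.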

> 2. Is there a way to disable handoffs?
>

No, nor would you want to.  A failure or unavailability (provably
indistinguishable in distributed systems) of a primary storage device
should not prevent an AP system from accepting the write (the "available"
part) because it must service reads for that data from the non-primary
locations where the data was written (the "partition tolerant" part).  This
is an old, stable and well understood behavior of the Swift architecture -
it's fascinating that you're delving into issues related to this process -
it's possible you're exercising the system in a highly unusual way (single
replica?) or under *extreme* duress (artificial benchmarking workload
saturating the public facing network without accounting for write
amplification on the cluster facing network, or request rates totally out
of line with the IOPS available on the storage devices in the cluster?).
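
As a rough back-of-the-envelope for the write amplification point: with a
replicated storage policy, each byte a client uploads is sent on to every
replica, so cluster facing write traffic is roughly the public ingest rate
times the replica count.  The numbers and the function name below are
assumptions picked for illustration, not measurements from any cluster.

```python
def cluster_write_mb_s(public_ingest_mb_s, replicas):
    """Approximate cluster-facing write traffic for a replicated policy."""
    return public_ingest_mb_s * replicas


ingest = 400    # MB/s arriving on the public network (assumed)
replicas = 3    # assumed replica count
backend = cluster_write_mb_s(ingest, replicas)
print(backend)  # MB/s the cluster-facing network must absorb
```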

I hope you're able to dig into the transaction logs and provide more
information as we can currently only make conjectures about what you might
have observed.

The real question to be answered from the transaction logs is what requests
were made, and what responses were logged on the storage nodes when the
user facing request returned 404.  Can you tell whether the node(s) that had
the data were for some reason unable to respond when you observed the 404?
Find a txn_id on the request that returned 404 after a 201* and capture all
log lines from all nodes matching that txn_id.  GL!
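
If it helps, pulling those lines together can be as simple as the sketch
below.  It assumes you have already gathered each node's log lines into
memory; the sample log lines, node names, and txn_ids are invented for
illustration and do not reflect Swift's real log format.

```python
def collect_txn_lines(logs_by_node, txn_id):
    """Return (node, line) pairs for every log line containing txn_id."""
    matches = []
    for node, lines in sorted(logs_by_node.items()):
        for line in lines:
            if txn_id in line:
                matches.append((node, line.rstrip("\n")))
    return matches


# Made-up sample data: node name -> log lines gathered from that node.
sample_logs = {
    "proxy1": [
        "GET /v1/AUTH_test/c/o 404 txn: tx0123abcd\n",
        "GET /v1/AUTH_test/c/other 200 txn: tx9999ffff\n",
    ],
    "storage1": ["GET /sdb1/123/AUTH_test/c/o 404 txn: tx0123abcd\n"],
    "storage2": ["GET /sdb2/456/AUTH_test/c/other 200 txn: tx9999ffff\n"],
}

result = collect_txn_lines(sample_logs, "tx0123abcd")
for node, line in result:
    print(node, line)
```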

-Clay

* bonus points if you can find the txn_id for the 201 and capture those
log lines as well