[Openstack] [Swift] Unexplained 404s

Clay Gerrard clay.gerrard at gmail.com
Mon May 23 23:20:08 UTC 2016


On Mon, May 23, 2016 at 1:49 PM, Shrinand Javadekar <shrinand at maginatics.com> wrote:

>
> If objects are placed on different devices than the computed ones,
> they will be unavailable until the replication places them at the
> correct location.


This part doesn't sound quite right to me, but the transaction logs will
tell.

My guess is that if the nodes the data is getting written to (primary or
handoff) are so overloaded that they're timing out, it's possible that after
request_node_count attempts against the backend storage nodes the response
still ends up looking like a 404, because none of the nodes that were able
to respond had the data.
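For reference, request_node_count is a proxy-server setting that takes either a
plain number or a multiple of the replica count; the sample config documents the
default as `2 * replicas`. A minimal sketch of how such a value expands to a
concrete node budget, assuming a 3-replica ring (`expand_request_node_count` is a
hypothetical helper, not Swift's own parser):

```python
def expand_request_node_count(setting: str, replicas: int) -> int:
    """Turn a value such as "2 * replicas" or "6" into a node count.

    Hypothetical helper for illustration only; it is not Swift's parser.
    """
    value = setting.strip()
    if value.endswith("replicas"):
        # "N * replicas": multiply the leading factor by the replica count
        factor = value[: -len("replicas")].strip().rstrip("*").strip()
        return int(factor) * replicas
    # Plain integer: use it as-is
    return int(value)


# With the documented default and a 3-replica ring, a GET may ask up to
# 6 nodes: the 3 primaries plus the first 3 handoffs.
print(expand_request_node_count("2 * replicas", 3))  # 6
print(expand_request_node_count("9", 3))             # 9
```

Any node beyond that budget is never asked, so data sitting further down the
handoff list is invisible to the GET.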

Imagine that on the original PUT all three primaries fail with a connect
Timeout, and so the request is streamed to the first three handoffs, only two
of which complete successfully. The storage nodes responded [Timeout, Timeout,
Timeout, 201, 201, 503*] - so it's written successfully to two of the three
handoff nodes (*ChunkWriteTimeout on the third; remember this cluster is
terribly overloaded).
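A toy model of that PUT, assuming a 3-replica policy with a simple majority
quorum (2 of 3). `simulate_put` and the `TIMEOUT` marker are hypothetical names
for illustration, not Swift's proxy code; a connect timeout just moves the
request on to the next node in ring order:

```python
TIMEOUT = "Timeout"  # hypothetical marker: connect timed out, try next node


def simulate_put(node_results, replicas=3):
    """Toy model of the PUT described above (not Swift's proxy code).

    node_results lists, in ring order (primaries first, then handoffs),
    what each node would do: TIMEOUT if the connect times out, otherwise
    the final status it returns after the body has been streamed.
    """
    quorum = replicas // 2 + 1  # simple majority: 2 of 3
    connected = []
    for index, result in enumerate(node_results):
        if len(connected) == replicas:
            break  # already streaming to `replicas` nodes
        if result == TIMEOUT:
            continue  # skip this node and try the next one in order
        connected.append((index, result))
    # Nodes that actually stored the object
    written = [index for index, status in connected if status == 201]
    status = 201 if len(written) >= quorum else 503
    return status, written


status, written = simulate_put([TIMEOUT, TIMEOUT, TIMEOUT, 201, 201, 503])
print(status, written)  # 201 [3, 4]
```

The client gets a 201, but the only copies live on nodes 3 and 4, the first two
handoffs.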

Then on GET the response might be [404, 404, 404, Timeout, Timeout, 404,
404, 404, 404] - the three primaries miss, of course, but if the first two
handoffs then time out, it doesn't matter how many other handoff nodes are
checked - the response has to be 404, because the two places we wrote the data
are both so hammered under load that they can't respond.
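The matching GET, again as a toy model rather than Swift's actual proxy logic
(`simulate_get` is hypothetical). It walks the same node order, returns the
first 200, skips nodes that time out, and otherwise falls back to 404. The
budget of 9 nodes is an assumption chosen to cover the nine responses listed
above; the documented default is `2 * replicas`:

```python
TIMEOUT = "Timeout"  # hypothetical marker: node did not answer in time


def simulate_get(node_results, request_node_count):
    """Toy model of the GET described above (not Swift's proxy code).

    Nodes are asked in ring order; the first 200 wins. Timed-out nodes
    contribute nothing, so if every node that did answer returned 404,
    the client sees 404 even though the data exists somewhere.
    """
    answered = []
    for result in node_results[:request_node_count]:
        if result == TIMEOUT:
            continue  # no answer from this node
        if result == 200:
            return 200
        answered.append(result)
    return 404 if 404 in answered else 503


# Nodes 3 and 4 hold the object but are too overloaded to answer in time.
get_results = [404, 404, 404, TIMEOUT, TIMEOUT, 404, 404, 404, 404]
print(simulate_get(get_results, request_node_count=9))  # 404
```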

But it's not because replication needs to "move" anything. Yes, the data will
eventually get moved from the handoffs to the primaries, but in the meantime
the read path is going to use the same stable handoff pattern as the write
path.
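To illustrate why the read path lands on the same handoffs as the write path,
here is a sketch of a deterministic preference list. It swaps in rendezvous
hashing for simplicity; Swift's ring assigns partitions and handoffs
differently, and `node_order`, the device names, and the object path are all
hypothetical:

```python
import hashlib


def node_order(object_path, devices):
    """Deterministic preference list for an object (hypothetical helper).

    Uses rendezvous (highest-random-weight) hashing, NOT Swift's ring:
    each device is scored by md5(object_path + "/" + device) and devices
    are sorted by score. The first `replicas` entries act as primaries,
    the rest as handoffs, always in the same order.
    """
    def score(device):
        digest = hashlib.md5(f"{object_path}/{device}".encode()).hexdigest()
        return int(digest, 16)

    return sorted(devices, key=score, reverse=True)


devices = [f"d{i}" for i in range(9)]  # hypothetical device names
path = "/AUTH_test/photos/cat.jpg"     # hypothetical object path
replicas = 3

put_order = node_order(path, devices)
get_order = node_order(path, devices)
# Same input, same order: the GET walks primaries and handoffs exactly
# as the PUT did, so it looks on the handoffs the PUT fell back to.
assert put_order == get_order
print("primaries:", put_order[:replicas])
print("handoffs: ", put_order[replicas:])
```

The point is only that the ordering is a pure function of the object path and
the device list, so no replication pass is needed for the GET to find a handoff
copy, as long as that handoff can answer.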

... but that's just my guess; it's a rather curious failure mode, and the
transaction logs would have all the details. Happy hunting!

-Clay



> And this could take a really long time. Is that
> right?
>
> -Shri
>
> On Fri, May 20, 2016 at 4:53 PM, Mark Kirkwood
> <mark.kirkwood at catalyst.net.nz> wrote:
> > On 21/05/16 05:27, Shrinand Javadekar wrote:
> >>
> >> Hi,
> >>
> >> I am troubleshooting a test setup where Swift returned a 201 for
> >> objects that were put into it, but later, when I tried to read them,
> >> I got back 404s.
> >>
> >> The system has been under load. I see lots of connection errors,
> >> lock-timeouts, etc. However, I am not sure Swift should ever be
> >> returning a 404.
> >>
> >> I tried simulating some of these on a different setup and always got
> >> the expected response (which wasn't a 404).
> >>
> >> - Stopped memcached and did a blob get. This returned a 401 Unauthorized
> >> error.
> >>
> >> - Stopped the object-server and did a blob get. This returned a 503
> >> Service Unavailable error.
> >>
> >> - Stopped the container-server. This didn't have any effect. The
> >> container-server is not consulted during every GET.
> >> - Stopped the account-server. Same result as container-server.
> >>
> >> Any ideas on when Swift might return a 404 even though the object was
> >> successfully written?
> >>
> >
> > In addition to what John said, I've seen that sort of behaviour on slow or
> > heavily loaded systems (e.g.):
> >
> > - write an object (successful)
> > - immediately try to read it (404)
> > - a few minutes later try to read it (successful)
> >
> > This is because the replication step can take some time to place the
> > object on all the devices where it is supposed to live (i.e. a read may
> > not always look at where the object has just been written).
> >
> > Cheers
> >
> > Mark
> >
> >
> > _______________________________________________
> > Mailing list:
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
> > Post to     : openstack at lists.openstack.org
> > Unsubscribe :
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>

