[openstack-dev] [swift] - question about statsd messages and 404 errors

Samuel Merritt sam at swiftstack.com
Fri Jul 25 22:46:38 UTC 2014


On 7/25/14, 4:58 AM, Seger, Mark (Cloud Services) wrote:
> I’m trying to track object server GET errors using statsd and I’m not
> seeing them.  The test I’m doing is to simply do a GET on a
> non-existent object.  As expected, a 404 is returned and the object
> server log records it.  However, statsd implies it succeeded because
> there were no errors reported.  A read of the admin guide does clearly
> say the GET timing includes failed GETs, but my question then becomes
> how is one to tell there was a failure?  Should there be another type of
> message that DOES report errors?  Or how about including these in the
> ‘object-server.GET.errors.timing’ message?

What "error" means with respect to Swift's backend-server timing metrics 
is pretty fuzzy at the moment, and could probably use some work.

The idea is that object-server.GET.timing has timing data for everything 
that Swift handled successfully, and object-server.GET.errors.timing has 
timing data for things where Swift failed.

Some things are pretty easy to divide up. For example, a 200-series status 
code always counts as success, and a 500-series status code always counts 
as an error.

It gets tricky in the 400-series status codes. For example, a 404 means 
that a client asked for an object that doesn't exist. That's not Swift's 
fault, so that goes into the success bucket (object-server.GET.timing). 
Similarly, a 412 means that a client set an unsatisfiable precondition 
in the If-Match, If-None-Match, If-Modified-Since, or 
If-Unmodified-Since headers, and Swift correctly determined that the 
requested object can't fulfill the precondition, so that one goes in the 
success bucket too.
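
As a rough sketch of the bucketing described so far (this is not Swift's 
actual code; the function names are made up for illustration, and the 
metric names follow the object-server.GET.errors.timing form quoted in the 
question), the rule could look like this. Status codes that the rule does 
not settle are reported as "ambiguous" rather than guessed at:

```python
# Illustrative sketch only; classify() and timing_metric() are hypothetical
# helpers, not functions from the Swift code base.

CLIENT_CAUSED_SUCCESS = (404, 412)  # 4xx codes described above as "not Swift's fault"


def classify(status):
    """Return 'success', 'error', or 'ambiguous' for an HTTP status code."""
    if 200 <= status < 300:
        return "success"
    if 500 <= status < 600:
        return "error"
    if status in CLIENT_CAUSED_SUCCESS:
        return "success"
    # Everything else is left undecided in this sketch.
    return "ambiguous"


def timing_metric(server, method, status):
    """Return the timing metric name for a response, or None if ambiguous."""
    bucket = classify(status)
    if bucket == "success":
        return "%s.%s.timing" % (server, method)
    if bucket == "error":
        return "%s.%s.errors.timing" % (server, method)
    return None


print(timing_metric("object-server", "GET", 404))
print(timing_metric("object-server", "GET", 507))
```

With this rule a 404 lands in the plain timing metric, which matches what 
the original question observed.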

However, there are other status codes that are more ambiguous. Consider 
409; the object server responds with 409 if the request's X-Timestamp is 
less than the object's X-Timestamp (on PUT/POST/DELETE). You can get 
this with two near-simultaneous POSTs:

   1. request A hits proxy; proxy assigns X-Timestamp: 1406316223.851131
   2. request B hits proxy; proxy assigns X-Timestamp: 1406316223.851132
   3. request B hits object server and gets 202
   4. request A hits object server and gets 409

Does that error count as Swift's fault? If the client requests were 
nearly simultaneous, then I think not; there's always going to be *some* 
delay between accept() and gettimeofday(). On the other hand, if one 
proxy server's time is significantly behind another's, then it is 
Swift's fault.
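
The four-step sequence above can be replayed with a toy model. This is 
only a sketch under simplifying assumptions: FakeObjectServer is a 
made-up stand-in that remembers a single timestamp, the starting timestamp 
is invented, and the "less than" comparison is taken directly from the 
description above rather than from Swift's source:

```python
from decimal import Decimal


class FakeObjectServer:
    """Hypothetical stand-in for an object server holding one object."""

    def __init__(self, timestamp):
        self.timestamp = Decimal(timestamp)

    def post(self, x_timestamp):
        """Apply a POST; return 409 if it is older than the stored object."""
        x_timestamp = Decimal(x_timestamp)
        if x_timestamp < self.timestamp:
            return 409
        self.timestamp = x_timestamp
        return 202


# The object already exists with some earlier (invented) timestamp.
server = FakeObjectServer("1406316000.000000")

ts_a = "1406316223.851131"  # step 1: proxy stamps request A
ts_b = "1406316223.851132"  # step 2: proxy stamps request B

status_b = server.post(ts_b)  # step 3: B arrives first
status_a = server.post(ts_a)  # step 4: A arrives second, now stale

print("B:", status_b, "A:", status_a)
```

Request A loses even though it reached the proxy first, which is why it 
is hard to call the resulting 409 either a client error or a Swift error.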

It's even worse with 400; sometimes it's for bad paths (like asking an 
object server for /<partition>/<account>/<container>; this can happen if 
the administrator misconfigures their rings), and sometimes it's for bad 
X-Delete-At / X-Delete-After values (which are set by the client).

I'm not sure what the best way to fix this is, but if you just want to 
see some error metrics, unmount a disk to get some 507s.
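
If you want to pick the error timings out of a captured statsd stream, a 
small filter along these lines may help. It is a sketch that assumes the 
standard statsd "name:value|type" line format; the sample lines and their 
values are invented for illustration, and a real deployment may add a 
configured prefix to the metric names:

```python
def parse_statsd(line):
    """Split a statsd line 'name:value|type[|@rate]' into (name, value, type)."""
    name, _, rest = line.partition(":")
    fields = rest.split("|")
    return name, float(fields[0]), fields[1]


def is_error_timing(name):
    """True for metric names in the '<server>.<METHOD>.errors.timing' form."""
    return name.endswith(".errors.timing")


# Invented sample lines, not captured from a real cluster.
samples = [
    "object-server.GET.timing:3.2|ms",
    "object-server.GET.errors.timing:41.7|ms",
    "object-server.PUT.timing:12.0|ms",
]

parsed = [parse_statsd(line) for line in samples]
errors = [entry for entry in parsed if is_error_timing(entry[0])]

for name, value, kind in errors:
    print(name, value, kind)
```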


