[openstack-dev] [swift] Providing a potentially more open interface to statsd statistics

Seger, Mark (Cloud Services) mark.seger at hp.com
Tue Jun 24 18:41:05 UTC 2014


I've lamented for a while that although swift/statsd provides a wealth of information, it's in a somewhat difficult-to-use format.  Specifically, you have to connect to a socket and listen for messages, and while you're listening, nobody else can.  I do realize there is a mechanism to send the data to graphite, but what if I'm not a graphite user, or want to look at the data at a finer granularity than is being sent to graphite?

What I've put together, and would love to get some feedback on, is a tool I'm calling 'statsdtee'.  The name comes from the fact that you can configure statsd to send to the port statsdtee listens on (configurable, of course); statsdtee will then process the data locally AND tee it out another socket, making it possible to forward the data on to graphite while still allowing local processing.
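To make the tee idea concrete, here's a minimal sketch of what I mean, in Python.  The port numbers, the forwarding address and process_metric() are placeholders for illustration, not the actual statsdtee implementation:

import socket

LISTEN_ADDR = ('0.0.0.0', 8125)                # where the swift servers send statsd packets (assumed port)
FORWARD_ADDR = ('graphite.example.com', 8125)  # hypothetical downstream statsd/graphite target

def process_metric(packet):
    # placeholder: update the local rolling counters here
    pass

def main():
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(LISTEN_ADDR)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        packet, _ = rx.recvfrom(65535)
        process_metric(packet)            # local processing
        tx.sendto(packet, FORWARD_ADDR)   # tee the raw packet on, unchanged

if __name__ == '__main__':
    main()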

Local processing consists of calculating rolling counters and writing them to a file that looks much like most /proc entries, such as this:

$cat /tmp/statsdtee
V1.0 1403633349.159516
accaudt 0 0 0
accreap 0 0 0 0 0 0 0 0 0
accrepl 0 0 2100 0 0 0 1391 682 0 2100
accsrvr 1 0 0 0 0 2072 0
conaudt 0 0 0
conrepl 0 0 2892 0 0 0 1997 1107 0 2892
consrvr 2700 0 0 1 1 992 0
consync 541036 0 11 0 0
conupdt 0 17 17889
objaudt 0 0
objexpr 0 0
objrepl 0 0 0 0
objsrvr 117190 16325 0 43068 9 996 5 0 6904
objupdt 0 0 0 1704 0

In this format we're looking at data for the account, container and object services; there is a similar one for the proxy.  The reason for the name on each line is that what to report is configurable in a conf file down to the granularity of a single line, thereby making it possible to report less information, though I'm not sure whether one would really want to do that.

To make this mechanism really simple and avoid using internal timers, I simply look at the time of each record, and every time the value of the second changes I write out the current counters.  I could change it to every 10th of a second but am thinking that really isn't necessary.  I could also drive it off a timer interrupt, but again I'm not sure that would really buy you anything.
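Roughly, the flush logic looks like the sketch below.  The arrival time stands in for whatever timestamp the real tool uses, and write_counters() and the counters dict are assumptions for illustration:

import time

counters = {}        # e.g. {'objsrvr.get': 117190, ...}
last_second = None

def write_counters(path='/tmp/statsdtee'):
    # placeholder: dump the rolling counters in the /proc-like format shown above
    pass

def handle_packet(packet):
    global last_second
    second = int(time.time())
    if last_second is not None and second != last_second:
        write_counters()              # flush only when the second rolls over
    last_second = second
    # ... update counters from the packet here ...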

My peeve with /proc is that you never know what each field means, so there is a second format in which headers are included; it looks like this:

$ cat /tmp/statsdtee
V1.0 1403633339.410722
#       errs pass fail
accaudt 0 0 0
#       errs cfail cdel cremain cposs_remain ofail odel oremain oposs_remain
accreap 0 0 0 0 0 0 0 0 0
#       diff diff_cap nochg hasmat rsync rem_merge attmpt fail remov succ
accrepl 0 0 2100 0 0 0 1391 682 0 2100
#       put get post del head repl errs
accsrvr 1 0 0 0 0 2069 0
#       errs pass fail
conaudt 0 0 0
#       diff diff_cap nochg hasmat rsync rem_merge attmpt fail remov succ
conrepl 0 0 2793 0 0 0 1934 1083 0 2793
#       put get post del head repl errs
consrvr 2700 0 0 1 1 976 0
#       skip fail sync del put
consync 536193 0 11 0 0
#       succ fail no_chg
conupdt 0 17 17889
#       quar errs
objaudt 0 0
#       obj errs
objexpr 0 0
#       part_del part_upd suff_hashes suff_sync
objrepl 0 0 0 0
#       put get post del head repl errs quar async_pend
objsrvr 117190 16325 0 43068 9 996 5 0 6904
#       errs quar succ fail unlk
objupdt 0 0 0 1704 0

The important thing to remember about rolling counters is that as many people as wish can read them simultaneously, assured that nobody is stepping on anyone else, since the counters never get zeroed!  You simply read a sample, wait a while and read another.  The difference is the change in the counters over that interval, and anyone can use any interval they choose.
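For example, a consumer might do something like this: read the file twice, subtract, and get per-interval deltas.  The field layout follows the output shown above; the parsing details are my own illustrative assumption:

import time

def read_sample(path='/tmp/statsdtee'):
    sample = {}
    with open(path) as f:
        f.readline()                      # skip the "V1.0 <timestamp>" line
        for line in f:
            if line.startswith('#'):
                continue                  # skip header lines if present
            fields = line.split()
            sample[fields[0]] = [int(v) for v in fields[1:]]
    return sample

before = read_sample()
time.sleep(10)                            # pick any interval you like
after = read_sample()

for name, vals in after.items():
    deltas = [b - a for a, b in zip(before.get(name, [0] * len(vals)), vals)]
    print(name, deltas)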

So, how useful do people think this is?  Personally I think it's very useful...

The next step is how to calculate the numbers I'm reporting.  While statsd reports a lot of timing information, none of that really fits this model, as all I want are counts.  So when I see a GET timing record, I count it as 1 GET.  It seems to work so far.  Is this a legitimate thing to be doing?  It feels right, and from the preliminary testing I've been doing it seems pretty accurate.
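In code, the idea is simply to bump a counter by one for each timing sample seen.  The metric name below is an assumption for illustration, not necessarily the exact name Swift emits in every configuration:

counters = {'objsrvr.get': 0}

def count_packet(packet, counters):
    for line in packet.decode('utf-8', 'replace').splitlines():
        name, _, rest = line.partition(':')
        parts = rest.split('|')
        if len(parts) >= 2 and parts[1] == 'ms':
            # one timing sample == one request of that type
            if name == 'object-server.GET.timing':   # hypothetical metric name
                counters['objsrvr.get'] += 1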

One thing I've found missing is more detailed error information.  For example, I can tell how many errors there were, but not how many of each type.  Is this something that can easily be added?  In our environment, when there's an increase in the number of errors on a particular server, knowing the type of error can be quite useful.

While I'm not currently counting everything, such as device-specific data, which would significantly increase the volume of output, I think I have covered quite a lot in my model.

Comments?

-mark