[openstack-dev] [elastic-recheck] Thoughts on next steps
James E. Blair
jeblair at openstack.org
Fri Jan 3 17:09:21 UTC 2014
Sean Dague <sean at dague.net> writes:
> So my feeling is we should move away from the point graphs we have,
> and present these as weekly and daily failure rates (with graphs and
> error bars). And slice those per job. My suggestion is that we do the
> actual visualization with matplotlib because it's super easy to output
> that from pandas data sets.
I am very excited about this and everything above it!
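To make the per-job failure-rate idea concrete, here is a minimal sketch of the arithmetic behind one point on such a graph. It is stdlib-only and rests on assumptions: the `(day, job, failed)` input shape and the `failure_rates` name are hypothetical, not anything e-r provides, and the error bar is a plain binomial standard error. The same grouping could be done with a pandas groupby before plotting.

```python
import math
from collections import defaultdict


def failure_rates(runs):
    """Aggregate job runs into per-day, per-job failure rates.

    runs: iterable of (day, job, failed) tuples (hypothetical input shape).
    Returns {(day, job): (rate, stderr, total_runs)}.
    """
    counts = defaultdict(lambda: [0, 0])  # (day, job) -> [failures, total]
    for day, job, failed in runs:
        if failed:
            counts[(day, job)][0] += 1
        counts[(day, job)][1] += 1

    rates = {}
    for key, (fails, total) in counts.items():
        rate = fails / total
        # Binomial standard error; usable as the half-height of an error bar.
        stderr = math.sqrt(rate * (1 - rate) / total)
        rates[key] = (rate, stderr, total)
    return rates
```

Bucketing by ISO week instead of by day would give the weekly view; only the key passed in as `day` changes.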
> = Take over of /recheck =
> There is still a bunch of useful data coming in on "recheck bug ####"
> data which hasn't been curated into ER queries. I think the right
> thing to do is treat these as a work queue of bugs we should be
> building patterns out of (or completely invalidating). I've got a
> preliminary gerrit bulk query piece of code that does this, which
> would remove the need of the daemon the way that's currently
> happening. The gerrit queries are a little long right now, but I think
> if we are only doing this on hourly cron, the additional load will be
I think this is fine and am all for reducing complexity, but consider
this alternative: over the break, I moved both components of
elastic-recheck onto a new server (status.openstack.org). Since they
are now co-located, the component of e-r that watches the event stream
and posts responses to gerrit could also note recheck actions. You
could stick the data in a file, memcache, trove database, etc, and the
status page could display that "work queue". No extra daemons required.
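As a rough sketch of what "also note recheck actions" might look like, assuming the stream watcher already hands each event over as a decoded dict: the function names, the `tracked` set, and the JSON file layout below are all hypothetical, and the code assumes that comment events carry a `type` of `comment-added` with the text in a `comment` field.

```python
import json
import re

# Matches a line consisting of "recheck bug <number>" in a review comment.
RECHECK_RE = re.compile(r"^\s*recheck bug (\d+)\s*$",
                        re.IGNORECASE | re.MULTILINE)


def note_recheck(event, queue, tracked=frozenset()):
    """Record a recheck bug number in the work queue.

    event:   decoded stream event (field names are assumptions).
    queue:   dict mapping bug number -> times seen, updated in place.
    tracked: bug numbers that already have an ER query, which are skipped.
    Returns the bug number that was recorded, or None.
    """
    if event.get("type") != "comment-added":
        return None
    match = RECHECK_RE.search(event.get("comment", ""))
    if match is None:
        return None
    bug = match.group(1)
    if bug in tracked:
        return None
    queue[bug] = queue.get(bug, 0) + 1
    return bug


def save_queue(queue, path):
    """Write the queue to a JSON file that the status page could read."""
    with open(path, "w") as f:
        json.dump(queue, f, indent=2, sort_keys=True)
```

Calling `save_queue` after each recorded recheck would keep the file current with no cron delay; the file could be swapped for memcache or a database without changing `note_recheck`.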
I think the main user-visible aspect of this decision is the delay
before unprocessed bugs are made visible. If a bug starts affecting a
number of jobs, it might be nice to see what bug numbers people are
using for rechecks without waiting for the next cron run.
On another topic, it's worth mentioning that we now (again, this is new
from over the break) have timeouts _inside_ the devstack-gate jobs that
should hit before the Jenkins timeout, so log collection for
devstack-gate jobs that run long and hit the timeout should still happen
(meaning that e-r can now see these failures).
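The general shape of that arrangement, as an illustrative sketch only (devstack-gate's real implementation is not shown here, and the function and parameter names are hypothetical): the inner timeout is set shorter than the outer one so that some time is reserved for collecting logs.

```python
import subprocess


def run_with_inner_timeout(cmd, outer_timeout, reserve):
    """Run cmd with a timeout that fires before the outer (CI) timeout.

    outer_timeout: seconds the outer system allows for the whole job.
    reserve:       seconds held back for log collection afterwards.
    Returns (returncode, timed_out); returncode is None on timeout.
    """
    inner = outer_timeout - reserve
    if inner <= 0:
        raise ValueError("reserve must be smaller than outer_timeout")
    try:
        result = subprocess.run(cmd, timeout=inner)
        return result.returncode, False
    except subprocess.TimeoutExpired:
        # The child has been killed; the caller still has `reserve`
        # seconds left to gather logs before the outer timeout hits.
        return None, True
```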
Thanks for all your work on this. I think it's extremely useful and