[openstack-dev] [elastic-recheck] Thoughts on next steps

Sean Dague sean at dague.net
Sat Jan 4 13:15:03 UTC 2014


On 01/03/2014 12:09 PM, James E. Blair wrote:
> Sean Dague <sean at dague.net> writes:
>
>> So my feeling is we should move away from the point graphs we have,
>> and present these as weekly and daily failure rates (with graphs and
>> error bars). And slice those per job. My suggestion is that we do the
>> actual visualization with matplotlib because it's super easy to output
>> that from pandas data sets.
>
> I am very excited about this and everything above it!
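(Inline aside: a rough sketch of what that pandas + matplotlib output could
look like. Illustrative only; the DataFrame layout below, with a datetime
index and 'job' / 'failed' columns, is an assumption, not the real schema.)

    import matplotlib.pyplot as plt
    import numpy as np

    def plot_failure_rates(runs, freq='D'):
        # runs: DataFrame indexed by build timestamp, with a 'job' column
        # (job name) and a 'failed' column (1 for failure, 0 for success).
        # freq='D' gives daily buckets, 'W' weekly.
        fig, ax = plt.subplots()
        for job, group in runs.groupby('job'):
            counts = group['failed'].resample(freq).count()
            failures = group['failed'].resample(freq).sum()
            rate = failures / counts
            # simple binomial standard error as the error bar
            err = np.sqrt(rate * (1 - rate) / counts)
            ax.errorbar(rate.index, rate.values, yerr=err.values, label=job)
        ax.set_ylabel('failure rate (%s buckets)' % freq)
        ax.legend()
        return fig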
>
>> = Take over of /recheck =
>>
>> There is still a bunch of useful data coming in on "recheck bug ####"
>> data which hasn't been curated into ER queries. I think the right
>> thing to do is treat these as a work queue of bugs we should be
>> building patterns out of (or completely invalidating). I've got a
>> preliminary gerrit bulk query piece of code that does this, which
>> would remove the need for the daemon as it currently runs. The
>> gerrit queries are a little long right now, but I think
>> if we are only doing this on hourly cron, the additional load will be
>> negligible.
>
> I think this is fine and am all for reducing complexity, but consider
> this alternative: over the break, I moved both components of
> elastic-recheck onto a new server (status.openstack.org).  Since they
> are now co-located, you could have the component of e-r that watches the
> stream to provide responses to gerrit also note recheck actions.  You
> could stick the data in a file, memcache, trove database, etc, and the
> status page could display that "work queue".  No extra daemons required.

So I've got the bulk query written, which means we could have this by
the end of next week with the approach I've got. I think that handling
the rest of this is an optimization.
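
For the curious, the shape of that bulk query is roughly the following (a
sketch, not the actual patch; the host, port, and query string here are
assumptions for illustration):

    import json
    import re
    import subprocess

    RECHECK_RE = re.compile(r'recheck bug (\d+)', re.IGNORECASE)

    def uncategorized_recheck_bugs():
        # pull recent changes plus their comments over the gerrit ssh
        # interface, then regex out "recheck bug NNNN" comments
        cmd = ['ssh', '-p', '29418', 'review.openstack.org',
               'gerrit', 'query', '--format=JSON', '--comments',
               'status:open']
        out = subprocess.check_output(cmd).decode('utf-8')
        bugs = {}
        for line in out.splitlines():
            if not line.strip():
                continue
            change = json.loads(line)
            # the last line of output is a stats record with no comments
            for comment in change.get('comments', []):
                for bug in RECHECK_RE.findall(comment.get('message', '')):
                    bugs.setdefault(bug, []).append(change.get('url'))
        return bugs

That would run out of an hourly cron and feed the "work queue" of bugs
that still need ER queries written (or need to be invalidated).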

> I think the main user-visible aspect of this decision is the delay
> before unprocessed bugs are made visible.  If a bug starts affecting a
> number of jobs, it might be nice to see what bug numbers people are
> using for rechecks without waiting for the next cron run.

So my experience is that most rechecks happen more than an hour after a
patch fails. And the people who are sitting on patches for bugs that
have never been seen before find their way to IRC.

The current state of the world is not all roses and unicorns. The
recheck daemon has died, and its death has gone unnoticed for *weeks*.
So a guarantee that we are at most an hour delayed would actually be
better, on average, than the delays we've seen over the last six months
of following the event stream.

And again, we can optimize that over time.

I also think that caching should probably happen in gerritlib itself.
There is a concern that too many things are hitting gerrit, and the
result is that everyone is implementing their own client-side caching
to try to be nice (like the pickles in Russell's review stats
programs). That seems like the wrong place to be doing it.
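
To make that concrete, something along these lines done once in gerritlib
would save every consumer from growing its own pickle cache (purely a
sketch; gerritlib has no cache like this today, and the function and file
names are made up):

    import os
    import pickle
    import time

    def cached_query(query_fn, query,
                     cache_path='/tmp/gerrit-query.cache', max_age=3600):
        # return cached results for this query if they are younger than
        # max_age seconds, otherwise run the real query and refresh the cache
        cache = {}
        if os.path.exists(cache_path):
            with open(cache_path, 'rb') as f:
                cache = pickle.load(f)
        entry = cache.get(query)
        if entry and time.time() - entry['when'] < max_age:
            return entry['results']
        results = query_fn(query)
        cache[query] = {'when': time.time(), 'results': results}
        with open(cache_path, 'wb') as f:
            pickle.dump(cache, f)
        return results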

But part of the reason for this email was to sort exactly these kinds
of issues out, so let me know if you think the caching issue is an
architectural blocker.

Because if we're generally agreed on the architecture going forward and
are just reviewing for correctness, the code can move fast, and we can
actually have ER 1.0 by the end of the month. Architecture review in
gerrit is where we grind to a halt.

	-Sean

-- 
Sean Dague
Samsung Research America
sean at dague.net / sean.dague at samsung.com
http://dague.net


