[openstack-dev] [elastic-recheck] Thoughts on next steps
James E. Blair
jeblair at openstack.org
Sun Jan 5 22:49:24 UTC 2014
Sean Dague <sean at dague.net> writes:
>> I think the main user-visible aspect of this decision is the delay
>> before unprocessed bugs are made visible. If a bug starts affecting a
>> number of jobs, it might be nice to see what bug numbers people are
>> using for rechecks without waiting for the next cron run.
> So my experience is that most rechecks happen > 1 hr after a patch
> fails. And the people who are sitting on patches for bugs that have
> never been seen before find their way to IRC.
> The current state of the world is not all roses and unicorns. The
> recheck daemon has died and gone unnoticed for *weeks*. So a
> guarantee that we are only 1 hr delayed would actually be better, on
> average, than the delays we've seen over the last six months of
> following the event stream.
I wasn't suggesting that we keep the recheck daemon, I was suggesting
moving the real-time observation of rechecks into the elastic-recheck
daemon which will remain an important component of this system for the
foreseeable future. It is fairly reliable, and if it does die, we will
desperately want to get it running again and fix the underlying problem,
because it is so helpful.
> I also think that caching should probably actually happen in gerritlib
> itself. There is a concern that too many things are hitting gerrit,
> and the result is that everyone is implementing their own client side
> caching to try to be nice. (like the pickles in Russell's review stats
> programs). This seems like the wrong place to be doing it.
That's not a bad idea, however it doesn't really address the fact that
you're looking for events -- you need to run a very large bulk query to
find all of the reviews over a certain amount of time. You could reduce
this by caching results and then only querying reviews that are newer
than the last update. But even so, you'll always have to query for that
window. That's not as bad as querying for the same two weeks of data
every X minutes, but since there's already a daemon watching all of the
events anyway in real time, you already have the information if you just
don't discard it.
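To make that concrete, here is a rough sketch (my own illustration, not
actual elastic-recheck code; field names are assumptions) of what
retaining the stream events in the daemon could look like, so the
dashboard reads from memory instead of re-running a bulk query:

```python
import collections
import time


class EventRetainer:
    """Keep recent Gerrit stream events in memory instead of discarding
    them, so a dashboard can read them without a bulk query.
    Illustrative sketch only."""

    def __init__(self, window_seconds=14 * 24 * 3600):
        self.window = window_seconds
        self.events = collections.deque()

    def on_event(self, event, now=None):
        """Record one event dict; assumed to carry a 'ts' timestamp."""
        now = time.time() if now is None else now
        self.events.append(event)
        # Drop anything older than the retention window.
        while self.events and self.events[0]['ts'] < now - self.window:
            self.events.popleft()

    def since(self, ts):
        """Events newer than ts, e.g. for an incremental dashboard poll."""
        return [e for e in self.events if e['ts'] > ts]
```

The point is just that the daemon already sees every event; keeping a
bounded window of them costs almost nothing compared to re-querying.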
> But, part of the reason for this email was to sort these sorts of
> issues out, so let me know if you think the caching issue is an
> architectural blocker.
> Because if we're generally agreed on the architecture forward and are
> just reviewing for correctness, the code can move fast, and we can
> actually have ER 1.0 by the end of the month. Architecture review in
> gerrit is where we grind to a halt.
It looks like the bulk queries take about 4 full minutes of Gerrit CPU
time to fetch data from the last two weeks (and the last two weeks have
been quiet; I'd expect the next two weeks to take longer). I don't
think it's going to kill us, but I think there are some really easy ways
to make this way more efficient, which isn't just about being nice to
Gerrit, but is also about being responsive for users.
My first preference is still to use the real-time data that the e-r
daemon collects already and feed it to the dashboard.
If you feel like the inter-process communication needed for that will
slow you down too much, then my second preference would be to introduce
local caching of the results so that you can query for
"-age:<query-interval>" instead of the full two weeks every time. (And
if it's generalized enough, sure let's add that to gerritlib.)
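As a rough illustration of that second option (the class, the stub query
function, and the field names here are my own assumptions, not gerritlib
API), the cache only needs to remember when it last refreshed and then
query the idle window since then:

```python
import time


def run_query(query):
    """Placeholder for a real Gerrit query, e.g. via
    `ssh <host> gerrit query --format=JSON <query>`. Hypothetical stub."""
    raise NotImplementedError


class ReviewCache:
    """Cache review results locally and re-query only the window since
    the last update, instead of the full two weeks every time.
    Illustrative sketch only."""

    def __init__(self, base_query, query_fn=run_query):
        self.base_query = base_query
        self.query_fn = query_fn
        self.reviews = {}      # change number -> latest review dict
        self.last_update = None

    def refresh(self, now=None):
        now = time.time() if now is None else now
        if self.last_update is None:
            # First run: fetch the full two-week window.
            query = self.base_query + ' -age:2w'
        else:
            # Subsequent runs: only changes touched since the last
            # refresh, with a small overlap margin for safety.
            idle = int(now - self.last_update) + 60
            query = self.base_query + ' -age:%ds' % idle
        for review in self.query_fn(query):
            self.reviews[review['number']] = review
        self.last_update = now
        return self.reviews
```

If something like this lived in gerritlib, every consumer would get the
client-side niceness for free instead of reinventing it.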
I really think we at least ought to do one of those. Running the same
bulk query repeatedly is, in this case, so inefficient that I think this
little bit of optimization is not premature.
Thanks again for working on this. I really appreciate it and the time
you're spending on architecture.