[openstack-dev] [elastic-recheck] Thoughts on next steps

Sean Dague sean at dague.net
Mon Jan 6 00:16:24 UTC 2014


On 01/05/2014 05:49 PM, James E. Blair wrote:
> Sean Dague <sean at dague.net> writes:
>
>>> I think the main user-visible aspect of this decision is the delay
>>> before unprocessed bugs are made visible.  If a bug starts affecting a
>>> number of jobs, it might be nice to see what bug numbers people are
>>> using for rechecks without waiting for the next cron run.
>>
>> So my experience is that most rechecks happen > 1 hr after a patch
>> fails. And the people that are sitting on patches for bugs that have
>> never been seen before find their way to IRC.
>>
>> The current state of the world is not all roses and unicorns. The
>> recheck daemon has died, and not been noticed that it was dead for
>> *weeks*. So a guarantee that we are only 1 hr delayed would actually
>> be on average better than the delays we've seen over the last six
>> months of following the event stream.
>
> I wasn't suggesting that we keep the recheck daemon, I was suggesting
> moving the real-time observation of rechecks into the elastic-recheck
> daemon which will remain an important component of this system for the
> foreseeable future.  It is fairly reliable and if it does die, we will
> desperately want get it running again and fix the underlying problem
> because it is so helpful.

That's a possible place to put it. The daemon is a bit of a mess at the 
moment, so I was hoping to not refactor it until the end of the month as 
part of the cleaning up to handle the additional jobs.

>> I also think that caching should probably actually happen in gerritlib
>> itself. There is a concern that too many things are hitting gerrit,
>> and the result is that everyone is implementing their own client side
>> caching to try to be nice. (like the pickles in Russell's review stats
>> programs). This seems like the wrong place to do be doing it.
>
> That's not a bad idea, however it doesn't really address the fact that
> you're looking for events -- you need to run a very large bulk query to
> find all of the reviews over a certain amount of time.  You could reduce
> this by caching results and then only querying reviews that are newer
> than the last update.  But even so, you'll always have to query for that
> window.  That's not as bad as querying for the same two weeks of data
> every X minutes, but since there's already a daemon watching all of the
> events anyway in real time, you already have the information if you just
> don't discard it.

I don't really want to trust us not failing, because we do. So we're 
going to need replay ability anyway.

>> But, part of the reason for this email was to sort these sorts of
>> issues out, so let me know if you think the caching issue is an
>> architectural blocker.
>>
>> Because if we're generally agreed on the architecture forward and are
>> just reviewing for correctness, the code can move fast, and we can
>> actually have ER 1.0 by the end of the month. Architecture review in
>> gerrit is where we grind to a halt.
>
> It looks like the bulk queries take about 4 full minutes of Gerrit CPU
> time to fetch data from the last two weeks (and the last two weeks have
> been quiet; I'd expect the next two weeks to take longer).  I don't
> think it's going to kill us, but I think there are some really easy ways
> to make this way more efficient, which isn't just about being nice to
> Gerrit, but is also about being responsive for users.

Interesting, I thought this was more like 1 minute. 4 definitely gets a 
bit wonkier.

> My first preference is still to use the real-time data that the e-r
> daemon collects already and feed it to the dashboard.
>
> If you feel like the inter-process communication needed for that will
> slow you down too much, then my second preference would be to introduce
> local caching of the results so that you can query for
> "-age:<query-interval>" instead of the full two weeks every time.  (And
> if it's generalized enough, sure let's add that to gerritlib.)

Yeh, the biggest complexity is the result merge. I was finding that 
-age:4h still ended up return nearly 20% of the entire dataset, and 
wasn't as much quicker as you'd expect.

But the new data and the old data are overlapping a lot, because you can 
only query by time on the review, not on the comments. And those are 
leaves in funny ways.

I think the right way to do that would be build on top of pandas data 
series merge functionality. All good things, just new building blocks we 
don't have yet.

> I really think we at least ought to do one of those.  Running the same
> bulk query repeatedly is, in this case, so inefficient that I think this
> little bit of optimization is not premature.

Sure, I wonder how the various other review stats tools are handling 
this case. Putting Russell and Ilya (Stackalytics) into the mix. Because 
it seems like we should have a common solution here for all the tools 
hitting gerrit on cron for largely the same info.

	-Sean

-- 
Sean Dague
Samsung Research America
sean at dague.net / sean.dague at samsung.com
http://dague.net



More information about the OpenStack-dev mailing list