[openstack-dev] [elastic-recheck] Thoughts on next steps
Sean Dague
sean at dague.net
Mon Jan 6 00:16:24 UTC 2014
On 01/05/2014 05:49 PM, James E. Blair wrote:
> Sean Dague <sean at dague.net> writes:
>
>>> I think the main user-visible aspect of this decision is the delay
>>> before unprocessed bugs are made visible. If a bug starts affecting a
>>> number of jobs, it might be nice to see what bug numbers people are
>>> using for rechecks without waiting for the next cron run.
>>
>> So my experience is that most rechecks happen > 1 hr after a patch
>> fails. And the people that are sitting on patches for bugs that have
>> never been seen before find their way to IRC.
>>
>> The current state of the world is not all roses and unicorns. The
>> recheck daemon has died before, and its death went unnoticed for
>> *weeks*. So a guarantee that we are only 1 hr delayed would actually
>> be better, on average, than the delays we've seen over the last six
>> months of following the event stream.
>
> I wasn't suggesting that we keep the recheck daemon, I was suggesting
> moving the real-time observation of rechecks into the elastic-recheck
> daemon, which will remain an important component of this system for the
> foreseeable future. It is fairly reliable, and if it does die, we will
> desperately want to get it running again and fix the underlying problem
> because it is so helpful.
That's a possible place to put it. The daemon is a bit of a mess at the
moment, so I was hoping to not refactor it until the end of the month as
part of the cleaning up to handle the additional jobs.
>> I also think that caching should probably actually happen in gerritlib
>> itself. There is a concern that too many things are hitting gerrit,
>> and the result is that everyone is implementing their own client-side
>> caching to try to be nice (like the pickles in Russell's review stats
>> programs). Each client seems like the wrong place to be doing this.
>
> That's not a bad idea, however it doesn't really address the fact that
> you're looking for events -- you need to run a very large bulk query to
> find all of the reviews over a certain amount of time. You could reduce
> this by caching results and then only querying reviews that are newer
> than the last update. But even so, you'll always have to query for that
> window. That's not as bad as querying for the same two weeks of data
> every X minutes, but since there's already a daemon watching all of the
> events anyway in real time, you already have the information if you just
> don't discard it.
I don't really want to trust in us never failing, because we do fail. So
we're going to need replay ability anyway.
>> But, part of the reason for this email was to sort these sorts of
>> issues out, so let me know if you think the caching issue is an
>> architectural blocker.
>>
>> Because if we're generally agreed on the architecture forward and are
>> just reviewing for correctness, the code can move fast, and we can
>> actually have ER 1.0 by the end of the month. Architecture review in
>> gerrit is where we grind to a halt.
>
> It looks like the bulk queries take about 4 full minutes of Gerrit CPU
> time to fetch data from the last two weeks (and the last two weeks have
> been quiet; I'd expect the next two weeks to take longer). I don't
> think it's going to kill us, but I think there are some really easy ways
> to make this way more efficient, which isn't just about being nice to
> Gerrit, but is also about being responsive for users.
Interesting, I thought this was more like 1 minute. 4 definitely gets a
bit wonkier.
> My first preference is still to use the real-time data that the e-r
> daemon collects already and feed it to the dashboard.
>
> If you feel like the inter-process communication needed for that will
> slow you down too much, then my second preference would be to introduce
> local caching of the results so that you can query for
> "-age:<query-interval>" instead of the full two weeks every time. (And
> if it's generalized enough, sure let's add that to gerritlib.)
Yeah, the biggest complexity is the result merge. I was finding that
-age:4h still ended up returning nearly 20% of the entire dataset, and
wasn't as much faster as you'd expect.
But the new data and the old data overlap a lot, because you can only
query by time on the review, not on the comments. And those interleave
in funny ways.
I think the right way to do that would be to build on top of pandas data
series merge functionality. All good things, just new building blocks we
don't have yet.
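To make the merge problem concrete, here is a hedged sketch of what
merging a cached window with a fresh -age: query might look like using
pandas. The column names (review, comment_ts, message) are hypothetical
placeholders, not the actual elastic-recheck schema; the point is just
that a (review, comment timestamp) pair can act as a natural key for
deduplicating the overlapping comments:

```python
# Sketch only: assumes each comment row carries a review number and a
# comment timestamp; column names are made up for illustration.
import pandas as pd

# Results we already have cached from previous queries.
cached = pd.DataFrame({
    "review": [101, 101, 102],
    "comment_ts": ["2014-01-01T10:00", "2014-01-02T11:00", "2014-01-02T12:00"],
    "message": ["recheck bug 123", "recheck bug 456", "recheck no bug"],
})

# A fresh -age:4h style query: it re-returns the comment on review 101
# because the query window can only be bounded by review time, not
# comment time, so new and old data overlap.
fresh = pd.DataFrame({
    "review": [101, 103],
    "comment_ts": ["2014-01-02T11:00", "2014-01-03T09:00"],
    "message": ["recheck bug 456", "recheck bug 789"],
})

# Concatenate, then drop rows already present in the cache; the
# (review, comment_ts) pair deduplicates the overlapping comments.
merged = (pd.concat([cached, fresh])
            .drop_duplicates(subset=["review", "comment_ts"])
            .sort_values("comment_ts")
            .reset_index(drop=True))
```

After the merge, `merged` holds four unique comments across three
reviews, with the overlapping comment on review 101 counted once.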
> I really think we at least ought to do one of those. Running the same
> bulk query repeatedly is, in this case, so inefficient that I think this
> little bit of optimization is not premature.
Sure, I wonder how the various other review stats tools are handling
this case. Putting Russell and Ilya (Stackalytics) into the mix. Because
it seems like we should have a common solution here for all the tools
hitting gerrit on cron for largely the same info.
-Sean
--
Sean Dague
Samsung Research America
sean at dague.net / sean.dague at samsung.com
http://dague.net