[openstack-dev] [elastic-recheck] Thoughts on next steps
Sean Dague
sean at dague.net
Tue Jan 7 02:52:31 UTC 2014
On 01/06/2014 07:04 PM, Joe Gordon wrote:
> Overall this looks really good, and very spot on.
>
>
> On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague <sean at dague.net> wrote:
>
> A lot of elastic recheck this fall has been based on the ad hoc
> needs of the moment, in between diving down into the race bugs that
> were uncovered by it. This week away from it all helped provide a
> little perspective on what I think we need to do to call it *done*
> (i.e. something akin to a 1.0 even though we are CDing it).
>
> Here is my current thinking on the next major things that should
> happen. Opinions welcomed.
>
> (These are roughly in implementation order based on urgency)
>
> = Split of web UI =
>
> The elastic recheck page is becoming a mishmash of what was needed at
> the time. I think what we really have emerging is:
> * Overall Gate Health
> * Known (to ER) Bugs
> * Unknown (to ER) Bugs - more below
>
> I think the landing page should be Known Bugs, as that's where we
> want both bug hunters to go to prioritize things, as well as where
> people looking for known bugs should start.
>
> I think the overall Gate Health graphs should move to the zuul
> status page. Possibly as part of the collection of graphs at the bottom.
>
> We should have a secondary page (maybe column?) of the
> un-fingerprinted recheck bugs, largely to use as candidates for
> fingerprinting. This will let us eventually take over /recheck.
>
>
> I think it would be cool to collect the list of unclassified failures
> (not by recheck bug), so we can see how many (and what percentage) need
> to be classified. This isn't gate health but more of e-r health or
> something like that.
Agreed. I've got the percentage in check_success today, but I agree that
every failed gate job that we don't have a fingerprint for should be
listed somewhere we can work through them.
>
> = Data Analysis / Graphs =
>
> I spent a bunch of time playing with pandas over break
> (http://dague.net/2013/12/30/ipython-notebook-experiments/), it's
> kind of awesome. It also made me rethink our approach to handling
> the data.
>
> I think the rolling average approach we were taking is more precise
> than accurate. As these are statistical events they really need
> error bars. When we have a quiet night and 1 job fails at
> 6am, the 100% failure rate it reflects in grenade needs to be
> qualified as 1 of 1, not 50 of 50.
>
>
> So my feeling is we should move away from the point graphs we have,
> and present these as weekly and daily failure rates (with graphs and
> error bars). And slice those per job. My suggestion is that we do
> the actual visualization with matplotlib because it's super easy to
> output that from pandas data sets.
>
>
> The one thing that the current graph does, that weekly and daily failure
> rates don't show, is a sudden spike in one of the lines. If you stare
> at the current graphs for long enough and can read through the noise,
> you can see when the gate collectively crashes or if just the neutron
> related gates start failing. So I think one more graph is needed.
The point of the visualizations is to make sense to people that don't
understand all the data, especially core members of various teams that
are trying to figure out "if I attack 1 bug right now, what's the
biggest bang for my buck."
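
To make the error bar point concrete, here is a minimal sketch of how a
failure rate could carry its sample size with it. It uses a Wilson score
interval, which is my choice for illustration and not something
elastic-recheck does today; the function name and the 95% z value are
assumptions.

```python
import math


def wilson_interval(failures, total, z=1.96):
    """Return (low, high) Wilson score bounds for a failure rate.

    z=1.96 corresponds to roughly a 95% interval (an assumed default).
    """
    if total == 0:
        # No runs at all: the rate could be anything.
        return (0.0, 1.0)
    p = failures / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Both cases show a 100% failure rate, but the bounds differ a lot.
for failures, total in [(1, 1), (50, 50)]:
    low, high = wilson_interval(failures, total)
    print("%d of %d failed: plausible rate %.0f%% to %.0f%%"
          % (failures, total, low * 100, high * 100))
```

With 1 of 1 the lower bound is far below 100%, while 50 of 50 stays
close to it, which is the distinction a bare rolling average hides.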
> Basically we'll be mining Elasticsearch -> Pandas TimeSeries ->
> transforms and analysis -> output tables and graphs. This is
> different enough from our current jquery graphing that I want to get
> ACKs before doing a bunch of work here and finding out people don't
> like it in reviews.
>
> Also in this process upgrade the metadata that we provide for each
> of those bugs so it's a little more clear what you are looking at.
>
>
> For example?
We should always be listing the bug title, not just the number. We
should also list what projects it's filed against. I've stared at these
bugs as much as anyone, and I still need to click through the top 4 to
figure out which one is the ssh bug. :)
> = Take over of /recheck =
>
> There is still a bunch of useful data coming in on "recheck bug
> ####" data which hasn't been curated into ER queries. I think the
> right thing to do is treat these as a work queue of bugs we should
> be building patterns out of (or completely invalidating). I've got a
> preliminary gerrit bulk query piece of code that does this, which
> would remove the need for the daemon that currently handles it.
> The gerrit queries are a little long right now, but I
> think if we are only doing this on hourly cron, the additional load
> will be negligible.
>
> This would get us into a single view, which I think would be more
> informative than the one we currently have.
>
>
> treating /recheck as a work queue sounds great, but this needs a bit
> more fleshing out I think.
>
> I imagine the workflow as something like this:
>
> * State 1: Patch author files bug saying 'gate broke, I didn't do it and
> don't know why it broke'.
> * State 2: Someone investigates the bug and determines if bug is valid
> and if it's a duplicate or not. Root cause still isn't known.
> * State 3: Someone writes a fingerprint for this bug and commits it to
> elastic-recheck.
>
> Assuming we agree on this general workflow, it would be nice if /recheck
> distinguished between bugs in states 1 and 2, and there is no need to
> list bugs in state 3 as e-r bot will automatically tell a developer when
> he hits it.
Sure, that means a policy on something in the bugs that can distinguish
between them. I assume LP states.
State 1 = new & invalid?
State 2 = confirmed / triaged?
I think we can call that post 1.0 though, as we'll be adding details
beyond anything we have today.
> = Categorize all the jobs =
>
> We need a bit of refactoring to let us comment on all the jobs (not
> just tempest ones). Basically we assumed pep8 and docs don't fail in
> the gate at the beginning. Turns out they do, and are good
> indicators of infra / external factor bugs. They are a part of the
> story so we should put them in.
>
>
> Don't forget grenade
Yep. That's part of all. :) I was just calling out the others as
something not originally on the list.
> = Multi Line Fingerprints =
>
> We've definitely found bugs where we never had a really satisfying
> single line match, but we had some great matches if we could do
> multi line.
>
> We could do that in ER, however it will mean giving up logstash as
> our UI, because those queries can't be done in logstash. So in order
> to do this we'll really need to implement some tools - cli minimum,
> which will let us easily test a bug. A custom web UI might be in
> order as well, though that's going to be its own chunk of work,
> that we'll need more volunteers for.
>
> This would put us in a place where we should have all the
> infrastructure to track 90% of the race conditions, and talk about
> them in certainty as 1%, 5%, 0.1% bugs.
>
>
>
> Hurrah. Multi line matches are two separate Elasticsearch queries, where
> you match build_uuids. So to get the set of all hits of a multi line
> fingerprint you find the intersection between line_1 and line_2 where
> the key is build_uuid
Yes. The biggest issue is tooling for making it easy for people to test
their queries. It's pretty unfriendly to tell people to do manual
correlation in ES.
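
As a rough sketch of the correlation step described above: once each
line of a fingerprint has been run as its own query, the matching builds
are the intersection of the build_uuid values. The helper name and the
shape of the hit dictionaries are hypothetical; only the build_uuid key
comes from this thread, and no real Elasticsearch call is made here.

```python
def builds_matching_all(hits_per_line):
    """Return the build_uuids that appear in every per-line hit list.

    hits_per_line is a list with one entry per fingerprint line; each
    entry is a list of hit dicts carrying a "build_uuid" key (assumed
    shape, standing in for real query results).
    """
    uuid_sets = [set(hit["build_uuid"] for hit in hits)
                 for hits in hits_per_line]
    if not uuid_sets:
        return set()
    return set.intersection(*uuid_sets)


# Made-up results for a two-line fingerprint.
line_1_hits = [{"build_uuid": "aaa"}, {"build_uuid": "bbb"}, {"build_uuid": "bbb"}]
line_2_hits = [{"build_uuid": "bbb"}, {"build_uuid": "ccc"}]

matched = builds_matching_all([line_1_hits, line_2_hits])
print(sorted(matched))
```

A cli tool for testing a bug could wrap something like this so nobody
has to do the intersection by hand.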
-Sean
--
Sean Dague
Samsung Research America
sean at dague.net / sean.dague at samsung.com
http://dague.net