[openstack-dev] [elastic-recheck] Thoughts on next steps
Joe Gordon
joe.gordon0 at gmail.com
Tue Jan 7 20:07:46 UTC 2014
Everything sounds good!
On Mon, Jan 6, 2014 at 6:52 PM, Sean Dague <sean at dague.net> wrote:
> On 01/06/2014 07:04 PM, Joe Gordon wrote:
>
>> Overall this looks really good, and very spot on.
>>
>>
>> On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague <sean at dague.net> wrote:
>>
>> A lot of the elastic recheck work this fall has been based on the ad
>> hoc needs of the moment, in between diving down into the race bugs
>> that were uncovered by it. This week away from it all helped provide a
>> little perspective on what I think we need to do to call it *done*
>> (i.e. something akin to a 1.0 even though we are CDing it).
>>
>> Here is my current thinking on the next major things that should
>> happen. Opinions welcomed.
>>
>> (These are roughly in implementation order based on urgency)
>>
>> = Split of web UI =
>>
>> The elastic recheck page is becoming a mishmash of what was needed at
>> the time. I think what we really have emerging is:
>> * Overall Gate Health
>> * Known (to ER) Bugs
>> * Unknown (to ER) Bugs - more below
>>
>> I think the landing page should be Known Bugs, as that's where we
>> want bug hunters to go to prioritize things, as well as where
>> people looking for known bugs should start.
>>
>> I think the overall Gate Health graphs should move to the zuul
>> status page. Possibly as part of the collection of graphs at the
>> bottom.
>>
>> We should have a secondary page (maybe column?) of the
>> un-fingerprinted recheck bugs, largely to use as candidates for
>> fingerprinting. This will let us eventually take over /recheck.
>>
>>
>> I think it would be cool to collect the list of unclassified failures
>> (not by recheck bug), so we can see how many (and what percentage) need
>> to be classified. This isn't gate health but more of e-r health or
>> something like that.
>>
>
> Agreed. I've got the percentage in check_success today, but I agree that
> every gate job failure that we don't have a fingerprint for should be
> listed somewhere so we can work through them.
>
>
>> = Data Analysis / Graphs =
>>
>> I spent a bunch of time playing with pandas over break
>> (http://dague.net/2013/12/30/ipython-notebook-experiments/), it's
>> kind of awesome. It also made me rethink our approach to handling
>> the data.
>>
>> I think the rolling average approach we were taking is more precise
>> than accurate. As these are statistical events, they really need
>> error bars, because when we have a quiet night and 1 job fails at
>> 6am, the 100% failure rate it shows for grenade needs to be
>> qualified as 1 of 1, not 50 of 50.
>>
>>
>> So my feeling is we should move away from the point graphs we have,
>> and present these as weekly and daily failure rates (with graphs and
>> error bars). And slice those per job. My suggestion is that we do
>> the actual visualization with matplotlib because it's super easy to
>> output that from pandas data sets.
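
To make the error-bar point concrete, something like the following would
do it (a sketch only, not elastic-recheck code; the 'runs' DataFrame and
its 'failed' column are made-up names):

    # Sketch: daily failure rate for one job, with binomial error bars.
    # 'runs' is an assumed DataFrame: DatetimeIndex, boolean 'failed'
    # column, one row per run of a single job.
    import numpy as np
    import matplotlib.pyplot as plt

    def plot_daily_failure_rate(runs, job_name):
        failures = runs['failed'].resample('D').sum()
        total = runs['failed'].resample('D').count()

        # Agresti-Coull style adjustment (z ~= 2): a 1-of-1 day gets a
        # wide error bar, a 50-of-50 day gets a narrow one.
        n = total + 4
        p = (failures + 2) / n
        err = 2 * np.sqrt(p * (1 - p) / n)

        plt.errorbar(total.index, failures / total, yerr=err,
                     fmt='o-', label=job_name)
        plt.ylabel('failure rate')
        plt.legend()
        plt.show()
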
>>
>>
>> The one thing the current graphs show, that weekly and daily failure
>> rates don't, is a sudden spike in one of the lines. If you stare
>> at the current graphs for long enough and can read through the noise,
>> you can see when the gate collectively crashes or when just the
>> neutron-related gates start failing. So I think one more graph is needed.
>>
>
> The point of the visualizations is to make sense to people that don't
> understand all the data, especially core members of various teams that are
> trying to figure out "if I attack 1 bug right now, what's the biggest bang
> for my buck."
>
>
Yes, that is one of the big uses for a visualization. The one I had in
mind was being able to see when a new unclassified bug appears.
>
>> Basically we'll be mining Elastic Search -> Pandas TimeSeries ->
>> transforms and analysis -> output tables and graphs. This is
>> different enough from our current jquery graphing that I want to get
>> ACKs before doing a bunch of work here and finding out people don't
>> like it in reviews.
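
For what it's worth, the mining step could look roughly like this
(elasticsearch-py client; the endpoint, index pattern, query string, and
result-size handling are placeholders, not what we'd actually ship):

    # Sketch of the ElasticSearch -> pandas TimeSeries step.
    import pandas as pd
    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://localhost:9200'])  # placeholder endpoint

    def gate_results():
        body = {'query': {'query_string': {
            'query': 'build_queue:"gate" AND filename:"console.html"'}}}
        # real code would page/scroll rather than one bounded search
        result = es.search(index='logstash-*', body=body, size=10000)
        rows = [{'time': hit['_source']['@timestamp'],
                 'job': hit['_source']['build_name'],
                 'failed': hit['_source']['build_status'] == 'FAILURE'}
                for hit in result['hits']['hits']]
        df = pd.DataFrame(rows)
        df['time'] = pd.to_datetime(df['time'])
        return df.set_index('time').sort_index()

From a frame like that, resample('D') / resample('W') plus the error-bar
math above gives the per-job daily and weekly tables.
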
>>
>> Also in this process, upgrade the metadata that we provide for each
>> of those bugs so it's a little clearer what you are looking at.
>>
>>
>> For example?
>>
>
> We should always be listing the bug title, not just the number. We should
> also list what projects it's filed against. I've stared at these bugs as
> much as anyone, and I still need to click through the top 4 to figure out
> which one is the ssh bug. :)
>
>
>> = Take over of /recheck =
>>
>> There is still a bunch of useful data coming in as "recheck bug
>> ####" comments which hasn't been curated into ER queries. I think the
>> right thing to do is treat these as a work queue of bugs we should
>> be building patterns out of (or completely invalidating). I've got a
>> preliminary gerrit bulk query piece of code that does this, which
>> would remove the need for the daemon that currently handles it. The
>> gerrit queries are a little long right now, but I think if we are
>> only doing this on an hourly cron, the additional load will be
>> negligible.
>>
>> This would get us into a single view, which I think would be more
>> informative than the one we currently have.
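
Not the preliminary code you mention, but roughly the shape I'd expect
the hourly cron job to take (the host, query terms, and one-hour window
below are illustrative):

    # Sketch: bulk-mine "recheck bug ####" comments from gerrit.
    import json
    import re
    import subprocess
    from collections import Counter

    RECHECK_RE = re.compile(r'recheck bug (\d+)', re.IGNORECASE)

    def recheck_bug_counts():
        cmd = ['ssh', '-p', '29418', 'review.openstack.org',
               'gerrit', 'query', '--format=JSON', '--comments',
               'project:^openstack/.*', '-age:1h']
        out = subprocess.check_output(cmd).decode('utf-8')

        counts = Counter()
        for line in out.splitlines():
            change = json.loads(line)
            for comment in change.get('comments', []):
                counts.update(RECHECK_RE.findall(comment.get('message', '')))
        # work queue: bug number -> how many times it was rechecked
        return counts
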
>>
>>
>> Treating /recheck as a work queue sounds great, but this needs a bit
>> more fleshing out, I think.
>>
>> I imagine the workflow as something like this:
>>
>> * State 1: Patch author files a bug saying 'gate broke, I didn't do it
>> and don't know why it broke'.
>> * State 2: Someone investigates the bug and determines if it is valid
>> and whether it's a duplicate. The root cause still isn't known.
>> * State 3: Someone writes a fingerprint for this bug and commits it to
>> elastic-recheck.
>>
>> Assuming we agree on this general workflow, it would be nice if /recheck
>> distinguished between bugs in states 1 and 2; there is no need to
>> list bugs in state 3, as the e-r bot will automatically tell a developer
>> when they hit it.
>>
>
> Sure, that means we need a policy on something in the bugs that can
> distinguish between them. I assume LP states.
>
> State 1 = new & invalid?
> State 2 = confirmed / triaged?
>
> I think we can call that post 1.0 though, as we'll be adding details
> beyond anything we have today.
Yup, this sounds like post 1.0 to me too.
>
>
>> = Categorize all the jobs =
>>
>> We need a bit of refactoring to let us comment on all the jobs (not
>> just tempest ones). Basically, at the beginning we assumed pep8 and
>> docs jobs don't fail in the gate. Turns out they do, and they are good
>> indicators of infra / external-factor bugs. They are a part of the
>> story, so we should put them in.
>>
>>
>> Don't forget grenade
>>
>
> Yep. That's part of all. :) I was just calling out the others as something
> not originally on the list.
>
>
>> = Multi Line Fingerprints =
>>
>> We've definitely found bugs where we never had a really satisfying
>> single-line match, but we had some great matches if we could do
>> multi-line.
>>
>> We could do that in ER, however it will mean giving up logstash as
>> our UI, because those queries can't be done in logstash. So in order
>> to do this we'll really need to implement some tools - a CLI at
>> minimum, which will let us easily test a bug. A custom web UI might
>> be in order as well, though that's going to be its own chunk of work,
>> which we'll need more volunteers for.
>>
>> This would put us in a place where we should have all the
>> infrastructure to track 90% of the race conditions, and talk about
>> them with certainty as 1%, 5%, or 0.1% bugs.
>>
>>
>>
>> Hurrah. Multi-line matches are two separate ElasticSearch queries, where
>> you match build_uuids. So to get the set of all hits of a multi-line
>> fingerprint, you find the intersection between line_1 and line_2 where
>> the key is build_uuid.
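
To make that concrete, the two-query intersection is something like this
sketch (elasticsearch-py client; the endpoint and example queries are
placeholders):

    # Sketch of a multi-line fingerprint: one query per line, then
    # intersect the build_uuids.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://localhost:9200'])  # placeholder endpoint

    def build_uuids_for(query_string):
        body = {'query': {'query_string': {'query': query_string}},
                '_source': ['build_uuid']}
        result = es.search(index='logstash-*', body=body, size=10000)
        return {hit['_source']['build_uuid']
                for hit in result['hits']['hits']}

    def multiline_hits(line_1_query, line_2_query):
        # a build matches only if *both* lines show up for its build_uuid
        return build_uuids_for(line_1_query) & build_uuids_for(line_2_query)

A thin CLI wrapper around that would cover the "easily test a bug" case.
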
>>
>
> Yes. The biggest issue is tooling for making it easy for people to test
> their queries. It's pretty unfriendly to tell people to do manual
> correlation in ES.
>
>
> -Sean
>
> --
> Sean Dague
> Samsung Research America
> sean at dague.net / sean.dague at samsung.com
> http://dague.net
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>