[openstack-dev] [elastic-recheck] Thoughts on next steps

Joe Gordon joe.gordon0 at gmail.com
Tue Jan 7 20:07:46 UTC 2014


Everything sounds good!


On Mon, Jan 6, 2014 at 6:52 PM, Sean Dague <sean at dague.net> wrote:

> On 01/06/2014 07:04 PM, Joe Gordon wrote:
>
>> Overall this looks really good, and very spot on.
>>
>>
>> On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague <sean at dague.net> wrote:
>>
>>     A lot of elastic recheck this fall has been based on the ad hoc
>>     needs of the moment, in between diving down into the race bugs that
>>     were uncovered by it. This week away from it all helped provide a
>>     little perspective on what I think we need to do to call it *done*
>>     (i.e. something akin to a 1.0 even though we are CDing it).
>>
>>     Here is my current thinking on the next major things that should
>>     happen. Opinions welcomed.
>>
>>     (These are roughly in implementation order based on urgency)
>>
>>     = Split of web UI =
>>
>>     The elastic recheck page is becoming a mishmash of what was needed at
>>     the time. I think what we really have emerging is:
>>       * Overall Gate Health
>>       * Known (to ER) Bugs
>>       * Unknown (to ER) Bugs - more below
>>
>>     I think the landing page should be Known Bugs, as that's where we
>>     want both bug hunters to go to prioritize things, as well as where
>>     people looking for known bugs should start.
>>
>>     I think the overall Gate Health graphs should move to the zuul
>>     status page, possibly as part of the collection of graphs at the
>>     bottom.
>>
>>     We should have a secondary page (maybe column?) of the
>>     un-fingerprinted recheck bugs, largely to use as candidates for
>>     fingerprinting. This will let us eventually take over /recheck.
>>
>>
>> I think it would be cool to collect the list of unclassified failures
>> (not by recheck bug), so we can see how many (and what percentage) need
>> to be classified. This isn't gate health but more of e-r health or
>> something like that.
>>
>
> Agreed. I've got the percentage in check_success today, but I agree that
> every gate job failure that we don't have a fingerprint for should be
> listed somewhere so we can work through them.
>
>
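
For that e-r health number, all I'm really picturing is some set math on
build_uuids, something like this (untested sketch; the toy sets below stand
in for whatever we already pull out of Elasticsearch):

    def classification_report(failed_uuids, classified_uuids):
        """Report how many failed gate runs have no fingerprint yet."""
        failed = set(failed_uuids)
        unclassified = failed - set(classified_uuids)
        rate = (100.0 * (len(failed) - len(unclassified)) / len(failed)
                if failed else 0.0)
        return unclassified, rate

    # Toy data standing in for the build_uuid sets we'd query out of ES.
    failed = {'uuid-1', 'uuid-2', 'uuid-3'}
    classified = {'uuid-2'}
    unclassified, rate = classification_report(failed, classified)
    print('%.1f%% of gate failures classified, %d still need a fingerprint'
          % (rate, len(unclassified)))
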
>>     = Data Analysis / Graphs =
>>
>>     I spent a bunch of time playing with pandas over break
>>     (http://dague.net/2013/12/30/ipython-notebook-experiments/), it's
>>     kind of awesome. It also made me rethink our approach to handling
>>     the data.
>>
>>     I think the rolling average approach we were taking is more precise
>>     than accurate. As these are statistical events they really need
>>     error bars. Because when we have a quiet night and 1 job fails at
>>     6am, the 100% failure rate it reflects for grenade needs to be
>>     qualified as 1 of 1, not 50 of 50.
>>
>>
>>     So my feeling is we should move away from the point graphs we have,
>>     and present these as weekly and daily failure rates (with graphs and
>>     error bars), sliced per job. My suggestion is that we do the actual
>>     visualization with matplotlib because it's super easy to output that
>>     from pandas data sets.
>>
>>
>> The one thing the current graph shows, that weekly and daily failure
>> rates won't, is a sudden spike in one of the lines. If you stare at the
>> current graphs for long enough and can read through the noise, you can
>> see when the gate collectively crashes or when just the neutron-related
>> gates start failing. So I think one more graph is needed.
>>
>
> The point of the visualizations is to make sense to people that don't
> understand all the data, especially core members of various teams that are
> trying to figure out "if I attack 1 bug right now, what's the biggest bang
> for my buck."
>
>
Yes, that is one of the big uses for a visualization. The one I had in
mind was being able to see if a new unclassified bug appeared.
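
For the failure rate graphs, here is roughly the ES -> pandas ->
matplotlib pipeline I'm imagining (untested sketch; the DataFrame and the
job name are made-up stand-ins for what we'd actually mine out of
Elasticsearch, and the error bars use an adjusted binomial interval):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Toy stand-in for run data mined out of Elasticsearch:
    # one row per job run, with its final status.
    runs = pd.DataFrame({
        'timestamp': pd.to_datetime(['2014-01-06 06:00', '2014-01-06 09:00',
                                     '2014-01-07 06:00', '2014-01-07 09:00']),
        'job': ['gate-grenade-dsvm'] * 4,
        'status': ['FAILURE', 'SUCCESS', 'SUCCESS', 'SUCCESS'],
    })

    runs['failed'] = (runs['status'] == 'FAILURE').astype(float)
    runs['day'] = runs['timestamp'].dt.floor('D')

    # Daily failure rate per job, plus how many runs it is based on.
    daily = (runs.groupby(['job', 'day'])['failed']
                 .agg(['sum', 'count']).reset_index())
    daily['rate'] = daily['sum'] / daily['count']

    # Adjusted (Agresti-Coull style) error bar: a 1-of-1 failure gets a
    # huge bar, a 50-of-50 failure a tiny one.
    p_adj = (daily['sum'] + 2) / (daily['count'] + 4)
    daily['err'] = 1.96 * np.sqrt(p_adj * (1 - p_adj) / (daily['count'] + 4))

    for job, grp in daily.groupby('job'):
        plt.errorbar(grp['day'], 100 * grp['rate'], yerr=100 * grp['err'],
                     fmt='o-', label=job)
    plt.ylabel('daily failure rate (%)')
    plt.legend()
    plt.show()
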


>
>>     Basically we'll be mining Elastic Search -> Pandas TimeSeries ->
>>     transforms and analysis -> output tables and graphs. This is
>>     different enough from our current jquery graphing that I want to get
>>     ACKs before doing a bunch of work here and finding out people don't
>>     like it in reviews.
>>
>>     Also in this process upgrade the metadata that we provide for each
>>     of those bugs so it's a little more clear what you are looking at.
>>
>>
>> For example?
>>
>
> We should always be listing the bug title, not just the number. We should
> also list what projects it's filed against. I've stared at these bugs as
> much as anyone, and I still need to click through the top 4 to figure out
> which one is the ssh bug. :)
>
>
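
Agreed. Pulling the title and the projects a bug is filed against should
be cheap with launchpadlib, something like this (untested sketch;
anonymous login should be enough for public bugs, and the bug number is a
placeholder):

    from launchpadlib.launchpad import Launchpad

    # Anonymous read-only access is enough for public bug metadata.
    lp = Launchpad.login_anonymously('elastic-recheck', 'production')

    def bug_metadata(bug_number):
        """Return the bug title and the projects it is filed against."""
        bug = lp.bugs[bug_number]
        projects = [task.bug_target_name for task in bug.bug_tasks]
        return bug.title, projects

    title, projects = bug_metadata(123456)  # placeholder bug number
    print('%s (%s)' % (title, ', '.join(projects)))
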
>>     = Take over of /recheck =
>>
>>     There is still a bunch of useful data coming in via "recheck bug
>>     ####" comments which hasn't been curated into ER queries. I think
>>     the right thing to do is treat these as a work queue of bugs we
>>     should be building patterns out of (or completely invalidating).
>>     I've got a preliminary gerrit bulk query piece of code that does
>>     this, which would remove the need for the daemon we run today. The
>>     gerrit queries are a little long right now, but I think if we are
>>     only doing this on an hourly cron, the additional load will be
>>     negligible.
>>
>>     This would get us into a single view, which I think would be more
>>     informative than the one we currently have.
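
For what it's worth, here is roughly the shape I imagine that hourly cron
job taking (untested sketch; I haven't seen the preliminary code, so the
gerrit query flags and JSON field names are guesses):

    import json
    import re
    import subprocess

    # Assumed gerrit ssh query; exact flags and fields may need tweaking.
    CMD = ('ssh -p 29418 review.openstack.org gerrit query --format=JSON '
           '--comments "status:open limit:200"')
    RECHECK_RE = re.compile(r'recheck bug (\d+)', re.IGNORECASE)

    def recheck_bug_counts():
        """Tally 'recheck bug NNNN' comments as a fingerprinting work queue."""
        counts = {}
        out = subprocess.check_output(CMD, shell=True).decode('utf-8')
        for line in out.splitlines():
            if not line.strip():
                continue
            change = json.loads(line)
            for comment in change.get('comments', []):
                for bug in RECHECK_RE.findall(comment.get('message', '')):
                    counts[bug] = counts.get(bug, 0) + 1
        return counts

    for bug, count in sorted(recheck_bug_counts().items(),
                             key=lambda kv: kv[1], reverse=True):
        print("bug %s: %d 'recheck bug' comments" % (bug, count))
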
>>
>>
>> Treating /recheck as a work queue sounds great, but I think this needs
>> a bit more fleshing out.
>>
>> I imagine the workflow as something like this:
>>
>> * State 1: Patch author files a bug saying 'gate broke, I didn't do it
>> and don't know why it broke'.
>> * State 2: Someone investigates the bug and determines whether it is
>> valid and whether it's a duplicate. Root cause still isn't known.
>> * State 3: Someone writes a fingerprint for this bug and commits it to
>> elastic-recheck.
>>
>> Assuming we agree on this general workflow, it would be nice if
>> /recheck distinguished between bugs in states 1 and 2. There is no need
>> to list bugs in state 3, as the e-r bot will automatically tell
>> developers when they hit them.
>>
>
> Sure, that means a policy on something in the bugs that can distinguish
> between them. I assume LP states.
>
> State 1 = new & invalid?
> State 2 = confirmed / triaged?
>
> I think we can call that post 1.0 though, as we'll be adding details
> beyond anything we have today.


Yup, this sounds like post 1.0 to me too.

>
>
>>     = Categorize all the jobs =
>>
>>     We need a bit of refactoring to let us comment on all the jobs (not
>>     just tempest ones). At the beginning we basically assumed pep8 and
>>     docs jobs don't fail in the gate. Turns out they do, and are good
>>     indicators of infra / external factor bugs. They are part of the
>>     story so we should put them in.
>>
>>
>> Don't forget grenade
>>
>
> Yep. That's part of all. :) I was just calling out the others as something
> not originally on the list.
>
>
>>     = Multi Line Fingerprints =
>>
>>     We've definitely found bugs where we never had a really satisfying
>>     single-line match, but we had some great matches if we could do
>>     multi-line.
>>
>>     We could do that in ER, however it will mean giving up logstash as
>>     our UI, because those queries can't be done in logstash. So in order
>>     to do this we'll really need to implement some tools - a cli at
>>     minimum, which will let us easily test a bug. A custom web UI might
>>     be in order as well, though that's going to be its own chunk of work
>>     that we'll need more volunteers for.
>>
>>     This would put us in a place where we should have all the
>>     infrastructure to track 90% of the race conditions, and talk about
>>     them with certainty as 1%, 5%, 0.1% bugs.
>>
>>
>>
>> Hurrah. Multi-line matches are two separate Elasticsearch queries where
>> you match build_uuids. So to get the set of all hits of a multi-line
>> fingerprint, you find the intersection between line_1 and line_2 where
>> the key is build_uuid.
>>
>
> Yes. The biggest issue is tooling for making it easy for people to test
> their queries. It's pretty unfriendly to tell people to do manual
> correlation in ES.
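
To make the tooling concrete, here is a minimal sketch of that
intersection (untested; assumes the elasticsearch python client, that
build_uuid is in _source, and the endpoint, index pattern, and example
query strings are all placeholders):

    from elasticsearch import Elasticsearch

    # Placeholder endpoint; point this at wherever our logstash cluster lives.
    es = Elasticsearch(['http://logstash.openstack.org:9200'])

    def build_uuids(query):
        """Return the set of build_uuids whose logs match one query line."""
        body = {'query': {'query_string': {'query': query}}, 'size': 1000}
        res = es.search(index='logstash-*', body=body)
        # Assumes build_uuid is stored in _source for each hit.
        return {hit['_source']['build_uuid'] for hit in res['hits']['hits']}

    # A multi-line fingerprint is just the intersection of the two
    # single-line result sets, keyed on build_uuid.
    line_1 = 'message:"first symptom line" AND filename:"console.html"'
    line_2 = 'message:"second symptom line" AND filename:"screen-n-cpu.txt"'
    hits = build_uuids(line_1) & build_uuids(line_2)
    print('%d builds match both lines' % len(hits))
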
>
>
>         -Sean
>
> --
> Sean Dague
> Samsung Research America
> sean at dague.net / sean.dague at samsung.com
> http://dague.net
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>