[openstack-dev] [elastic-recheck] Thoughts on next steps
Clark Boylan
clark.boylan at gmail.com
Fri Jan 3 02:53:57 UTC 2014
On Thu, Jan 2, 2014 at 6:44 PM, Clark Boylan <clark.boylan at gmail.com> wrote:
> On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague <sean at dague.net> wrote:
>> A lot of elastic recheck this fall has been based on the ad hoc needs of the
>> moment, in between diving down into the race bugs that were uncovered by it.
>> This week away from it all helped provide a little perspective on what I
>> think we need to do to call it *done* (i.e. something akin to a 1.0 even
>> though we are CDing it).
>>
>> Here is my current thinking on the next major things that should happen.
>> Opinions welcomed.
>>
>> (These are roughly in implementation order based on urgency)
>>
>> = Split of web UI =
>>
>> The elastic recheck page is becoming a mishmash of what was needed at the
>> time. I think what we really have emerging is:
>> * Overall Gate Health
>> * Known (to ER) Bugs
>> * Unknown (to ER) Bugs - more below
>>
>> I think the landing page should be Known Bugs, as that's where we want bug
>> hunters to go to prioritize things, and where people looking for known bugs
>> should start.
>>
>> I think the overall Gate Health graphs should move to the zuul status page.
>> Possibly as part of the collection of graphs at the bottom.
>>
>> We should have a secondary page (maybe column?) of the un-fingerprinted
>> recheck bugs, largely to use as candidates for fingerprinting. This will let
>> us eventually take over /recheck.
>>
>> = Data Analysis / Graphs =
>>
>> I spent a bunch of time playing with pandas over break
>> (http://dague.net/2013/12/30/ipython-notebook-experiments/), it's kind of
>> awesome. It also made me rethink our approach to handling the data.
>>
>> I think the rolling average approach we were taking is more precise than
>> accurate. As these are statistical events they really need error bars.
>> When we have a quiet night and one job fails at 6am, the 100% failure
>> rate that shows up for grenade needs to be qualified as 1 of 1, not
>> 50 of 50.
>>
>> So my feeling is we should move away from the point graphs we have, and
>> present these as weekly and daily failure rates (with graphs and error
>> bars). And slice those per job. My suggestion is that we do the actual
>> visualization with matplotlib because it's super easy to output that from
>> pandas data sets.
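A minimal sketch of the daily failure rates with error bars described above, using pandas and numpy. The sample data and column names are illustrative; the interval is a Wilson score interval rather than the naive binomial standard error, since the naive formula collapses to zero width at a 1-of-1 (100%) failure rate, which is exactly the case we want the bars to flag:

```python
import numpy as np
import pandas as pd

def wilson_interval(failures, runs, z=1.96):
    """95% Wilson score interval for a binomial failure rate.

    Unlike naive +/- sqrt(p*(1-p)/n) error bars, this stays sensible
    at p = 0 or 1 -- exactly the 1-of-1 6am failure case.
    """
    p = failures / runs
    denom = 1 + z ** 2 / runs
    center = (p + z ** 2 / (2 * runs)) / denom
    half = (z / denom) * np.sqrt(p * (1 - p) / runs + z ** 2 / (4 * runs ** 2))
    return center - half, center + half

# Hypothetical sample: one row per gate run of a single job,
# value 1 when the run failed.
runs = pd.Series(
    [1, 0, 0, 1, 0],
    index=pd.to_datetime([
        "2014-01-01 06:00", "2014-01-01 12:00",
        "2014-01-02 06:00", "2014-01-02 07:00", "2014-01-02 08:00",
    ]),
)

daily = runs.resample("D").agg(["sum", "count"])
daily["rate"] = daily["sum"] / daily["count"]
daily["low"], daily["high"] = wilson_interval(daily["sum"], daily["count"])
# "daily" feeds straight into matplotlib, e.g.:
# plt.errorbar(daily.index, daily["rate"],
#              yerr=[daily["rate"] - daily["low"],
#                    daily["high"] - daily["rate"]])
```

The same resample/aggregate step works unchanged for weekly rates (`resample("W")`) and can be run per job.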
>>
>> Basically we'll be mining Elastic Search -> Pandas TimeSeries -> transforms
>> and analysis -> output tables and graphs. This is different enough from our
>> current jquery graphing that I want to get ACKs before doing a bunch of work
>> here and finding out people don't like it in reviews.
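The Elastic Search -> pandas end of that pipeline could look roughly like the sketch below. The response envelope (`hits.hits[]._source["@timestamp"]`) is the standard Elasticsearch shape, but the canned sample stands in for a real query against the logstash cluster, and the helper name is made up:

```python
import pandas as pd

def hits_to_series(es_response):
    """Turn an Elasticsearch search response into a pandas time series
    with one entry per matching log line (i.e. per fingerprint hit)."""
    stamps = [hit["_source"]["@timestamp"]
              for hit in es_response["hits"]["hits"]]
    return pd.Series(1, index=pd.to_datetime(stamps)).sort_index()

# Canned response standing in for a real fingerprint query; only the
# fields the helper reads are shown.
sample = {"hits": {"hits": [
    {"_source": {"@timestamp": "2014-01-02T06:10:00Z"}},
    {"_source": {"@timestamp": "2014-01-02T06:12:00Z"}},
    {"_source": {"@timestamp": "2014-01-09T14:00:00Z"}},
]}}

hits = hits_to_series(sample)
weekly = hits.resample("W").sum()  # weekly hit counts for one fingerprint
```

From here the transforms, tables, and matplotlib output are all ordinary pandas operations.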
>>
>> Also, as part of this process, we should upgrade the metadata we provide
>> for each of those bugs so it's a little clearer what you are looking at.
>>
>> = Take over of /recheck =
>>
>> There is still a bunch of useful data coming in via "recheck bug ####"
>> comments which hasn't been curated into ER queries. I think the right
>> thing to do is treat these as a work queue of bugs we should be building
>> patterns out of (or completely invalidating). I've got a preliminary
>> gerrit bulk query piece of code that does this, which would remove the
>> need for the daemon we currently run. The gerrit queries are a little
>> long right now, but I think if we are only doing this on an hourly cron,
>> the additional load will be negligible.
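The parsing half of that bulk query could be sketched as below, assuming the JSON-lines output of `gerrit query --format=JSON --comments` (the helper name and the sample bug number are illustrative):

```python
import json
import re

# Matches the "recheck bug NNNNNN" convention used in review comments.
RECHECK_RE = re.compile(r"\brecheck bug (\d+)", re.IGNORECASE)

def bugs_from_gerrit_json(lines):
    """Tally recheck'd bug numbers from gerrit query JSON-lines output.

    Each line is one JSON object per change; the trailing stats row
    has no "comments" key and is skipped naturally.
    """
    counts = {}
    for line in lines:
        change = json.loads(line)
        for comment in change.get("comments", []):
            for bug in RECHECK_RE.findall(comment.get("message", "")):
                counts[bug] = counts.get(bug, 0) + 1
    return counts

# Illustrative input; real data would come from an hourly cron run of
# the gerrit query.
sample = [
    json.dumps({"comments": [
        {"message": "recheck bug 1253896"},
        {"message": "Looks good to me"},
    ]}),
    json.dumps({"comments": [{"message": "recheck bug 1253896"}]}),
    json.dumps({"type": "stats", "rowCount": 2}),
]
work_queue = bugs_from_gerrit_json(sample)
```

Bugs sorted by count in `work_queue` would give the un-fingerprinted page its priority order.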
>>
>> This would get us into a single view, which I think would be more
>> informative than the one we currently have.
>>
>> = Categorize all the jobs =
>>
>> We need a bit of refactoring to let us comment on all the jobs (not just
>> tempest ones). Basically, at the beginning we assumed pep8 and docs jobs
>> don't fail in the gate. Turns out they do, and they are good indicators
>> of infra / external-factor bugs. They are part of the story, so we should
>> put them in.
>>
>> = Multi Line Fingerprints =
>>
>> We've definitely found bugs where we never had a really satisfying single
>> line match, but we had some great matches if we could do multi line.
>>
>> We could do that in ER, however it will mean giving up logstash as our UI,
>> because those queries can't be done in logstash. So in order to do this
>> we'll really need to implement some tools - a CLI at minimum - which will
>> let us easily test a bug. A custom web UI might be in order as well,
>> though that's going to be its own chunk of work, for which we'll need
>> more volunteers.
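At the query level, a multi-line fingerprint boils down to AND-ing several phrase matches against the same document in one Elasticsearch bool query. A sketch, where the field name (`message`) is assumed from the logstash schema and the sample log lines are invented:

```python
def multiline_fingerprint(*phrases):
    """Build an ES bool query requiring every phrase to match the same
    document -- the AND semantics a multi-line fingerprint needs,
    rather than the OR that the logstash UI applies across queries."""
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"message": p}} for p in phrases]
            }
        }
    }

query = multiline_fingerprint(
    "Timed out waiting for thing",   # illustrative log lines,
    "libvirtError: unable to read",  # not a real fingerprint
)
```

A CLI that takes a list of phrases, builds this query, and reports the hit count per build would be enough to test a candidate bug.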
>>
>> This would put us in a place where we should have all the infrastructure
>> to track 90% of the race conditions, and talk about them with certainty
>> as 1%, 5%, or 0.1% bugs.
>>
>> -Sean
>>
>> --
>> Sean Dague
>> Samsung Research America
>> sean at dague.net / sean.dague at samsung.com
>> http://dague.net
>>
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
> This is great stuff. Out of curiosity, is doing the graphing with
> pandas and ES rather than graphite so that we can graph things in a more
> ad hoc fashion? Also, for the dashboard, Kibana3 does a lot more than the
> Kibana2 we currently use. I have been meaning to get Kibana3 running
> alongside Kibana2, and I think it may be able to do multi-line queries (I
> need to double-check that, but it has a lot more query and graphing
> capability). I think Kibana3 is worth looking into as well before we go
> too far down the road of a custom UI.
>
> Clark
A quick check at http://demo.kibana.org/#/dashboard shows that while
it supports multiple queries, it just ORs all of the results together.
So it doesn't quite do what we need.
Clark