[openstack-dev] [elastic-recheck] Thoughts on next steps

Clark Boylan clark.boylan at gmail.com
Fri Jan 3 02:44:11 UTC 2014


On Thu, Jan 2, 2014 at 6:29 PM, Sean Dague <sean at dague.net> wrote:
> A lot of elastic recheck this fall has been based on the ad hoc needs of the
> moment, in between diving down into the race bugs that were uncovered by it.
> This week away from it all helped provide a little perspective on what I
> think we need to do to call it *done* (i.e. something akin to a 1.0 even
> though we are CDing it).
>
> Here is my current thinking on the next major things that should happen.
> Opinions welcomed.
>
> (These are roughly in implementation order based on urgency)
>
> = Split of web UI =
>
> The elastic recheck page is becoming a mishmash of what was needed at the
> time. I think what we really have emerging is:
>  * Overall Gate Health
>  * Known (to ER) Bugs
>  * Unknown (to ER) Bugs - more below
>
> I think the landing page should be Known Bugs, as that's where we want bug
> hunters to go to prioritize things, and where people looking for known
> bugs should start.
>
> I think the overall Gate Health graphs should move to the zuul status page.
> Possibly as part of the collection of graphs at the bottom.
>
> We should have a secondary page (maybe column?) of the un-fingerprinted
> recheck bugs, largely to use as candidates for fingerprinting. This will let
> us eventually take over /recheck.
>
> = Data Analysis / Graphs =
>
> I spent a bunch of time playing with pandas over break
> (http://dague.net/2013/12/30/ipython-notebook-experiments/), and it's kind
> of awesome. It also made me rethink our approach to handling the data.
>
> I think the rolling average approach we were taking is more precise than
> accurate. These are statistical events, so they really need error bars:
> when we have a quiet night and one job fails at 6am, the 100% failure
> rate it shows for grenade needs to be qualified as 1 of 1, not 50 of 50.
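>
> To make that concrete, here's a minimal sketch of the kind of error
> bars I mean (the Wilson score interval is my choice here, not
> something ER computes today):
>
>     import math
>
>     def wilson_interval(failures, runs, z=1.96):
>         """95% Wilson score interval for a binomial failure rate."""
>         p = failures / float(runs)
>         denom = 1 + z * z / runs
>         center = (p + z * z / (2 * runs)) / denom
>         margin = (z / denom) * math.sqrt(
>             p * (1 - p) / runs + z * z / (4 * runs * runs))
>         return center - margin, center + margin
>
>     # a "100% failure rate" means very different things at n=1, n=50
>     print(wilson_interval(1, 1))    # ~(0.21, 1.0) - could be anything
>     print(wilson_interval(50, 50))  # ~(0.93, 1.0) - genuinely broken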
>
> So my feeling is we should move away from the point graphs we have, and
> present these as weekly and daily failure rates (with graphs and error
> bars). And slice those per job. My suggestion is that we do the actual
> visualization with matplotlib because it's super easy to output that from
> pandas data sets.
>
> Basically we'll be mining Elastic Search -> Pandas TimeSeries -> transforms
> and analysis -> output tables and graphs. This is different enough from our
> current jquery graphing that I want to get ACKs before doing a bunch of
> work here and finding out in review that people don't like it.
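>
> In sketch form (hits() and the actual ES plumbing are hand-waved;
> this shows the shape of the pipeline, not the implementation):
>
>     import pandas as pd
>     import matplotlib.pyplot as plt
>
>     def plot_failure_rate(hits, job_name):
>         # hits: (timestamp, failed) pairs mined from Elastic Search
>         ts = pd.Series([failed for _, failed in hits],
>                        index=pd.to_datetime([t for t, _ in hits]))
>         rate = ts.resample('D').mean()    # fraction failed per day
>         runs = ts.resample('D').count()   # sample size per day
>         # normal-approximation error bars; low-n days get wide bars
>         err = 1.96 * (rate * (1 - rate) / runs) ** 0.5
>         plt.errorbar(rate.index, rate, yerr=err, fmt='o')
>         plt.title('%s daily failure rate' % job_name)
>         plt.savefig('%s.png' % job_name)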
>
> Also in this process we should upgrade the metadata we provide for each
> of those bugs so it's a little clearer what you are looking at.
>
> = Take over of /recheck =
>
> There is still a bunch of useful data coming in via "recheck bug ####"
> comments that hasn't been curated into ER queries. I think the right
> thing to do is treat these as a work queue of bugs we should be building
> patterns out of (or completely invalidating). I've got a preliminary
> gerrit bulk query piece of code that does this, which would remove the
> need for the daemon we currently run. The gerrit queries are a little
> long right now,
> but I think if we are only doing this on hourly cron, the additional load
> will be negligible.
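>
> The core of it looks roughly like this (the exact gerrit query flags
> are from memory, so treat the details as an assumption):
>
>     import json
>     import re
>     import subprocess
>
>     RECHECK = re.compile(r'recheck bug (\d+)')
>
>     cmd = ['ssh', '-p', '29418', 'review.openstack.org',
>            'gerrit', 'query', '--format=JSON', '--comments',
>            'status:open']
>
>     bug_counts = {}
>     for line in subprocess.check_output(cmd).splitlines():
>         change = json.loads(line)
>         # the trailing stats record has no comments key
>         for comment in change.get('comments', []):
>             for bug in RECHECK.findall(comment.get('message', '')):
>                 bug_counts[bug] = bug_counts.get(bug, 0) + 1
>
>     # the most-rechecked bugs are the best fingerprint candidates
>     for bug, count in sorted(bug_counts.items(), key=lambda x: -x[1]):
>         print('bug %s: %d rechecks' % (bug, count))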
>
> This would get us into a single view, which I think would be more
> informative than the one we currently have.
>
> = Categorize all the jobs =
>
> We need a bit of refactoring to let us comment on all the jobs (not just
> tempest ones). At the beginning we basically assumed pep8 and docs jobs
> don't fail in the gate. Turns out they do, and they are good indicators
> of infra / external-factor bugs. They are part of the story, so we should
> put them in.
>
> = Multi Line Fingerprints =
>
> We've definitely found bugs where we never had a really satisfying single
> line match, but we had some great matches if we could do multi line.
>
> We could do that in ER; however, it will mean giving up logstash as our
> UI, because those queries can't be done in logstash. So in order to do
> this we'll really need to implement some tools - a cli at minimum, which
> will let us easily test a bug. A custom web UI might be in order as well,
> though that's going to be its own chunk of work that we'll need more
> volunteers for.
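>
> As a sketch of what the cli tool could do - join two single line
> queries on build_uuid (the field names are ER's logstash schema as I
> remember it, and the two fingerprints are made-up examples):
>
>     from elasticsearch import Elasticsearch
>
>     def builds_matching(es, query):
>         body = {'query': {'query_string': {'query': query}},
>                 '_source': ['build_uuid'], 'size': 1000}
>         result = es.search(index='logstash-*', body=body)
>         return set(hit['_source']['build_uuid']
>                    for hit in result['hits']['hits'])
>
>     es = Elasticsearch(['http://logstash.openstack.org:80'])
>     line_a = 'message:"Timed out waiting for resource to go ACTIVE"'
>     line_b = 'message:"Lock wait timeout exceeded"'
>     # a build matches only if both lines show up in its logs
>     both = builds_matching(es, line_a) & builds_matching(es, line_b)
>     print('%d builds match both lines' % len(both))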
>
> This would put us in a place where we should have all the infrastructure
> to track 90% of the race conditions, and talk about them with certainty
> as 1%, 5%, or 0.1% bugs.
>
>         -Sean
>
> --
> Sean Dague
> Samsung Research America
> sean at dague.net / sean.dague at samsung.com
> http://dague.net
>

This is great stuff. Out of curiosity, is the motivation for doing the
graphing with pandas and ES rather than graphite that we can graph
things in a more ad hoc fashion? Also, for the dashboard, Kibana3 does
a lot more than Kibana2, which we currently use. I have been meaning to
get Kibana3 running alongside Kibana2, and I think it may be able to do
multi line queries (I need to double check that, but it has a lot more
query and graphing capability). I think Kibana3 is worth looking into
before we go too far down the road of a custom UI.

Clark


