[openstack-dev] [elastic-recheck] Thoughts on next steps

Matt Riedemann mriedem at linux.vnet.ibm.com
Tue Jan 7 23:44:03 UTC 2014



On 1/7/2014 5:26 PM, Sean Dague wrote:
> On 01/07/2014 06:20 PM, Matt Riedemann wrote:
>>
>>
>> On 1/2/2014 8:29 PM, Sean Dague wrote:
>>> A lot of elastic recheck this fall has been based on the ad hoc needs of
>>> the moment, in between diving down into the race bugs that were
>>> uncovered by it. This week away from it all helped provide a little
>>> perspective on what I think we need to do to call it *done* (i.e.
>>> something akin to a 1.0 even though we are CDing it).
>>>
>>> Here is my current thinking on the next major things that should happen.
>>> Opinions welcomed.
>>>
>>> (These are roughly in implementation order based on urgency)
>>>
>>> = Split of web UI =
>>>
>>> The elastic recheck page is becoming a mismash of what was needed at the
>>> time. I think what we really have emerging is:
>>>   * Overall Gate Health
>>>   * Known (to ER) Bugs
>>>   * Unknown (to ER) Bugs - more below
>>>
>>> I think the landing page should be Know Bugs, as that's where we want
>>> both bug hunters to go to prioritize things, as well as where people
>>> looking for known bugs should start.
>>>
>>> I think the overall Gate Health graphs should move to the zuul status
>>> page. Possibly as part of the collection of graphs at the bottom.
>>>
>>> We should have a secondary page (maybe column?) of the un-fingerprinted
>>> recheck bugs, largely to use as candidates for fingerprinting. This will
>>> let us eventually take over /recheck.
>>>
>>> = Data Analysis / Graphs =
>>>
>>> I spent a bunch of time playing with pandas over break
>>> (http://dague.net/2013/12/30/ipython-notebook-experiments/), it's kind
>>> of awesome. It also made me rethink our approach to handling the data.
>>>
>>> I think the rolling average approach we were taking is more precise than
>>> accurate. As these are statistical events they really need error bars.
>>> Because when we have a quiet night, and 1 job fails at 6am in the
>>> morning, the 100% failure rate it reflects in grenade needs to be
>>> quantified that it was 1 of 1, not 50 of 50.
>>>
>>> So my feeling is we should move away from the point graphs we have, and
>>> present these as weekly and daily failure rates (with graphs and error
>>> bars). And slice those per job. My suggestion is that we do the actual
>>> visualization with matplotlib because it's super easy to output that
>>> from pandas data sets.
>>>
>>> Basically we'll be mining Elastic Search -> Pandas TimeSeries ->
>>> transforms and analysis -> output tables and graphs. This is different
>>> enough from our current jquery graphing that I want to get ACKs before
>>> doing a bunch of work here and finding out people don't like it in
>>> reviews.
>>>
>>> Also in this process upgrade the metadata that we provide for each of
>>> those bugs so it's a little more clear what you are looking at.
>>>
>>> = Take over of /recheck =
>>>
>>> There is still a bunch of useful data coming in on "recheck bug ####"
>>> data which hasn't been curated into ER queries. I think the right thing
>>> to do is treat these as a work queue of bugs we should be building
>>> patterns out of (or completely invalidating). I've got a preliminary
>>> gerrit bulk query piece of code that does this, which would remove the
>>> need of the daemon the way that's currently happening. The gerrit
>>> queries are a little long right now, but I think if we are only doing
>>> this on hourly cron, the additional load will be negligible.
>>>
>>> This would get us into a single view, which I think would be more
>>> informative than the one we currently have.
>>>
>>> = Categorize all the jobs =
>>>
>>> We need a bit of refactoring to let us comment on all the jobs (not just
>>> tempest ones). Basically we assumed pep8 and docs don't fail in the gate
>>> at the beginning. Turns out they do, and are good indicators of infra /
>>> external factor bugs. They are a part of the story so we should put them
>>> in.
>>>
>>> = Multi Line Fingerprints =
>>>
>>> We've definitely found bugs where we never had a really satisfying
>>> single line match, but we had some great matches if we could do multi
>>> line.
>>>
>>> We could do that in ER, however it will mean giving up logstash as our
>>> UI, because those queries can't be done in logstash. So in order to do
>>> this we'll really need to implement some tools - cli minimum, which will
>>> let us easily test a bug. A custom web UI might be in order as well,
>>> though that's going to be it's own chunk of work, that we'll need more
>>> volunteers for.
>>>
>>> This would put us in a place where we should have all the infrastructure
>>> to track 90% of the race conditions, and talk about them in certainty as
>>> 1%, 5%, 0.1% bugs.
>>>
>>>      -Sean
>>>
>>
>> Let's add regexp query support to elastic-recheck so that I could have
>> fixed this better:
>>
>> https://review.openstack.org/#/c/65303/
>>
>> Then I could have just filtered the build_name with this:
>>
>> build_name:/(check|gate)-(tempest|grenade)-[a-z\-]+/
>
> If you want to extend the query files with:
>
> regex:
>     - build_name: /(check|gate)-(tempest|grenade)-[a-z\-]+/
>     - some_other_field: /some other regex/
>
> And make it work with the query builder, I think we should consider it.
> It would be good to know how much more expensive those queries get
> though, because our ES is under decent load as it is.
>
>      -Sean
>
>
>

Yeah, alternatively we could turn on wildcard support in the 
query_string capability but the docs warn against that for performance 
reasons (which you can negate a bit with allow_leading_wildcard=false).

I'm not sure how to figure out how much more expensive those queries get 
though to see if they are really a limiting factor for us supporting 
them?  Ideas on that?

-- 

Thanks,

Matt Riedemann




More information about the OpenStack-dev mailing list