[OpenStack-Infra] [CI] Gathering Results for Research

Clark Boylan cboylan at sapwetik.org
Tue Dec 11 18:00:18 UTC 2018


On Mon, Dec 10, 2018, at 8:56 PM, Vysali Vaidhyam Subramanian wrote:
> Hello,
> 
> I am a grad student currently studying flaky tests.
> As part of my study, I've been examining the check and gate jobs
> in the OpenStack projects. I have been trying to identify why
> developers run rechecks and how often running a recheck helps in
> the identification of a flaky test.
> 
> To identify how often a recheck points to a flaky test, I need the
> test results of each recheck. However, I have not been able to get
> this information from Gerrit for each recheck comment.
> I was wondering if the history of the jobs run against a recheck
> comment is available and whether it can be retrieved.
> 
> It would be great if I could get some pointers :)
> 
> 

This information is known to Zuul, but I don't think Zuul currently records a flag to indicate that results are due to some human-triggered retry mechanism. One approach would be to add this functionality to Zuul and rely on the Zuul builds database for that data.

Another approach that doesn't require updates to Zuul is to parse Gerrit comments and flag things yourself. Check jobs only run when a new patchset is pushed or when the change is rechecked. This means the first results for a patchset are the initial set; any subsequent results for that patchset from the check pipeline (indicated in the comment itself) are the result of rechecks.
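As a rough sketch of that comment-parsing idea: treat each Zuul result comment as belonging to a patchset, and call everything after the first result per patchset a recheck result. The comment shapes below (tuples of patchset number and message text, and the "Build succeeded/failed" prefix) are simplified assumptions, not the actual Gerrit REST schema.

```python
import re

# Zuul result comments historically start with "Build succeeded" or
# "Build failed"; this regex is an assumption you should verify against
# real comment data before relying on it.
ZUUL_RESULT = re.compile(r"^(Build succeeded|Build failed)", re.MULTILINE)


def classify_check_results(comments):
    """Split Zuul check-pipeline result comments per patchset into the
    initial run (triggered by the patchset upload) and recheck runs.

    comments: iterable of (patchset_number, message) in posting order.
    Returns (initial_results, recheck_results) as two lists.
    """
    initial, rechecks = [], []
    seen_result_for_ps = set()
    for patchset, message in comments:
        if not ZUUL_RESULT.search(message):
            continue  # not a Zuul result comment (e.g. a human "recheck")
        if patchset in seen_result_for_ps:
            rechecks.append((patchset, message))
        else:
            seen_result_for_ps.add(patchset)
            initial.append((patchset, message))
    return initial, rechecks
```

The heuristic only needs the ordering of comments within a change, so it works on archived Gerrit data without any Zuul changes.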

The gate is a bit more complicated because shared gate queues can cause a change's tests to be rerun if a related change is rechecked. You can probably infer if the recheck was on this particular change by looking for previous recheck comments without results.
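That inference could be sketched as: when a gate result arrives, check whether a human "recheck" comment was posted on this change since the last result. If not, the rerun was probably caused by a related change in the shared queue. The (author, message) tuple shape and the "gate pipeline" marker below are illustrative assumptions.

```python
def attribute_gate_reruns(comments):
    """For each Zuul gate result, decide whether it plausibly followed a
    'recheck' posted on this change (True) or was a requeue caused by a
    related change in a shared gate queue (False).

    comments: iterable of (author, message) in posting order; the author
    name "zuul" and message shapes are assumptions, not a real schema.
    """
    attributed = []
    pending_recheck = False
    for author, message in comments:
        text = message.strip().lower()
        if text.startswith("recheck"):
            pending_recheck = True
        elif author == "zuul" and "gate pipeline" in text:
            attributed.append(pending_recheck)
            pending_recheck = False
    return attributed
```

This is only a heuristic: a recheck and an unrelated requeue can race, so spot-check the results against the actual job timestamps.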

Unfortunately I don't know how clean the data is. I believe the Zuul comments have been very consistent over time, but I don't know that for sure. You may want to pursue both approaches: the first makes future data easier to consume, and the second gives you a way to get at the preexisting data.

Some other thoughts. Our job log retention time is quite short due to disk space constraints (~4 weeks?). While the Gerrit comments go back many years, if you want to know which specific test case a tempest job failed on, you'll only be able to get that data for the last month or so.

We also try to index our job logs in Elasticsearch and expose them via a Kibana web UI and a subset of the Elasticsearch API at http://logstash.openstack.org. More details at https://docs.openstack.org/infra/system-config/logstash.html. We are happy for people to use that for additional insight. Just please try to be nice to our cluster, and we'd love it if you shared insights/results with us too.
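For someone starting out, a query against that index might look like the sketch below. The query body is standard Elasticsearch query DSL; the exact endpoint path exposed at logstash.openstack.org and the indexed field names (message, @timestamp) are assumptions here, so check the system-config docs linked above before running anything.

```python
import json
import urllib.request


def build_logstash_query(message_fragment, hours=24, size=10):
    """Build a standard Elasticsearch query body searching indexed job
    logs for a message fragment over a recent time window. Keep `size`
    small to be kind to the shared cluster."""
    return {
        "query": {
            "bool": {
                "must": [{"match_phrase": {"message": message_fragment}}],
                "filter": [
                    {"range": {"@timestamp": {"gte": f"now-{hours}h"}}}
                ],
            }
        },
        "size": size,
    }


# The URL below is a guess at the exposed API path; verify it first.
# body = json.dumps(build_logstash_query("Connection reset by peer")).encode()
# req = urllib.request.Request(
#     "http://logstash.openstack.org/elasticsearch/_search",
#     data=body, headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```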

Finally, we do some tracking of what we think are reasons for rechecks with our "elastic-recheck" tool. It builds on top of the Elasticsearch cluster above, using bug fingerprint queries to track the occurrence of known issues. http://status.openstack.org/elastic-recheck/ renders graphs, and the source repo for elastic-recheck has all the query fingerprints. Again, feel free to use this tool if it is helpful, but we'd love insights/feedback/etc if you end up learning anything interesting with it.
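For reference, those fingerprints live as small YAML files in the elastic-recheck source repo's queries/ directory, one per tracked bug. A sketch of the shape (the bug number and query string below are made up, not a real fingerprint):

```yaml
# queries/1234567.yaml -- hypothetical bug number and query string
query: >-
  message:"Connection reset by peer" AND
  tags:"console" AND
  build_status:"FAILURE"
```

The query string is Lucene syntax run against the same Elasticsearch indexes mentioned above, so a fingerprint you prototype in Kibana can be dropped into a file like this.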

Hope this was useful,
Clark


