[openstack-dev] [all] Zuul job backlog

Matt Riedemann mriedemos at gmail.com
Wed Sep 19 19:45:51 UTC 2018


On 9/19/2018 2:11 PM, Clark Boylan wrote:
> Unfortunately, right now our classification rate is very poor (only 15%), which makes it difficult to know what exactly is causing these failures. Mriedem and I have quickly scanned the unclassified list, and it appears there is a db migration testing issue causing these tests to time out across several projects. Mriedem is working to get this classified and tracked, which should help, but we will also need to fix the bug. On top of that, it appears that Glance has flaky functional tests (both python2 and python3) which are causing resets and should be looked into.
> 
> If you'd like to help, let mriedem or me know and we'll gladly work with you to get elasticsearch queries added to elastic-recheck. We are likely less help when it comes to fixing functional tests in Glance, but I'm happy to point people in the right direction for that as much as I can. If you can take a few minutes to do this before/after you issue a recheck, it does help quite a bit.

Things have gotten bad enough that I've started proposing changes to 
skip tests with particularly high failure rates when no one is otherwise 
triaging and fixing the underlying bugs. For example:

https://review.openstack.org/#/c/602649/

https://review.openstack.org/#/c/602656/

Generally this is a last resort since it means we're losing test 
coverage, but when we hit a critical mass of random failures it becomes 
extremely difficult to merge code.
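
For anyone who hasn't done this before, here is a rough sketch of what 
such a skip can look like using only the standard library's unittest 
module. The class name, test names, and bug number are placeholders I 
made up for illustration; they are not taken from the reviews above, and 
individual projects may use their own skip helpers instead.

```python
import unittest


class TestServerActions(unittest.TestCase):
    """Hypothetical test case; all names here are placeholders."""

    # Record the bug in the skip reason so the test is easy to find and
    # re-enable once the underlying bug is fixed.
    @unittest.skip("Skipped until bug NNNNNNN (placeholder) is fixed")
    def test_flaky_operation(self):
        self.fail("stand-in for an intermittent failure")

    def test_stable_operation(self):
        self.assertEqual(2, 1 + 1)


def run_suite():
    """Run the hypothetical test case and return the collected result."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(TestServerActions)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    outcome = run_suite()
    for test, reason in outcome.skipped:
        print("skipped: %s (%s)" % (test.id(), reason))
```

The skipped test still shows up in the results with its reason, which is 
what keeps the lost coverage visible.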

Another one we need to make a decision on is:

https://bugs.launchpad.net/tempest/+bug/1783405

There I'm suggesting that we mark more slow tests with the actual 
"slow" tag in Tempest so that they only run in the tempest-slow job, 
rather than increasing the timeout of the tempest-full job. gmann and I 
talked about this last week over IRC but I forgot to update the bug 
report with the details. Increasing timeouts gives some short-term 
relief, but eventually we just have to look at these issues again, and 
a tempest run shouldn't take over 2 hours (remember when it used to 
take ~45 minutes?).
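
To illustrate the idea of tag-based selection, here is a minimal, 
self-contained sketch. It does not use Tempest itself: the attr() 
decorator below is a simplified stand-in I wrote for illustration (in 
Tempest the tag is applied with decorators.attr(type='slow') from 
tempest.lib), and the test names and the select() helper are 
hypothetical.

```python
def attr(**kwargs):
    """Simplified stand-in for an attribute decorator (an assumption,
    not Tempest's implementation): record a 'type' tag on a test."""
    def decorator(func):
        tags = set(getattr(func, "_tags", set()))
        tag = kwargs.get("type")
        if tag is not None:
            tags.add(tag)
        func._tags = tags
        return func
    return decorator


@attr(type="slow")
def test_long_running_scenario():  # hypothetical slow test
    pass


def test_list_servers():  # hypothetical fast test
    pass


ALL_TESTS = [test_long_running_scenario, test_list_servers]


def select(tests, include_slow):
    """Return the names of tests for a job: only slow-tagged tests when
    include_slow is True, only untagged tests otherwise."""
    return [
        t.__name__
        for t in tests
        if ("slow" in getattr(t, "_tags", set())) == include_slow
    ]


if __name__ == "__main__":
    print("full job:", select(ALL_TESTS, include_slow=False))
    print("slow job:", select(ALL_TESTS, include_slow=True))
```

The point is that tagging moves a test between jobs without losing it, 
whereas raising the timeout just hides how long the run has become.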

-- 

Thanks,

Matt


