[infra] Update on test throughput and Zuul backlogs

Ghanshyam Mann gmann at ghanshyammann.com
Sun Dec 9 13:57:54 UTC 2018


 ---- On Fri, 07 Dec 2018 08:50:30 +0900 Matt Riedemann <mriedemos at gmail.com> wrote ---- 
 > On 12/6/2018 5:16 PM, Clark Boylan wrote: 
 > > I was asked to write another one of these in the Nova meeting today so here goes. 
 >  
 > Thanks Clark, this is really helpful. 
 >  
 > >  
 > > One thing to keep in mind is that while the test nodes are slower than we'd like, they have also exposed some situations where our software is less efficient than we'd like. At least one bug, https://bugs.launchpad.net/nova/+bug/1807219, has been identified through this. I would encourage people debugging these slow tests to look to see if this exposes a deficiency in our software that can be fixed. 
 >  
 > That was split off from this: 
 >  
 > https://bugs.launchpad.net/nova/+bug/1807044 
 >  
 > But yeah a couple of issues Dan and I are digging into. 
 >  
 > Another thing I noticed in one of these nova-api start timeout failures  
 > in ovh-bhs1 was that uwsgi seems to just stall for 26 seconds here: 
 >  
 > http://logs.openstack.org/01/619701/5/gate/tempest-slow/2bb461b/controller/logs/screen-n-api.txt.gz#_Dec_05_20_13_23_060958 
 >  
 > I pushed a patch to enable uwsgi debug logging: 
 >  
 > https://review.openstack.org/#/c/623265/ 
 >  
 > But of course I didn't (1) get a recreate or (2) seem to see any  
 > additional debug logging from uwsgi. If someone else knows how to enable  
 > that please let me know. 
 >  
 > >  
 > > These are the big issues that affect large numbers of projects (or even all of them), but there are still many project specific problems floating around as well. Unfortunately I haven't had much time to help dig into those recently (see broader issues above), but I think it would be helpful if projects can do some of that digging themselves. Also, a friendly reminder that we try to provide in cloud region mirrors and caches for commonly used resources like distro packages, pypi packages, dockerhub images, and so on. If your jobs aren't using these and you find they fail occasionally due to the Internet being flaky we'll be happy to help you update the jobs to use the in region resources instead. 
 >  
 > I'm not sure if this query is valid anymore: 
 >  
 > http://status.openstack.org/elastic-recheck/#1783405 
 >  
 > If it is, then we still have some tempest tests that aren't marked as  
 > slow but are contributing to job timeouts outside the tempest-slow job.  
 > I know the last time this came up, the QA team had a report of the  
 > slowest non-slow tests - can we get another one of those now? 

This still seems to be a valid query: 7 fails in 24 hrs / 302 fails in 10 days. I did some more categorization of this query by build_name and found the failures break down roughly as:
- tempest-full or tempest-full-py3: ~50%
- tempest-all: 2%
- tempest-slow: 2%
- everything else: the remaining ~46%, spread across all the other jobs
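
In case anyone wants to redo this kind of breakdown later, here is a minimal sketch of the counting step (assuming the matching hits have already been exported from logstash as a list of dicts carrying a build_name field; the export itself is not shown):

    from collections import Counter

    def breakdown_by_build_name(hits):
        # 'hits' is assumed to be a list of dicts exported from logstash,
        # each one representing a failed run and carrying its build_name.
        counts = Counter(hit.get('build_name', 'unknown') for hit in hits)
        total = sum(counts.values())
        for build_name, count in counts.most_common():
            print('%-30s %4d  (%.0f%%)' % (build_name, count, 100.0 * count / total))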

I have proposed modifying the query to exclude the tempest-all and tempest-slow jobs, which run all the slow tests as well:
- https://review.openstack.org/#/c/623949/ 
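
The change is roughly of this shape (only a sketch, with a placeholder for the existing message filter; see the review above for the actual query text):

    query: >-
      <existing message filter> AND
      NOT (build_name:"tempest-slow" OR build_name:"tempest-all")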

While doing another round of marking slow tests, I will check whether we can identify more specific tests that are consistently slow.
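
For reference, "marking" a test slow just means tagging it with the slow attribute via the tempest decorator. An illustrative sketch (the test class and method names below are made up):

    from tempest.lib import decorators
    import testtools

    class ExampleSlowTest(testtools.TestCase):
        # Real tempest tests subclass the tempest base test classes;
        # testtools.TestCase keeps this sketch self-contained.

        @decorators.attr(type='slow')
        def test_expensive_scenario(self):
            pass

Tests tagged this way are excluded from tempest-full/tempest-full-py3 and run in tempest-slow instead.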

-gmann

 >  
 > Another thing is, are there particular voting jobs that have a failure  
 > rate over 50% and are resetting the gate? If we do, we should consider  
 > making them non-voting while project teams work on fixing the issues.  
 > Because I've had approved patches for days now taking 13+ hours just to  
 > fail, which is pretty unsustainable. 
 >  
 > >  
 > > We'll keep pushing to fix the broader issues and are more than happy to help debug failures you hit within your projects as well. 
 > >  
 > > Hopefully this was helpful despite its length. 
 >  
 > Again, thank you Clark for taking the time to write up this summary -  
 > it's extremely useful. 
 >  
 > --  
 >  
 > Thanks, 
 >  
 > Matt 
 >  
 > 
