Gate fracas (status) update

Clark Boylan cboylan at sapwetik.org
Mon Dec 17 21:55:14 UTC 2018


On Wed, Dec 12, 2018, at 11:50 AM, Matt Riedemann wrote:
> I wanted to follow up from Clark's last gate status update [1]. Lots of 
> work has been going on the last few weeks to try and get the gate under 
> control since it's hard to merge code when you can't merge code. Most of 
> my update is specific to nova, but some of it might be interesting to 
> others as well for general QA/infra FYI. I'll group this into a few 
> categories of issue.

Now, a few days later, I figure it is worth another update.

> 

Snip

> 
> Zuul queuing changes
> --------------------
> 
> An infra thread [5] prompted some discussion in IRC which led to changes 
> in how tripleo changes will be queued by zuul [6][7]. The idea here is 
> to isolate tripleo changes into their own queue so failures in tripleo 
> changes don't disrupt (or starve) changes in openstack projects (in 
> general) from getting queued up for test nodes. tl;dr: nova changes 
> should enqueue more like they used to before [5].

https://review.openstack.org/#/c/625645/ has merged, which reorganizes projects into their logical groups for the purposes of relative priority. We hope this is a fairer accounting of priority.
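
For background, relative priority roughly works by giving each node request a priority offset equal to the number of changes from the same group already ahead of it in the pipeline, so one very busy group cannot starve everyone else. A minimal conceptual sketch of that idea (illustrative Python, not Zuul's actual code, and the group names are made up):

from collections import defaultdict

def assign_relative_priority(queued_changes):
    """Give each change a priority offset equal to the number of changes
    from the same logical group already ahead of it in the pipeline.
    Lower numbers get nodes first, so a lightly loaded group is not
    starved by a very busy one.

    queued_changes: list of (change_id, group) in pipeline order.
    """
    seen_per_group = defaultdict(int)
    prioritized = []
    for change_id, group in queued_changes:
        prioritized.append((change_id, group, seen_per_group[group]))
        seen_per_group[group] += 1
    return prioritized

if __name__ == "__main__":
    # Hypothetical pipeline contents: several tripleo changes and two nova ones.
    pipeline = [(1, "tripleo"), (2, "tripleo"), (3, "nova"),
                (4, "tripleo"), (5, "nova"), (6, "tripleo")]
    for change_id, group, priority in assign_relative_priority(pipeline):
        print(f"change {change_id} ({group}): relative priority {priority}")
    # The first nova change still gets priority 0 even with tripleo changes
    # queued ahead of it, which is the "fairer accounting" mentioned above.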

> 
> Gate bugs
> ---------
> 

Snip

> 
> * http://status.openstack.org/elastic-recheck/#1808010
> 
> This is a real snowball issue: the cirros filesystem fills up so the 
> config drive fails, the instance falls back to the metadata API to get 
> networking information, but the metadata API response is too slow and 
> cloud-init times out. I've got a related fix [13] but we likely need 
> someone to help profile where our other inefficiencies are in responding 
> to metadata API requests.

Devstack has been updated to use the cirros 0.3.6 image, which should fix config drive support in cirros. This means config drive based tests will now be exercised properly, but any tests relying on the metadata server will still be affected if it is slow.
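
If anyone wants to help profile the metadata path, a quick way to get a baseline is to time metadata requests from inside a guest and see how close they get to cloud-init's timeout. A minimal sketch, assuming the standard metadata endpoint at 169.254.169.254 (the URL list and sample count are just illustrative):

import time
import urllib.request

# Endpoints of the sort cloud-init hits; adjust to match your deployment.
URLS = [
    "http://169.254.169.254/openstack/latest/meta_data.json",
    "http://169.254.169.254/latest/meta-data/instance-id",
]

def time_request(url, timeout=30):
    """Return elapsed seconds for one metadata request."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.monotonic() - start

if __name__ == "__main__":
    for url in URLS:
        samples = [time_request(url) for _ in range(5)]
        print(f"{url}: min={min(samples):.2f}s "
              f"avg={sum(samples) / len(samples):.2f}s max={max(samples):.2f}s")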

> 
> * http://status.openstack.org/elastic-recheck/#1808063
> 
> This one is also relatively new and I'm not sure what might be causing it.
> 

* http://status.openstack.org/elastic-recheck/index.html#1708704

This bug tracks flaky yum installs. From what I have seen, this is largely due to centos.org repos being unreliable and jobs not using our in-cloud region mirrors. We updated the multinode setup on CentOS in zuul-jobs (https://review.openstack.org/#/c/624817/) to address one case of this, but other jobs are seeing it too. If you run jobs against centos7, you may want to double check that this query doesn't affect your jobs (and fix them if it does).
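
One hypothetical way to check a held node (or a job's collected configs) for this is to scan the yum repo configuration and flag any enabled repo whose URL does not point at a mirror host. A rough sketch; the mirror hostname pattern here is an assumption you would adjust for your environment:

import configparser
import glob

# Assumed substring identifying in-cloud region mirrors; adjust as needed.
MIRROR_SUBSTRING = "mirror"

def find_non_mirror_repos(repo_glob="/etc/yum.repos.d/*.repo"):
    """Return (repo_file, section, url) for enabled repos not using a mirror."""
    offenders = []
    for path in glob.glob(repo_glob):
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(path)
        for section in parser.sections():
            if parser.get(section, "enabled", fallback="1") != "1":
                continue
            url = (parser.get(section, "baseurl", fallback="")
                   or parser.get(section, "mirrorlist", fallback=""))
            if url and MIRROR_SUBSTRING not in url:
                offenders.append((path, section, url))
    return offenders

if __name__ == "__main__":
    for path, section, url in find_non_mirror_repos():
        print(f"{path} [{section}] -> {url}")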

Another change that went in was an update to devstack, https://review.openstack.org/#/c/625269/, to have losetup enable direct I/O on its loopback devices. The thought is that this may make cinder tests, which rely on LVM on loopback devices, more reliable.
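
For reference, the losetup piece boils down to attaching the backing file with direct I/O turned on so writes bypass the host page cache. A rough sketch of the equivalent outside of devstack (the backing file path and size are placeholders; this needs root and a util-linux new enough to support --direct-io):

import subprocess

def attach_loop_device(backing_file, size_mb=2048):
    """Create a sparse backing file and attach it as a loopback device with
    direct I/O enabled, roughly the shape of what the devstack change does
    for the cinder LVM backend. Returns the loop device path, e.g. /dev/loop0."""
    # Allocate (or grow) the sparse backing file.
    subprocess.run(["truncate", "-s", f"{size_mb}M", backing_file], check=True)
    # --direct-io=on opens the backing file with O_DIRECT, avoiding double
    # caching through the host page cache.
    result = subprocess.run(
        ["losetup", "--find", "--show", "--direct-io=on", backing_file],
        check=True, capture_output=True, text=True)
    return result.stdout.strip()

if __name__ == "__main__":
    device = attach_loop_device("/opt/stack/data/cinder-volumes-backing-file")
    print(f"attached {device}")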

> ----
> 
> There are other bugs in the e-r page but the hits are low enough, or 
> they are latent enough, that I won't bother trying to detail them here.
> 
> [1] 
> http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread.html#709
> [2] https://review.openstack.org/#/c/623538/
> [3] https://review.openstack.org/#/q/topic:drop-multiattach-job
> [4] https://bugs.launchpad.net/tempest/+bug/1807723
> [5] 
> http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread.html#482
> [6] https://review.openstack.org/#/c/623595/ - this is the zuul feature
> [7] https://review.openstack.org/#/c/624246/ - the tripleo-ci change
> [8] https://review.openstack.org/#/c/615347/
> [9] https://bugs.launchpad.net/nova/+bug/1807219
> [10] https://bugs.launchpad.net/nova/+bug/1807044
> [11] https://review.openstack.org/#/c/623596
> [12] 
> https://review.openstack.org/#/q/I833d79ecc97ddc844bf156ab64477c7c77424f20
> [13] https://review.openstack.org/#/c/624778
> 

We've seen people show up across many projects to help debug and fix a variety of issues over the last week or two. Thank you to everyone who has helped; the gate does seem a bit happier in recent days (though that may also be a reduction in demand due to the holidays).

That said, there is still quite a bit to clean up based on e-r data. Also, our classification rate is still only about 60%, so that can be improved too. All this is to say: don't let the holiday break undo the progress we've made. I look forward to continuing to debug this stuff with you in the new year.

Clark


