Gate fracas (status) update
I wanted to follow up from Clark's last gate status update [1]. A lot of work has gone on over the last few weeks to try to get the gate under control, since it's hard to merge code when you can't merge code. Most of my update is specific to nova, but some of it might be interesting to others as a general QA/infra FYI. I'll group this into a few categories of issue.

Optimize node usage / reduce blast radius
-----------------------------------------

* The nova-cells-1 job has been moved to the experimental queue [2] so that we can still test that environment, but on demand. This helps us avoid latent cellsv1-specific race failures which might reset the gate.

* I'm trying to drop the nova-multiattach job, which is partially redundant with tempest-full except that it only runs tempest.compute.api tests and it also runs slow tests [3]. What is blocking this is test_volume_swap_with_multiattach in the tempest-slow job intermittently failing during volume cleanup [4]. I have put a lot of debug details into that bug and have a debug logging patch up as well to see if I can recreate the failure and sort out what is going wrong. The main difference between this test in the tempest-slow job and the nova-multiattach job is simply that the tempest-slow job is multinode while nova-multiattach is single node, and the single-node setup might have been hiding some weird race bugs when disconnecting the volume during cleanup if the servers are all on the same host. Help debugging that would be appreciated.

Zuul queuing changes
--------------------

An infra thread [5] prompted some discussion in IRC which led to changes in how tripleo changes will be queued by zuul [6][7]. The idea is to isolate tripleo changes in their own queue so that failures in tripleo changes don't disrupt (or starve) changes in the other openstack projects from getting queued up for test nodes. tl;dr: nova changes should enqueue more like they used to before [5].

Gate bugs
---------

* http://status.openstack.org/elastic-recheck/#1807518

A fix for devstack was merged on both master and stable/rocky but we're still seeing this, so it probably needs more investigation.

* http://status.openstack.org/elastic-recheck/#1783405

We need another deep dive on tempest tests which aren't marked as slow but which might be slow and contributing to overall job timeouts (a short sketch of how tempest marks a test as slow follows at the end of this message).

* http://status.openstack.org/elastic-recheck/#1808171

This one is relatively new. Debug notes are in the launchpad bug. In this test, a server is created with 7 ports and times out waiting for the network-vif-plugged event from neutron on the first port only; the other 6 events are received. For the port that doesn't get the event, there are some weird messages in the neutron agent logs, so there could be a race there, but we likely need help from the neutron team to investigate. I'm not sure what might have prompted this, but maybe the recent change to Ubuntu Bionic nodes brought in a new, and possibly buggy, version of OVS.

* http://status.openstack.org/elastic-recheck/#1800472

The fix for this [8] merged a couple of hours ago and we're seeing it drop off the e-r graph.

* http://status.openstack.org/elastic-recheck/#1806912

There are a couple of separate nova bugs for this [9][10]. The fixes for both are approved and should reduce the amount of time it takes to start nova-api, which will help avoid timeouts on slower test nodes. The fix for [10] also fixes a long-standing rolling upgrade issue, so we'll be backporting that one.
* http://status.openstack.org/elastic-recheck/#1798688

There is a nova fix up for this [11] which has a +2; it's very simple and just needs another nova core (efried would get it but he's out until January).

* http://status.openstack.org/elastic-recheck/#1807520

This has been fixed on all grenade branches [12]. It was very latent (it goes back to pike) but only showed up on slower test nodes.

* http://status.openstack.org/elastic-recheck/#1808010

This is a real snowball issue: the cirros filesystem fills up, so config drive fails and the guest falls back to the metadata API to get its networking information, but the metadata API response is too slow and cloud-init times out. I've got a related fix [13], but we likely need someone to help profile where our other inefficiencies are in responding to metadata API requests.

* http://status.openstack.org/elastic-recheck/#1808063

This one is also relatively new and I'm not sure what might be causing it.

----

There are other bugs in the e-r page but the hits are low enough, or they are latent enough, that I won't bother trying to detail them here.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread....
[2] https://review.openstack.org/#/c/623538/
[3] https://review.openstack.org/#/q/topic:drop-multiattach-job
[4] https://bugs.launchpad.net/tempest/+bug/1807723
[5] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread....
[6] https://review.openstack.org/#/c/623595/ - this is the zuul feature
[7] https://review.openstack.org/#/c/624246/ - the tripleo-ci change
[8] https://review.openstack.org/#/c/615347/
[9] https://bugs.launchpad.net/nova/+bug/1807219
[10] https://bugs.launchpad.net/nova/+bug/1807044
[11] https://review.openstack.org/#/c/623596
[12] https://review.openstack.org/#/q/I833d79ecc97ddc844bf156ab64477c7c77424f20
[13] https://review.openstack.org/#/c/624778

--

Thanks,

Matt
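As promised above for the bug 1783405 item, here is a minimal sketch of how a tempest test gets tagged as slow so that it is excluded from the regular tempest-full runs and picked up by the tempest-slow job instead. The class and test names below are invented purely for illustration; only the decorators and base class come from tempest itself.

from tempest.api.compute import base
from tempest.lib import decorators


class ExampleSlowServersTest(base.BaseV2ComputeTest):
    """Illustrative only; not a real tempest test class."""

    @decorators.attr(type='slow')
    @decorators.idempotent_id('11111111-2222-3333-4444-555555555555')
    def test_expensive_server_scenario(self):
        # A long-running scenario would live here; the 'slow' attribute keeps
        # it out of tempest-full runs that are already close to the job timeout.
        pass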
On Wed, Dec 12, 2018, at 11:50 AM, Matt Riedemann wrote:
I wanted to follow up from Clark's last gate status update [1]. A lot of work has gone on over the last few weeks to try to get the gate under control, since it's hard to merge code when you can't merge code. Most of my update is specific to nova, but some of it might be interesting to others as a general QA/infra FYI. I'll group this into a few categories of issue.
Now a few days later I figure it is worth another update.
Snip
Zuul queuing changes
--------------------
An infra thread [5] prompted some discussion in IRC which led to changes in how tripleo changes will be queued by zuul [6][7]. The idea is to isolate tripleo changes in their own queue so that failures in tripleo changes don't disrupt (or starve) changes in the other openstack projects from getting queued up for test nodes. tl;dr: nova changes should enqueue more like they used to before [5].
https://review.openstack.org/#/c/625645/ has merged, which reorganizes projects into their logical groups for the purposes of relative priority. We hope this is a fairer accounting of priority.
Gate bugs
---------
Snip
* http://status.openstack.org/elastic-recheck/#1808010
This is a real snowball issue: the cirros filesystem fills up, so config drive fails and the guest falls back to the metadata API to get its networking information, but the metadata API response is too slow and cloud-init times out. I've got a related fix [13], but we likely need someone to help profile where our other inefficiencies are in responding to metadata API requests.
Devstack has been updated to use the cirros 0.3.6 image, which should fix config drive support in cirros. This means config-drive-based tests will be exercised properly now, but any tests relying on the metadata server will still be affected if it is slow.
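For anyone unfamiliar with the fallback path being discussed, here is a rough, illustrative sketch (not cirros's actual init code) of what "relying on the metadata server" amounts to inside the guest: an HTTP request to the link-local metadata address that blocks boot until it answers or times out. The particular endpoint and timeout below are just examples.

import urllib.request

# The metadata service listens on this link-local address inside the guest;
# network_data.json is one of the documents a guest can fetch from it.
METADATA_URL = "http://169.254.169.254/openstack/latest/network_data.json"

def fetch_network_data(timeout=10.0):
    """Fetch network data from the metadata service, failing after `timeout` seconds."""
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    try:
        print(fetch_network_data())
    except OSError as exc:
        # A slow or unreachable metadata service surfaces here as a timeout,
        # which is the failure mode described in bug 1808010.
        print("metadata request failed: %s" % exc)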
* http://status.openstack.org/elastic-recheck/#1808063
This one is also relatively new and I'm not sure what might be causing it.
* http://status.openstack.org/elastic-recheck/index.html#1708704

This bug is tracking flaky yum installs. From what I have seen this is largely due to centos.org repos being unreliable and jobs not using our in-cloud region mirrors. We updated the multinode setup for centos in zuul-jobs (https://review.openstack.org/#/c/624817/) to address one case of this, but other jobs are seeing it too. If you run jobs against centos7 you may want to double check that this query doesn't match your jobs (and fix them if it does).

Another change that went in was an update to devstack, https://review.openstack.org/#/c/625269/, to have losetup enable direct-io on its loopback devices. The thought here is that it may make cinder tests which rely on LVM on loopback devices more reliable.
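As a rough illustration of what that devstack change amounts to (this is a hedged sketch, not the actual devstack code, and the backing file path and size are made up), the loopback device that backs the LVM volume group is attached with direct-io enabled via losetup:

import subprocess

# Hypothetical path and size for illustration; devstack derives its own values.
BACKING_FILE = "/opt/stack/data/volumes-backing-file"
SIZE_GB = 24

def attach_loop_device_with_direct_io(backing_file, size_gb):
    """Create a sparse backing file and attach it as a loop device with direct-io."""
    # Create (or grow) a sparse file to back the loop device.
    subprocess.run(["truncate", "-s", "%dG" % size_gb, backing_file], check=True)
    # --find picks a free /dev/loopN, --show prints the attached device,
    # and --direct-io=on (util-linux >= 2.30) enables direct I/O on it.
    result = subprocess.run(
        ["sudo", "losetup", "--find", "--show", "--direct-io=on", backing_file],
        check=True, capture_output=True, text=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print(attach_loop_device_with_direct_io(BACKING_FILE, SIZE_GB))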
----
There are other bugs in the e-r page but the hits are low enough, or they are latent enough, that I won't bother trying to detail them here.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread....
[2] https://review.openstack.org/#/c/623538/
[3] https://review.openstack.org/#/q/topic:drop-multiattach-job
[4] https://bugs.launchpad.net/tempest/+bug/1807723
[5] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread....
[6] https://review.openstack.org/#/c/623595/ - this is the zuul feature
[7] https://review.openstack.org/#/c/624246/ - the tripleo-ci change
[8] https://review.openstack.org/#/c/615347/
[9] https://bugs.launchpad.net/nova/+bug/1807219
[10] https://bugs.launchpad.net/nova/+bug/1807044
[11] https://review.openstack.org/#/c/623596
[12] https://review.openstack.org/#/q/I833d79ecc97ddc844bf156ab64477c7c77424f20
[13] https://review.openstack.org/#/c/624778
We've seen people from many projects show up to help debug and fix a variety of issues over the last week or two. Thank you to everyone who has helped; the gate does seem a bit happier in recent days (though that may also be due to a reduction in demand over the holidays). That said, there is still quite a bit to clean up based on the e-r data, and our classification rate is still only about 60%, so that can be improved too. All this to say: don't let the holiday break undo the progress we've made. I look forward to continuing to debug this stuff with you in the new year.

Clark