I wanted to follow up on Clark's last gate status update [1]. A lot of work has gone on over the last few weeks to try to get the gate under control, since it's hard to merge code when you can't merge code. Most of my update is specific to nova, but some of it might be of general QA/infra interest to others as well. I'll group this into a few categories of issue.

Optimize node usage / reduce blast radius
-----------------------------------------

* The nova-cells-v1 job has been moved to the experimental queue [2] so that we can still test that environment, but on demand. This helps us avoid latent cellsv1-specific race failures which might reset the gate.

* I'm trying to drop the nova-multiattach job, which is partially redundant with tempest-full except that it only runs tempest.api.compute tests and it also runs slow tests [3]. What is blocking this is test_volume_swap_with_multiattach intermittently failing during volume cleanup in the tempest-slow job [4]. I have put a lot of debug details into that bug and have a debug logging patch up as well to see if I can recreate the failure and sort out what is going wrong. The main difference between this test in the tempest-slow job and the nova-multiattach job is simply that tempest-slow is multinode and nova-multiattach is single node, and the single-node setup might have been hiding some weird race bugs around disconnecting the volume during cleanup when the servers are all on the same host. Help debugging that would be appreciated.

Zuul queuing changes
--------------------

An infra thread [5] prompted some discussion in IRC which led to changes in how tripleo changes will be queued by zuul [6][7]. The idea here is to isolate tripleo changes into their own queue so failures in tripleo changes don't disrupt (or starve) changes in other openstack projects from getting queued up for test nodes.

tl;dr: nova changes should enqueue more like they used to before [5].

Gate bugs
---------

* http://status.openstack.org/elastic-recheck/#1807518

A fix for devstack was merged on both master and stable/rocky but we're still seeing this, so it probably needs more investigation.

* http://status.openstack.org/elastic-recheck/#1783405

We need another deep dive on tempest tests which aren't marked as slow but which might be slow and contributing to overall job timeouts (a sketch of how that marking looks is a few items further down).

* http://status.openstack.org/elastic-recheck/#1808171

This is relatively new. Debug notes are in the launchpad bug. In this test, a server is created with 7 ports and nova times out waiting for the network-vif-plugged event from neutron on the first port only; the other 6 events are received. For the port that doesn't get the event, there are some weird messages in the neutron agent logs, so there could be a race there, but we likely need help from the neutron team to investigate. I'm not sure what might have prompted this, but maybe the recent switch to Ubuntu Bionic nodes, which brings a newer version of OVS, exposed a bug?

* http://status.openstack.org/elastic-recheck/#1800472

The fix for this [8] merged a couple of hours ago and we're seeing it drop off the e-r graph.

* http://status.openstack.org/elastic-recheck/#1806912

There are a couple of separate nova bugs for this [9][10]. Fixes for both are approved and should reduce the amount of time it takes to start nova-api, which will help avoid timeouts on slower test nodes. The fix for [10] also fixes a long-standing rolling upgrade issue, so we'll be backporting that one.
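Since "slow" test tagging comes up a couple of times above (dropping the nova-multiattach job, and bug 1783405), here is a minimal, illustrative sketch of how a tempest test gets tagged as slow. The class and test names below are made up for illustration; only the decorators are the real tempest ones:

    # Illustrative sketch only -- the class and test names are
    # hypothetical, but decorators.attr(type='slow') is how tempest
    # tests get tagged as slow.
    from tempest.api.compute import base
    from tempest.lib import decorators


    class ExampleSlowScenarioTest(base.BaseV2ComputeTest):
        """Hypothetical test used only to show the slow tagging."""

        @decorators.attr(type='slow')
        @decorators.idempotent_id('00000000-0000-0000-0000-000000000000')
        def test_example_long_running_scenario(self):
            # A long-running scenario would live here; the 'slow' attr
            # keeps it out of the regular tempest-full run so a job
            # like tempest-slow can pick it up instead.
            pass

The deep dive for bug 1783405 is basically about finding tests that should carry that attr but currently don't.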
* http://status.openstack.org/elastic-recheck/#1798688

There is a nova fix up for this [11] which has a +2; it's very simple and just needs another nova core (efried would get to it but he's out until January).

* http://status.openstack.org/elastic-recheck/#1807520

This has been fixed on all grenade branches [12]. It was very latent (it goes back to pike) but only showed up on slower test nodes.

* http://status.openstack.org/elastic-recheck/#1808010

This is a real snowball issue: the cirros filesystem fills up, so the config drive setup fails and the guest falls back to the metadata API to get its networking information, but the metadata API response is too slow and cloud-init times out. I've got a related fix [13], but we likely need someone to help profile where our other inefficiencies are in responding to metadata API requests.

* http://status.openstack.org/elastic-recheck/#1808063

This one is also relatively new and I'm not sure what might be causing it.

----

There are other bugs on the e-r page, but the hits are low enough, or they are latent enough, that I won't bother trying to detail them here.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread....
[2] https://review.openstack.org/#/c/623538/
[3] https://review.openstack.org/#/q/topic:drop-multiattach-job
[4] https://bugs.launchpad.net/tempest/+bug/1807723
[5] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread....
[6] https://review.openstack.org/#/c/623595/ - this is the zuul feature
[7] https://review.openstack.org/#/c/624246/ - the tripleo-ci change
[8] https://review.openstack.org/#/c/615347/
[9] https://bugs.launchpad.net/nova/+bug/1807219
[10] https://bugs.launchpad.net/nova/+bug/1807044
[11] https://review.openstack.org/#/c/623596
[12] https://review.openstack.org/#/q/I833d79ecc97ddc844bf156ab64477c7c77424f20
[13] https://review.openstack.org/#/c/624778

--

Thanks,

Matt