I wanted to follow up on Clark's last gate status update [1]. A lot of work has gone on over the last few weeks to try to get the gate under control, since it's hard to merge code when you can't merge code. Most of my update is specific to nova, but some of it might be of general QA/infra interest to others as well. I'll group this into a few categories of issue.

Optimize node usage / reduce blast radius
-----------------------------------------

* The nova-cells-v1 job has been moved to the experimental queue [2] so that we can still test that environment, but on demand. This helps us avoid latent cellsv1-specific race failures which might reset the gate.

* I'm trying to drop the nova-multiattach job, which is partially redundant with tempest-full except that it only runs tempest.api.compute tests and it also runs slow tests [3]. What is blocking this is test_volume_swap_with_multiattach intermittently failing during volume cleanup in the tempest-slow job [4]. I have put a lot of debug details into that bug and have a debug logging patch up as well to see if I can recreate the failure and sort out what is going wrong. The main difference between this test in the tempest-slow job and the nova-multiattach job is simply that tempest-slow is multinode and nova-multiattach is single node, and the single-node setup might have been hiding some weird race bugs around disconnecting the volume during cleanup when the servers are all on the same host. Help debugging that would be appreciated.

Zuul queuing changes
--------------------

An infra thread [5] prompted some discussion in IRC which led to changes in how tripleo changes will be queued by zuul [6][7]. The idea here is to isolate tripleo changes into their own queue so failures in tripleo changes don't disrupt (or starve) changes in other openstack projects from getting queued up for test nodes.

tl;dr: nova changes should enqueue more like they used to before [5].

Gate bugs
---------

* http://status.openstack.org/elastic-recheck/#1807518

A fix for devstack was merged on both master and stable/rocky but we're still seeing this, so it probably needs more investigation.

* http://status.openstack.org/elastic-recheck/#1783405

We need another deep dive on tempest tests which aren't marked as slow but which might be slow and contributing to overall job timeouts (a sketch of how that marking looks is a few items further down).

* http://status.openstack.org/elastic-recheck/#1808171

This is relatively new. Debug notes are in the launchpad bug. In this test, a server is created with 7 ports and nova times out waiting for the network-vif-plugged event from neutron on the first port only; the other 6 events are received. For the port that doesn't get the event, there are some weird messages in the neutron agent logs, so there could be a race there, but we likely need help from the neutron team to investigate. I'm not sure what might have prompted this, but maybe the recent switch to Ubuntu Bionic nodes, which brings a newer version of OVS, exposed a bug?

* http://status.openstack.org/elastic-recheck/#1800472

The fix for this [8] merged a couple of hours ago and we're seeing it drop off the e-r graph.

* http://status.openstack.org/elastic-recheck/#1806912

There are a couple of separate nova bugs for this [9][10]. Fixes for both are approved and should reduce the amount of time it takes to start nova-api, which will help avoid timeouts on slower test nodes. The fix for [10] also fixes a long-standing rolling upgrade issue, so we'll be backporting that one.
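Since "slow" test tagging comes up a couple of times above (dropping the nova-multiattach job, and bug 1783405), here is a minimal, illustrative sketch of how a tempest test gets tagged as slow. The class and test names below are made up for illustration; only the decorators are the real tempest ones:

    # Illustrative sketch only -- the class and test names are
    # hypothetical, but decorators.attr(type='slow') is how tempest
    # tests get tagged as slow.
    from tempest.api.compute import base
    from tempest.lib import decorators


    class ExampleSlowScenarioTest(base.BaseV2ComputeTest):
        """Hypothetical test used only to show the slow tagging."""

        @decorators.attr(type='slow')
        @decorators.idempotent_id('00000000-0000-0000-0000-000000000000')
        def test_example_long_running_scenario(self):
            # A long-running scenario would live here; the 'slow' attr
            # keeps it out of the regular tempest-full run so a job
            # like tempest-slow can pick it up instead.
            pass

The deep dive for bug 1783405 is basically about finding tests that should carry that attr but currently don't.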
* http://status.openstack.org/elastic-recheck/#1798688

There is a nova fix up for this [11] which has a +2; it's very simple and just needs another nova core (efried would get to it but he's out until January).

* http://status.openstack.org/elastic-recheck/#1807520

This has been fixed on all grenade branches [12]. It was very latent (it goes back to pike) but only showed up on slower test nodes.

* http://status.openstack.org/elastic-recheck/#1808010

This is a real snowball issue: the cirros filesystem fills up, so the config drive setup fails and the guest falls back to the metadata API to get its networking information, but the metadata API response is too slow and cloud-init times out. I've got a related fix [13], but we likely need someone to help profile where our other inefficiencies are in responding to metadata API requests.

* http://status.openstack.org/elastic-recheck/#1808063

This one is also relatively new and I'm not sure what might be causing it.

----

There are other bugs on the e-r page, but the hits are low enough, or they are latent enough, that I won't bother trying to detail them here.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread....
[2] https://review.openstack.org/#/c/623538/
[3] https://review.openstack.org/#/q/topic:drop-multiattach-job
[4] https://bugs.launchpad.net/tempest/+bug/1807723
[5] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread....
[6] https://review.openstack.org/#/c/623595/ - this is the zuul feature
[7] https://review.openstack.org/#/c/624246/ - the tripleo-ci change
[8] https://review.openstack.org/#/c/615347/
[9] https://bugs.launchpad.net/nova/+bug/1807219
[10] https://bugs.launchpad.net/nova/+bug/1807044
[11] https://review.openstack.org/#/c/623596
[12] https://review.openstack.org/#/q/I833d79ecc97ddc844bf156ab64477c7c77424f20
[13] https://review.openstack.org/#/c/624778

--

Thanks,

Matt