Gate fracas (status) update
Matt Riedemann
mriedemos at gmail.com
Wed Dec 12 19:50:53 UTC 2018
I wanted to follow up on Clark's last gate status update [1]. Lots of
work has gone on over the last few weeks to try to get the gate under
control, since it's hard to merge code when you can't merge code. Most of
my update is specific to nova, but some of it might be interesting to
others as well for general QA/infra FYI. I'll group this into a few
categories of issue.
Optimize node usage / reduce blast radius
-----------------------------------------
* The nova-cells-1 job has been moved to the experimental queue [2] so
that we can still test that environment but do it on-demand. This helps
us avoid latent cellsv1-specific race failures which might reset the gate.
* I'm trying to drop the nova-multiattach job, which is partially
redundant with tempest-full except that it only runs tempest.compute.api
tests and also runs slow tests [3]. What is blocking this is
test_volume_swap_with_multiattach in the tempest-slow job intermittently
failing during volume cleanup [4]. I have put a lot of debug details
into that bug and also have a debug logging patch up to see if I can
recreate the failure and sort out what is going wrong. The main
difference for this test between the tempest-slow job and the
nova-multiattach job is that tempest-slow is multinode while
nova-multiattach is single node; the single-node aspect might have been
hiding some weird race bugs when disconnecting the volume during cleanup
if the servers are all on the same host. Help debugging that would be
appreciated.
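For anyone who wants to poke at it, here is a minimal sketch of the
sort of extra logging around the volume disconnect path I have in mind;
this is not the actual debug patch, and the function and argument names
are made up:

    # Hypothetical illustration only, not the actual patch: wrap the
    # volume disconnect call used during cleanup with before/after debug
    # logging so the ordering (and any failure) shows up in the job logs.
    import logging

    LOG = logging.getLogger(__name__)


    def disconnect_volume_with_debug(disconnect, connection_info,
                                     instance_uuid):
        serial = connection_info.get('serial')
        LOG.debug('Disconnecting volume %s from instance %s',
                  serial, instance_uuid)
        try:
            disconnect(connection_info)
        except Exception:
            LOG.exception('Disconnecting volume %s from instance %s failed',
                          serial, instance_uuid)
            raise
        LOG.debug('Disconnected volume %s from instance %s',
                  serial, instance_uuid)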
Zuul queuing changes
--------------------
An infra thread [5] prompted some discussion in IRC which led to changes
in how tripleo changes will be queued by Zuul [6][7]. The idea is to
isolate tripleo changes in their own queue so that failures on tripleo
changes don't disrupt (or starve) changes in other openstack projects
waiting to get queued up for test nodes. tl;dr: nova changes should
enqueue more like they used to before [5].
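To make the blast radius point concrete, here is a toy Python sketch of
why a separate queue helps; this is not Zuul code, and the project and
change names are made up. In a dependent pipeline a failing change only
forces a retest of the changes queued behind it in the same shared
queue:

    # Toy sketch only (not Zuul code): count how many queued changes get
    # reset and retested when one change fails, with and without a
    # shared queue.
    from collections import defaultdict


    def changes_reset_by_failure(queued, failing_change, shared_queue):
        queues = defaultdict(list)
        for project, change in queued:
            queue_name = 'integrated' if shared_queue else project
            queues[queue_name].append(change)
        for changes in queues.values():
            if failing_change in changes:
                # Only changes behind the failure in its own queue reset.
                return len(changes) - changes.index(failing_change) - 1
        return 0


    queued = [('nova', 'nova-1'), ('tripleo', 'tripleo-1'),
              ('nova', 'nova-2'), ('tripleo', 'tripleo-2')]
    # Shared queue: the tripleo failure resets nova-2 and tripleo-2.
    print(changes_reset_by_failure(queued, 'tripleo-1', shared_queue=True))
    # Separate queues: only tripleo-2 gets reset.
    print(changes_reset_by_failure(queued, 'tripleo-1', shared_queue=False))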
Gate bugs
---------
* http://status.openstack.org/elastic-recheck/#1807518
A fix for devstack was merged on both master and stable/rocky but we're
still seeing this, so it probably needs more investigation.
* http://status.openstack.org/elastic-recheck/#1783405
We need another deep dive on tempest tests which aren't marked as slow
but which might be slow and contribute to overall job timeouts (see the
slow-test tagging sketch after this list).
* http://status.openstack.org/elastic-recheck/#1808171
This is relatively new; debug notes are in the launchpad bug. In this
test, a server is created with 7 ports and nova times out waiting for
the network-vif-plugged event from neutron on the first port only; the
other 6 events are received. For the port that doesn't get the event
there are some weird messages in the neutron agent logs, so there could
be a race there, but we likely need help from the neutron team to
investigate. I'm not sure what might have prompted this, but maybe the
recent move to Ubuntu Bionic nodes brought in a newer (and possibly
buggy) version of OVS?
* http://status.openstack.org/elastic-recheck/#1800472
The fix for this [8] merged a couple of hours ago and we're seeing it
drop off the e-r graph.
* http://status.openstack.org/elastic-recheck/#1806912
There are a couple of separate nova bugs for this [9][10]. The fixes
for both are approved and should reduce the time it takes to start
nova-api, which will help avoid timeouts on slower test nodes. The fix
for [10] also fixes a long-standing rolling upgrade issue, so we'll be
backporting that one.
* http://status.openstack.org/elastic-recheck/#1798688
There is a nova fix up for this [11] which already has a +2; it's very
simple and just needs another nova core (efried would get it but he's
out until January).
* http://status.openstack.org/elastic-recheck/#1807520
This has been fixed on all grenade branches [12] and was very latent
(goes back to pike) but only showed up on slower test nodes.
* http://status.openstack.org/elastic-recheck/#1808010
This is a real snowball issue: the cirros filesystem fills up, so
config drive fails and the guest falls back to the metadata API to get
networking information, but the metadata API response is too slow and
cloud-init times out. I've got a related fix [13], but we likely need
someone to help profile where our other inefficiencies are in responding
to metadata API requests (see the request-timing sketch after this
list).
* http://status.openstack.org/elastic-recheck/#1808063
This one is also relatively new and I'm not sure what might be causing it.
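As a reference for the slow-test deep dive mentioned above (1783405),
this is a minimal sketch of how a tempest test gets tagged as slow so
that jobs like tempest-full can filter it out; the test class and
method names here are made up, and real tests subclass the tempest base
classes for their area rather than testtools directly:

    # Hypothetical example test; the 'slow' attribute tagging is the
    # only point here.
    import testtools

    from tempest.lib import decorators


    class ExampleScenarioTest(testtools.TestCase):

        @decorators.attr(type='slow')
        def test_expensive_scenario(self):
            # A long-running scenario would go here. Jobs that exclude
            # tests tagged 'slow' (like tempest-full) skip it, while
            # tempest-slow selects it.
            pass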
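And for the metadata API profiling mentioned under 1808010, here is a
rough sketch of the kind of WSGI timing wrapper one could use as a
starting point to see where response time is going; this is not nova
code and the names are made up:

    # Rough illustration only: wrap a WSGI app (e.g. the metadata API)
    # so every request logs how long it took to answer.
    import logging
    import time

    LOG = logging.getLogger(__name__)


    class RequestTimer(object):
        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            start = time.time()
            try:
                return self.app(environ, start_response)
            finally:
                LOG.info('%s %s took %.3f seconds',
                         environ.get('REQUEST_METHOD'),
                         environ.get('PATH_INFO'),
                         time.time() - start)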
----
There are other bugs in the e-r page but the hits are low enough, or
they are latent enough, that I won't bother trying to detail them here.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread.html#709
[2] https://review.openstack.org/#/c/623538/
[3] https://review.openstack.org/#/q/topic:drop-multiattach-job
[4] https://bugs.launchpad.net/tempest/+bug/1807723
[5] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/thread.html#482
[6] https://review.openstack.org/#/c/623595/ - this is the zuul feature
[7] https://review.openstack.org/#/c/624246/ - the tripleo-ci change
[8] https://review.openstack.org/#/c/615347/
[9] https://bugs.launchpad.net/nova/+bug/1807219
[10] https://bugs.launchpad.net/nova/+bug/1807044
[11] https://review.openstack.org/#/c/623596
[12] https://review.openstack.org/#/q/I833d79ecc97ddc844bf156ab64477c7c77424f20
[13] https://review.openstack.org/#/c/624778
--
Thanks,
Matt