Update on gate status for the new year
cboylan at sapwetik.org
Fri Jan 4 18:12:25 UTC 2019
I'm still not entirely caught up on everything after the holidays, but thought I would attempt to do another update on gate reliability issues since those were well received last month.
Overall, things look pretty good based on elastic-recheck data. That said, I think this is mostly due to low test volume over the holidays and our 10-day index window. We should revisit this next week or the week after to get a more accurate view of things.
On the infra team side of things, we've got quota issues in a cloud region that have decreased our test node capacity. We're waiting on people to return from the holidays to take a look at that. We also started tracking hypervisor IDs for our test instances (thank you pabelanger) to try to help identify when specific hypervisors might be the cause of some of our issues. https://review.openstack.org/628642 is a follow-up to index that data with our job log data in Elasticsearch.
We've seen some ssh failures in tripleo jobs on limestone, and neutron and zuul report constrained IOPS there, resulting in failed database migrations. I think the idea with 628642 is to see if we can narrow that down to specific hypervisors.
On the project side of things, our categorization rates are quite low. If your changes are evicted from the gate due to failures, it would be helpful if you could spend a few minutes trying to identify and fingerprint those failures.
We'll check back in a week or two when we should have a much better data set to look at.