Open Stack

Fri Sep 28 20:12:40 UTC 2018

On Wed, Sep 19, 2018, at 12:11 PM, Clark Boylan wrote:
> Hello everyone,
> 
> You may have noticed there is a large Zuul job backlog and changes are 
> not getting CI reports as quickly as you might expect. There are several 
> factors interacting with each other to make this the case. The short 
> version is that one of our clouds is performing upgrades and has been 
> removed from service, and we have a large number of gate failures which 
> cause things to reset and start over. We have fewer resources than 
> normal and are using them inefficiently. Zuul is operating as expected.
> 
> Continue reading if you'd like to understand the technical details and 
> find out how you can help make this better.
> 
> Zuul gates related projects in shared queues. Changes enter these queues 
> and are ordered in a speculative future state that Zuul assumes will 
> pass because multiple humans have reviewed the changes and said they are 
> good (also they had to pass check testing first). Problems arise when 
> tests fail forcing Zuul to evict changes from the speculative future 
> state, build a new state, then start jobs over again for this new 
> future.
> 
> Typically this doesn't happen often and we merge many changes at a time, 
> quickly pushing code into our repos. Unfortunately, the results are 
> painful when we fail often as we end up rebuilding future states and 
> restarting jobs often. Currently we have the gate and release jobs set 
> to the highest priority as well so they run jobs before other queues. 
> This means the gate can starve other work if it is flaky. We've 
> configured things this way because the gate is not supposed to be flaky 
> since we've reviewed things and already passed check testing. One of the 
> tools we have in place to make this less painful is each gate queue 
> operates on a window that grows and shrinks similar to how TCP 
> slowstart. As changes merge we increase the size of the window and when 
> they fail to merge we decrease it. This reduces the size of the future 
> state that must be rebuilt and retested on failure when things are 
> persistently flaky.
> 
> The best way to make this better is to fix the bugs in our software, 
> whether that is in the CI system itself or the software being tested. 
> The first step in doing that is to identify and track the bugs that we 
> are dealing with. We have a tool called elastic-recheck that does this 
> using indexed logs from the jobs. The idea there is to go through the 
> list of unclassified failures [0] and fingerprint them so that we can 
> track them [1]. With that data available we can then prioritize fixing 
> the bugs that have the biggest impact.
> 
> Unfortunately, right now our classification rate is very poor (only 
> 15%), which makes it difficult to know what exactly is causing these 
> failures. Mriedem and I have quickly scanned the unclassified list, and 
> it appears there is a db migration testing issue causing these tests to 
> timeout across several projects. Mriedem is working to get this 
> classified and tracked which should help, but we will also need to fix 
> the bug. On top of that it appears that Glance has flaky functional 
> tests (both python2 and python3) which are causing resets and should be 
> looked into.
> 
> If you'd like to help, let mriedem or myself know and we'll gladly work 
> with you to get elasticsearch queries added to elastic-recheck. We are 
> likely less help when it comes to fixing functional tests in Glance, but 
> I'm happy to point people in the right direction for that as much as I 
> can. If you can take a few minutes to do this before/after you issue a 
> recheck it does help quite a bit.
> 
> One general thing I've found would be helpful is if projects can clean 
> up the deprecation warnings in their log outputs. The persistent 
> "WARNING you used the old name for a thing" messages make the logs large 
> and much harder to read to find the actual failures.
> 
> As a final note this is largely targeted at the OpenStack Integrated 
> gate (Nova, Glance, Cinder, Keystone, Swift, Neutron) since that appears 
> to be particularly flaky at the moment. The Zuul behavior applies to 
> other gate pipelines (OSA, Tripleo, Airship, etc) as does elastic-
> recheck and related tooling. If you find your particular pipeline is 
> flaky I'm more than happy to help in that context as well.
> 
> [0] http://status.openstack.org/elastic-recheck/data/integrated_gate.html
> [1] http://status.openstack.org/elastic-recheck/gate.html

I was asked to write a followup to this as the long Zuul queues have persisted through this week. Largely because the situation from last week hasn't changed much. We were down the upgraded cloud region while we worked around a network configuration bug, then once that was addressed we ran into neutron port assignment and deletion issues. We think these are both fixed and we are running in this region again as of today.

Other good news is our classification rate is up significantly. We can use that information to go through the top identified gate bugs:

Network Connectivity issues to test nodes [2]. This is the current top of the list, but I think its impact is relatively small. What is happening here is jobs fail to connect to their test nodes early in the pre-run playbook and then fail. Zuul will rerun these jobs for us because they failed in the pre-run step. Prior to zuulv3 we had nodepool run a ready script before marking test nodes as ready, this script would've caught and filtered out these broken network nodes early. We now notice them late during the pre-run of a job.

Pip fails to find distribution for package [3]. Earlier in the week we had the in region mirror fail in two different regions for unrelated errors. These mirrors were fixed and the only other hits for this bug come from Ara which tried to install the 'black' package on python3.5 but this package requires python>=3.6.

yum, no more mirrors to try [4]. At first glance this appears to be an infrastructure issue because the mirror isn't serving content to yum. On further investigation it turned out to be a DNS resolution issue caused by the installation of designate in the tripleo jobs. Tripleo is aware of this issue and working to correct it.

Stackviz failing on py3 [5]. This is a real bug in stackviz caused by subunit data being binary not utf8 encoded strings. I've written a fix for this problem at https://review.openstack.org/606184, but in doing so found that this was a known issue back in March and there was already a proposed fix, https://review.openstack.org/#/c/555388/3. It would be helpful if the QA team could care for this project and get a fix in. Otherwise, we should consider disabling stackviz on our tempest jobs (though the output from stackviz is often useful).

There are other bugs being tracked by e-r. Some are bugs in the openstack software and I'm sure some are also bugs in the infrastructure. I have not yet had the time to work through the others though. It would be helpful if project teams could prioritize the debugging and fixing of these issues though.

[2] http://status.openstack.org/elastic-recheck/gate.html#1793370
[3] http://status.openstack.org/elastic-recheck/gate.html#1449136
[4] http://status.openstack.org/elastic-recheck/gate.html#1708704
[5] http://status.openstack.org/elastic-recheck/gate.html#1758054

Clark

Open Stack

[openstack-dev] [all] Zuul job backlog

OpenStack

Community

Documentation

Branding & Legal