On 28.02.2019 15:16, Bogdan Dobrelya wrote:
On Mon, 2019-02-25 at 19:42 -0500, Clark Boylan wrote:
On Mon, Feb 25, 2019, at 12:51 PM, Ben Nemec wrote:
snip
That said, I wouldn't push too hard in either direction until someone crunched the numbers and figured out how much time it would have saved to not run long tests on patch sets with failing unit tests. I feel like it's probably possible to figure that out, and if so then we should do it before making any big decisions on this.
For numbers, the elastic-recheck tool [0] gives us fairly accurate tracking of which issues in the system cause tests to fail. You can use this as a starting point to potentially figure out how expensive indentation errors caught by the pep8 jobs end up being, or how often unit tests fail. You probably need to tweak the queries there to get that specific, though.
Periodically I also dump node resource utilization by project, repo, and job [1]. I haven't automated this because Tobiash has written a much better thing that has Zuul inject this into graphite and we should be able to set up a grafana dashboard for that in the future instead.
These numbers won't tell the whole story, but should paint a fairly accurate high-level picture of the types of things we should look at to be more node efficient and "time in gate" efficient. Looking at these two really quickly myself, it seems that job timeouts are a big cost (anyone looking into why our jobs time out?).
[0] http://status.openstack.org/elastic-recheck/index.html
[1] http://paste.openstack.org/show/746083/
Hope this helps, Clark
Here are some numbers [0] extracted via elastic-recheck console queries. They show that 6% of failures are wasted on tox issues in general, and 3% for tripleo projects in particular.
My final take: given some middle-ground solution, like the one I illustrated earlier in this sub-thread, it might be worth the effort. Boosting the total throughput of the OpenStack CI system by 6% is not such a bad idea.
Also, I forgot to mention that the 6% of failed jobs translates non-linearly into the saved resources pool, given the numbers [1] from Clark:

Top 20 repos by resource usage:
openstack/neutron: 198367474.73s, 16.95%
openstack/tripleo-heat-templates: 174221496.83s, 14.89%
...

A 3% cut from t-h-t (14.89%) frees up 0.45% of total resources into the pool, which could shorten total pipeline wait times for CI jobs for, say, manila by a *half*:

openstack/manila: 11137905.75s, 0.95%

[1] http://paste.openstack.org/show/746083/
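
The arithmetic behind that estimate can be checked with a short sketch. It uses only the shares quoted from [1]. The function name is made up for illustration, and it assumes that a 3% share of failed jobs maps directly to 3% of the repo's node time, which is a simplification. It only compares the freed share with manila's share; it does not model queue wait times.

```python
# Shares of total node time (percent), as quoted from Clark's paste [1].
REPO_SHARE_PCT = {
    "openstack/neutron": 16.95,
    "openstack/tripleo-heat-templates": 14.89,
    "openstack/manila": 0.95,
}


def freed_share_pct(repo_share_pct, wasted_fraction):
    """Percent of total resources freed if wasted_fraction of a repo's usage is avoided.

    Assumption: wasted jobs cost the same node time as the repo's average job.
    """
    return repo_share_pct * wasted_fraction


# 3% of t-h-t's 14.89% share of the total.
tht_freed = freed_share_pct(REPO_SHARE_PCT["openstack/tripleo-heat-templates"], 0.03)

# How large the freed share is relative to everything manila consumes.
manila_ratio = tht_freed / REPO_SHARE_PCT["openstack/manila"]

print(f"freed by t-h-t: {tht_freed:.2f}% of total resources")
print(f"relative to manila's share: {manila_ratio:.0%}")
```

So the freed share is a bit under half of what manila uses in total, which is where the "half" comes from.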
-- Best regards, Bogdan Dobrelya, Irc #bogdando