Re: [placement][tripleo][infra] zuul job dependencies for greater good?

28 Feb 2019


On 28.02.2019 15:16, Bogdan Dobrelya wrote:
...
...
On Mon, 2019-02-25 at 19:42 -0500, Clark Boylan wrote:
...
On Mon, Feb 25, 2019, at 12:51 PM, Ben Nemec wrote:
snip
...
That said, I wouldn't push too hard in either direction until someone 
crunched the numbers and figured out how much time it would have 
saved to not run long tests on patch sets with failing unit tests. I 
feel like it's probably possible to figure that out, and if so then 
we should do it before making any big decisions on this.
For numbers the elastic-recheck tool [0] gives us fairly accurate 
tracking of which issues in the system cause tests to fail. You can 
use this as a starting point to potentially figure out how expensive 
indentation errors caught by the pep8 jobs end up being or how often 
unit tests fail. You will probably need to tweak the queries there to 
get that specific, though.
Periodically I also dump node resource utilization by project, repo, 
and job [1]. I haven't automated this because Tobiash has written a 
much better thing that has Zuul inject this into graphite and we 
should be able to set up a grafana dashboard for that in the future 
instead.
These numbers won't tell a whole story, but should paint a fairly 
accurate high level picture of the types of things we should look at 
to be more node efficient and "time in gate" efficient. Looking at 
these two really quickly myself, it seems that job timeouts are a big 
cost (is anyone looking into why our jobs time out?).
[0] http://status.openstack.org/elastic-recheck/index.html
[1] http://paste.openstack.org/show/746083/
Hope this helps,
Clark
Here are some numbers [0] extracted via elastic-recheck console queries. 
They show 6% of failures wasted because of tox issues in general, and 3% 
for tripleo projects in particular.
My final take is that, given some middle-ground solution like the one I 
illustrated earlier in this sub-thread, it might be worth the effort: 
boosting the total throughput of the openstack CI system by 6% is not 
such a bad idea.
Also, I forgot to mention that the 6% share of failed jobs translates 
non-linearly into resources returned to the pool, given the numbers [1] 
from Clark:

Top 20 repos by resource usage:
openstack/neutron: 198367474.73s, 16.95%
openstack/tripleo-heat-templates: 174221496.83s, 14.89%
...

And a 3% cut from t-h-t (14.89%) frees up 0.45% of total resources 
for the pool, which could shorten total pipeline wait times for CI jobs 
for, say, manila by *half*:

openstack/manila: 11137905.75s, 0.95%
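
To make the arithmetic above easy to re-check, here is a minimal sketch. The 
shares come from the numbers [1] quoted above. Applying the 3% failure rate 
directly to node time is an assumption (failed-job counts need not map 1:1 to 
node seconds), and the function name is only illustrative:

```python
# Shares of total node time, taken from the paste [1] quoted above.
THT_SHARE = 14.89      # openstack/tripleo-heat-templates, percent of total
MANILA_SHARE = 0.95    # openstack/manila, percent of total

# Assumption: the 3% tox-failure rate for tripleo maps directly to node time.
WASTED_FRACTION = 0.03


def freed_share(repo_share_pct, wasted_fraction):
    """Percent of total node time freed if the wasted runs are skipped."""
    return repo_share_pct * wasted_fraction


freed = freed_share(THT_SHARE, WASTED_FRACTION)
ratio_to_manila = freed / MANILA_SHARE

print(f"freed: {freed:.2f}% of total node time")
print(f"relative to manila's usage: {ratio_to_manila:.2f}")
```

Under that assumption the freed share is a bit under half of what manila 
consumes in total. How that translates into actual wait times depends on 
queueing, which this sketch does not model.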

[1] http://paste.openstack.org/show/746083/
...
[0] http://paste.openstack.org/show/746503/
-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando