[placement][tripleo][infra] zuul job dependencies for greater good?

Bogdan Dobrelya bdobreli at redhat.com
Thu Feb 28 14:23:34 UTC 2019


On 28.02.2019 15:16, Bogdan Dobrelya wrote:
>> On Mon, 2019-02-25 at 19:42 -0500, Clark Boylan wrote:
>>> On Mon, Feb 25, 2019, at 12:51 PM, Ben Nemec wrote:
>>>
>>
>> snip
>>
>>> That said, I wouldn't push too hard in either direction until someone 
>>> crunched the numbers and figured out how much time it would have 
>>> saved to not run long tests on patch sets with failing unit tests. I 
>>> feel like it's probably possible to figure that out, and if so then 
>>> we should do it before making any big decisions on this.
>>
>> For numbers the elastic-recheck tool [0] gives us fairly accurate 
>> tracking of which issues in the system cause tests to fail. You can 
>> use this as a starting point to potentially figure out how expensive 
>> indentation errors caught by the pep8 jobs end up being or how often 
>> unittests fail. You probably need to tweak the queries there to get 
>> that specific though.
>>
>> Periodically I also dump node resource utilization by project, repo, 
>> and job [1]. I haven't automated this because Tobiash has written a 
>> much better thing that has Zuul inject this into graphite and we 
>> should be able to set up a grafana dashboard for that in the future 
>> instead.
>>
>> These numbers won't tell a whole story, but should paint a fairly 
>> accurate high level picture of the types of things we should look at 
>> to be more node efficient and "time in gate" efficient. Looking at 
>> these two really quickly myself it seems that job timeouts are a big 
>> cost (anyone looking into why our jobs timeout?).
>>
>> [0] http://status.openstack.org/elastic-recheck/index.html
>> [1] http://paste.openstack.org/show/746083/
>>
>> Hope this helps,
>> Clark
> 
> Here are some numbers [0] extracted via elastic-recheck console queries. 
> They show that 6% of failed jobs are wasted on tox issues in general, 
> and 3% for tripleo projects in particular.
> 
> My final take: given some middle-ground solution, like the one I 
> illustrated earlier in this sub-thread, it might be worth it; an effort 
> that boosts the total throughput of the openstack CI system by up to 6% 
> is not a bad idea.

Also, I forgot to mention that the 6% of failed jobs translates 
non-linearly into resources saved for the pool, given the numbers [1] 
from Clark:

Top 20 repos by resource usage:
openstack/neutron: 198367474.73s, 16.95%
openstack/tripleo-heat-templates: 174221496.83s, 14.89%
...

And cutting 3% off t-h-t's share (14.89%) frees about 0.45% of the total 
resources back into the pool, which is roughly half of what, say, manila 
consumes, so it could shorten total pipeline wait times for manila's CI 
jobs by as much as a *half*:

openstack/manila: 11137905.75s, 0.95%

[1] http://paste.openstack.org/show/746083/
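
To make that arithmetic explicit, here is a minimal sketch (plain Python; 
the percentages are taken from [0] and [1], and the idea that node time 
freed by short-circuiting heavy jobs on early tox failures translates 
directly into shorter waits is my rough assumption):

# Back-of-the-envelope check of the figures above, using the numbers
# from the elastic-recheck queries [0] and the per-repo usage dump [1].
tox_failures_tripleo = 0.03   # ~3% of tripleo job failures are tox issues
tht_share = 0.1489            # tripleo-heat-templates: 14.89% of node time
manila_share = 0.0095         # manila: 0.95% of node time

# Node time freed if the heavy t-h-t jobs were skipped on tox failures
freed = tox_failures_tripleo * tht_share
print(f"freed from the pool: {freed:.2%}")                       # ~0.45%
print(f"relative to manila's usage: {freed / manila_share:.0%}") # ~47%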



> 
> [0] http://paste.openstack.org/show/746503/
> 
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando


