[placement][tripleo][infra] zuul job dependencies for greater good?
Bogdan Dobrelya
bdobreli at redhat.com
Thu Feb 28 14:23:34 UTC 2019
On 28.02.2019 15:16, Bogdan Dobrelya wrote:
>> On Mon, 2019-02-25 at 19:42 -0500, Clark Boylan wrote:
>>> On Mon, Feb 25, 2019, at 12:51 PM, Ben Nemec wrote:
>>>
>>
>> snip
>>
>>> That said, I wouldn't push too hard in either direction until someone
>>> crunched the numbers and figured out how much time it would have
>>> saved to not run long tests on patch sets with failing unit tests. I
>>> feel like it's probably possible to figure that out, and if so then
>>> we should do it before making any big decisions on this.
>>
>> For numbers the elastic-recheck tool [0] gives us fairly accurate
>> tracking of which issues in the system cause tests to fail. You can
>> use this as a starting point to figure out how expensive indentation
>> errors caught by the pep8 jobs end up being, or how often unit tests
>> fail. You probably need to tweak the queries there to get that
>> specific, though.
>>
>> Periodically I also dump node resource utilization by project, repo,
>> and job [1]. I haven't automated this because Tobiash has written a
>> much better thing that has Zuul inject this into graphite and we
>> should be able to set up a grafana dashboard for that in the future
>> instead.
>>
>> These numbers won't tell a whole story, but should paint a fairly
>> accurate high level picture of the types of things we should look at
>> to be more node efficient and "time in gate" efficient. Looking at
>> these two really quickly myself it seems that job timeouts are a big
>> cost (anyone looking into why our jobs timeout?).
>>
>> [0] http://status.openstack.org/elastic-recheck/index.html
>> [1] http://paste.openstack.org/show/746083/
>>
>> Hope this helps,
>> Clark
>
> Here are some numbers [0] extracted via elastic-recheck console queries.
> They show that tox issues account for roughly 6% of wasted job failures
> in general, and about 3% for tripleo projects in particular.
>
> My final take is that, given a middle-ground solution like the one I
> illustrated earlier in this sub-thread, the effort might well be worth
> it: boosting the total throughput of the OpenStack CI system by up to
> 6% is not a bad idea.
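
For reference, the console queries behind [0] boil down to something like
the sketch below. The endpoint, index and field names are approximations
of the usual logstash.openstack.org / elastic-recheck setup rather than
the exact queries I ran, so treat it as illustrative only:

import requests

# Assumed Elasticsearch endpoint exposed by logstash.openstack.org;
# adjust to whatever console URL you actually use.
LOGSTASH_URL = "http://logstash.openstack.org/elasticsearch/_search"

# Count failed pep8/unit (tox-driven) builds over the last week.
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"build_status": "FAILURE"}},
                {"query_string": {
                    "query": "build_name:*pep8* OR build_name:*unit*"}},
                {"range": {"@timestamp": {"gte": "now-7d"}}},
            ]
        }
    },
    "size": 0,  # only the hit count is interesting here
}

resp = requests.post(LOGSTASH_URL, json=query, timeout=60)
print("failed tox-level builds:", resp.json()["hits"]["total"])
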
I also forgot to mention that the 6% of failed jobs does not translate
linearly into saved resources; how much of the pool gets freed depends on
which projects the failures hit, given the numbers [1] from Clark:
Top 20 repos by resource usage:
openstack/neutron: 198367474.73s, 16.95%
openstack/tripleo-heat-templates: 174221496.83s, 14.89%
...
And cutting 3% off t-h-t's share (3% of 14.89%) frees roughly 0.45% of the
total resources back into the pool; since that is about *half* of manila's
entire share, it could noticeably shorten total pipeline wait times for CI
jobs of a project like manila:
openstack/manila: 11137905.75s, 0.95%
[1] http://paste.openstack.org/show/746083/
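
To make the arithmetic explicit, a quick back-of-the-envelope check in
Python (the percentages are taken straight from [0] and [1]; the rest is
just multiplication):

tht_share = 14.89   # % of total node time used by tripleo-heat-templates [1]
wasted = 0.03       # ~3% of tripleo job results wasted on tox-level failures [0]
freed = tht_share * wasted
print(freed)        # ~0.45 -> about 0.45% of the total pool freed

manila_share = 0.95           # % of total node time used by manila [1]
print(freed / manila_share)   # ~0.47 -> roughly half of manila's entire usage
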
>
> [0] http://paste.openstack.org/show/746503/
>
>
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando