[placement][tripleo][infra] zuul job dependencies for greater good?

Bogdan Dobrelya bdobreli at redhat.com
Thu Feb 28 16:35:50 UTC 2019


On 28.02.2019 17:22, Clark Boylan wrote:
> On Thu, Feb 28, 2019, at 6:16 AM, Bogdan Dobrelya wrote:
>>> On Mon, 2019-02-25 at 19:42 -0500, Clark Boylan wrote:
>>>> On Mon, Feb 25, 2019, at 12:51 PM, Ben Nemec wrote:
>>>>
>>>
>>> snip
>>>
>>>> That said, I wouldn't push too hard in either direction until someone
>>>> crunched the numbers and figured out how much time it would have saved
>>>> to not run long tests on patch sets with failing unit tests. I feel like
>>>> it's probably possible to figure that out, and if so then we should do
>>>> it before making any big decisions on this.
>>>
>>> For numbers, the elastic-recheck tool [0] gives us fairly accurate tracking of which issues in the system cause tests to fail. You can use this as a starting point to figure out how expensive the indentation errors caught by the pep8 jobs end up being, or how often unittests fail. You probably need to tweak the queries there to get that specific, though.
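
(For readers not familiar with the tool: elastic-recheck queries are small
YAML files holding a Lucene query string that is run against the CI logstash
index. A minimal sketch of the kind of query that could track tox job
failures -- the file name, job name and message string are only illustrative:

  # e.g. queries/<bug-number>.yaml in the elastic-recheck repo
  query: >-
    build_name:"openstack-tox-pep8" AND
    build_status:"FAILURE" AND
    message:"ERROR: InvocationError"
)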
>>>
>>> Periodically I also dump node resource utilization by project, repo, and job [1]. I haven't automated this because Tobiash has written a much better thing that has Zuul inject this into graphite and we should be able to set up a grafana dashboard for that in the future instead.
>>>
>>> These numbers won't tell the whole story, but they should paint a fairly accurate high-level picture of the types of things we should look at to be more node efficient and "time in gate" efficient. Looking at these two really quickly myself, it seems that job timeouts are a big cost (is anyone looking into why our jobs time out?).
>>>
>>> [0] http://status.openstack.org/elastic-recheck/index.html
>>> [1] http://paste.openstack.org/show/746083/
>>>
>>> Hope this helps,
>>> Clark
>>
>> Here are some numbers [0] extracted via elastic-recheck console queries.
>> They show 6% of failures wasted because of tox issues in general, and 3%
>> for tripleo projects in particular.
> 
> Are these wasted failures? The queries appear to be tracking valid failures of those jobs. These valid failures are then actionable feedback for developers to fix their changes.

I'll need more time to get through the ideas below and wrap my head 
around them, but for this particular statement: I consider the CI pool 
resources "wasted" when they are spent on a patchset where any of the 
tox jobs fails. The reason why it fails is secondary from that standpoint...

> 
> If we expect these failures to go away we'll need to be much more forceful about getting developers to run tox locally before they push.
> 
> We need to compare (and this is a rough example) the resource usage and developer time of complete batch results, where a pep8 issue and an integration job issue both show up in patchset one and both are fixed so everything passes in patchset two, against the serialized case, where pep8 fails in patchset one, integration fails in patchset two, and all tests pass in patchset three.
> 
> Today:
>    patchset one:
>      pep8 FAILURE
>      unittest SUCCESS
>      integration FAILURE
>    patchset two:
>      pep8 SUCCESS
>      unittest SUCCESS
>      integration SUCCESS
> 
> Proposed Future:
>    patchset one:
>      pep8 FAILURE
>      unittest SUCCESS
>    patchset two:
>      pep8 SUCCESS
>      unittest SUCCESS
>      integration FAILURE
>    patchset three:
>      pep8 SUCCESS
>      unittest SUCCESS
>      integration SUCCESS
> 
> There are strictly more patchsets (developer roundtrips) and tests run in my contrived example. Reality will depend on how many iterations we actually see in the real world (eg are we fixing bugs reliably based on test feedback and how often do unittest and integration tests fail for different reasons).
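
(For concreteness, the "run the integration job only after the short jobs
pass" ordering being debated here maps onto Zuul's dependencies job
attribute in a project pipeline. A minimal sketch, with job names purely
illustrative:

  - project:
      check:
        jobs:
          - openstack-tox-pep8
          - openstack-tox-py36
          - tripleo-ci-integration-job:
              dependencies:
                - openstack-tox-pep8
                - openstack-tox-py36

With that in place the integration job does not start until both tox jobs
report success, which saves the nodes in the patchset-one case above but
adds a roundtrip whenever the integration failure is independent of the
tox failure.)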
> 
> Someone with a better background in statistics will probably tell me this approach is wrong, but using the elasticsearch tooling one approach may be to pick say 1k changes, then for each change identify which tests failed on subsequent patchsets? Then we'd be able to infer with some confidence the behaviors we have in the test suites around catching failures, whether they are independent across jobs and whether or not we fix them in batches.
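
(One way to pull raw data for that kind of analysis could be per-change
queries against the same logstash index elastic-recheck uses, bucketing the
hits by patchset and job -- roughly, assuming build_change and build_patchset
are indexed like the other build_* fields, and with the change number made up:

  query: >-
    build_change:"637000" AND
    build_status:"FAILURE"
  # then group the hits by build_patchset and build_name to see which
  # jobs kept failing across successive patchsets of the same change
)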
> 
> Finally, I think we might also want to consider what the ideal is here. If we find we can optimize the current system for current behaviors, we also need to consider if that is worthwhile given an ideal. Should developers be expected to run unittests and linters locally before pushing? If so then optimizing for when they don't might be effort better spent on making it easier to run the unittests and linters locally and educating developers on how to do so. I think we'd also ideally expect our tests to pass once they run in the gate pipeline. Unfortunately I think our elastic-recheck data shows they often don't and more effort in fixing those failures would provide a dramatic increase in throughput (due to the compounding costs of subsequent gate resets).
> 
>>
>> My final take is that, given some middle-ground solution like the one I
>> illustrated earlier in this sub-thread, it might be worth it; boosting
>> the total throughput of the OpenStack CI system by 6% for that effort is
>> not a bad idea.
>>
>> [0] http://paste.openstack.org/show/746503/
>>
>>
>> -- 
>> Best regards,
>> Bogdan Dobrelya,
>> Irc #bogdando
>>
>>
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando


