[placement][tripleo][infra] zuul job dependencies for greater good?

Bogdan Dobrelya bdobreli at redhat.com
Thu Feb 28 16:38:57 UTC 2019


On 28.02.2019 17:35, Bogdan Dobrelya wrote:
> On 28.02.2019 17:22, Clark Boylan wrote:
>> On Thu, Feb 28, 2019, at 6:16 AM, Bogdan Dobrelya wrote:
>>>> On Mon, 2019-02-25 at 19:42 -0500, Clark Boylan wrote:
>>>>> On Mon, Feb 25, 2019, at 12:51 PM, Ben Nemec wrote:
>>>>>
>>>>
>>>> snip
>>>>
>>>>> That said, I wouldn't push too hard in either direction until someone
>>>>> crunched the numbers and figured out how much time it would have saved
>>>>> to not run long tests on patch sets with failing unit tests. I feel 
>>>>> like
>>>>> it's probably possible to figure that out, and if so then we should do
>>>>> it before making any big decisions on this.
>>>>
>>>> For numbers the elastic-recheck tool [0] gives us fairly accurate 
>>>> tracking of which issues in the system cause tests to fail. You can 
>>>> use this as a starting point to potentially figure out how expensive 
>>>> indentation errors caught by the pep8 jobs end up being or how 
>>>> often unittests fail. You probably need to tweak the queries there 
>>>> to get that specific though.
>>>>
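
(As an illustration only, here is a minimal Python sketch of how such counts
could be pulled programmatically rather than via the console. The endpoint
URL and the Lucene field names below are assumptions modelled on the
elastic-recheck queries and would need adjusting to the real deployment.)

# Count runs and failures of the tox-style (pep8/unit) jobs using the same
# kind of Lucene query strings as elastic-recheck.
# NOTE: ES_URL and the field names are placeholders/assumptions.
import json
import requests

ES_URL = "http://logstash.openstack.org/elasticsearch"  # assumed endpoint
TOX_JOBS = "build_name:*pep8* OR build_name:*py27* OR build_name:*py36*"

def count(lucene_query):
    body = {"query": {"query_string": {"query": lucene_query}}}
    resp = requests.post("%s/_count" % ES_URL, data=json.dumps(body),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    return resp.json()["count"]

total = count(TOX_JOBS)
failed = count("(%s) AND build_status:FAILURE" % TOX_JOBS)
print("tox-style jobs: %d runs, %d failures (%.1f%%)"
      % (total, failed, 100.0 * failed / max(total, 1)))
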
>>>> Periodically I also dump node resource utilization by project, repo, 
>>>> and job [1]. I haven't automated this because Tobiash has written a 
>>>> much better thing that has Zuul inject this into graphite and we 
>>>> should be able to set up a grafana dashboard for that in the future 
>>>> instead.
>>>>
>>>> These numbers won't tell a whole story, but should paint a fairly 
>>>> accurate high level picture of the types of things we should look at 
>>>> to be more node efficient and "time in gate" efficient. Looking at 
>>>> these two really quickly myself it seems that job timeouts are a big 
>>>> cost (anyone looking into why our jobs time out?).
>>>>
>>>> [0] http://status.openstack.org/elastic-recheck/index.html
>>>> [1] http://paste.openstack.org/show/746083/
>>>>
>>>> Hope this helps,
>>>> Clark
>>>
>>> Here are some numbers [0] extracted via elastic-recheck console queries.
>>> They show that 6% of failures are wasted on tox issues in general, and
>>> 3% for tripleo projects in particular.
>>
>> Are these wasted failures? The queries appear to be tracking valid 
>> failures of those jobs. These valid failures are then actionable 
>> feedback for developers to fix their changes.
> 
> I'll need more time to get through the ideas below and wrap my head 
> around them, but for this particular statement: I consider the CI pool 
> resources "wasted" if they are spent on a change with any failed tox job. 
> The reason why it fails is secondary from that standpoint...

ugh... I meant the CI resources spent on running the other integration 
jobs while the patch is still unmergeable. As I noted above, *some* of the 
integration jobs may still run in the early results, just probably not 
*all* of them. That is what I called the "middle-ground".

> 
>>
>> If we expect these failures to go away we'll need to be much more 
>> forceful about getting developers to run tox locally before they push.
>>
>> We need to compare (and this is a rough example) the resource usage 
>> and developer time of two cases: complete batch results, where you have 
>> a pep8 issue and an integration job issue in patchset one, fix both, and 
>> all tests pass in patchset two; versus pep8 failing in patchset one, 
>> integration failing in patchset two, and all tests passing in patchset 
>> three.
>>
>> Today:
>>    patchset one:
>>      pep8 FAILURE
>>      unittest SUCCESS
>>      integration FAILURE
>>    patchset two:
>>      pep8 SUCCESS
>>      unittest SUCCESS
>>      integration SUCCESS
>>
>> Proposed Future:
>>    patchset one:
>>      pep8 FAILURE
>>      unittest SUCCESS
>>    patchset two:
>>      pep8 SUCCESS
>>      unittest SUCCESS
>>      integration FAILURE
>>    patchset three:
>>      pep8 SUCCESS
>>      unittest SUCCESS
>>      integration SUCCESS
>>
>> There are strictly more patchsets (developer roundtrips) and tests run 
>> in my contrived example. Reality will depend on how many iterations we 
>> actually see in the real world (e.g. are we fixing bugs reliably based 
>> on test feedback, and how often do unittest and integration tests fail 
>> for different reasons?).
>>
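
(To make that trade-off concrete, a tiny back-of-envelope sketch of the two
scenarios above. The per-job durations are invented placeholders, not
measurements, so only the shape of the comparison matters.)

# Rough node-time comparison of the two contrived scenarios.
# All durations are invented placeholders (minutes of node time per job).
PEP8, UNIT, INTEGRATION = 10, 20, 120

today = [
    [PEP8, UNIT, INTEGRATION],  # patchset one: all jobs run, pep8 + integration fail
    [PEP8, UNIT, INTEGRATION],  # patchset two: everything passes
]
proposed = [
    [PEP8, UNIT],               # patchset one: integration held back by failing pep8
    [PEP8, UNIT, INTEGRATION],  # patchset two: integration runs and fails
    [PEP8, UNIT, INTEGRATION],  # patchset three: everything passes
]

for name, runs in (("today", today), ("proposed", proposed)):
    print("%-8s %d patchsets, %d node-minutes"
          % (name, len(runs), sum(sum(ps) for ps in runs)))
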
>> Someone with a better background in statistics will probably tell me 
>> this approach is wrong, but using the elasticsearch tooling one approach 
>> may be to pick, say, 1k changes and then, for each change, identify 
>> which tests failed on subsequent patchsets. Then we'd be able to infer 
>> with some confidence the behaviors we have in the test suites around 
>> catching failures: whether they are independent across jobs and 
>> whether or not we fix them in batches.
>>
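
(A rough sketch of that sampling as well, assuming the same Elasticsearch
endpoint and the build_change/build_patchset/build_name/build_status fields;
it only shows the shape of the per-change data one would collect, without
the actual sampling of changes or any error handling.)

# For one change, record which jobs failed on which patchset, to help see
# whether failures show up independently across jobs or get fixed in batches.
# ES_URL and the field names are placeholders/assumptions.
import collections
import json
import requests

ES_URL = "http://logstash.openstack.org/elasticsearch"  # assumed endpoint

def failures_by_patchset(change_number):
    body = {
        "size": 1000,
        "query": {"query_string": {
            "query": "build_change:%s AND build_status:FAILURE" % change_number}},
        "_source": ["build_patchset", "build_name"],
    }
    resp = requests.post("%s/_search" % ES_URL, data=json.dumps(body),
                         headers={"Content-Type": "application/json"})
    resp.raise_for_status()
    failed = collections.defaultdict(set)
    for hit in resp.json()["hits"]["hits"]:
        src = hit["_source"]
        failed[src["build_patchset"]].add(src["build_name"])
    return failed

# e.g. for ~1k change numbers sampled from Gerrit:
#   for change in sampled_changes:
#       print(change, dict(failures_by_patchset(change)))
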
>> Finally, I think we might also want to consider what the ideal is 
>> here. If we find we can optimize the current system for current 
>> behaviors, we also need to consider if that is worthwhile given an 
>> ideal. Should developers be expected to run unittests and linters 
>> locally before pushing? If so, then rather than optimizing for the case 
>> when they don't, the effort might be better spent on making it easier to 
>> run the unittests and linters locally and on educating developers on how 
>> to do so. I think 
>> we'd also ideally expect our tests to pass once they run in the gate 
>> pipeline. Unfortunately I think our elastic-recheck data shows they 
>> often don't and more effort in fixing those failures would provide a 
>> dramatic increase in throughput (due to the compounding costs of 
>> subsequent gate resets).
>>
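
(On the compounding cost of gate resets, a toy Python model with invented
numbers for queue depth and per-change node time, just to show why a single
failure near the head of a shared gate queue is so expensive.)

# Toy model of a gate reset in a dependent pipeline: when the change at a
# given position fails, every change queued behind it is restarted, so the
# node time already spent on those changes is (roughly) thrown away.
# Queue depth and per-change cost are invented placeholders.
QUEUE_DEPTH = 20           # changes in the shared gate queue
NODE_HOURS_PER_CHANGE = 3  # node-hours of jobs run per change

def wasted_node_hours(failing_position):
    """Node-hours discarded when the change at this 1-based position fails."""
    behind = QUEUE_DEPTH - failing_position
    return behind * NODE_HOURS_PER_CHANGE

for pos in (1, 5, 10):
    print("reset at position %2d throws away about %d node-hours"
          % (pos, wasted_node_hours(pos)))
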
>>>
>>> My final take is that, given some middle-ground solution like the one I
>>> illustrated earlier in this sub-thread, it might be worth it; boosting
>>> the total throughput of the OpenStack CI system by 6% is not such a bad
>>> idea.
>>>
>>> [0] http://paste.openstack.org/show/746503/
>>>
>>>
>>> -- 
>>> Best regards,
>>> Bogdan Dobrelya,
>>> Irc #bogdando
>>>
>>>
>>
> 
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando


