[placement][tripleo][infra] zuul job dependencies for greater good?
Bogdan Dobrelya
bdobreli at redhat.com
Thu Feb 28 16:38:57 UTC 2019
On 28.02.2019 17:35, Bogdan Dobrelya wrote:
> On 28.02.2019 17:22, Clark Boylan wrote:
>> On Thu, Feb 28, 2019, at 6:16 AM, Bogdan Dobrelya wrote:
>>>> On Mon, 2019-02-25 at 19:42 -0500, Clark Boylan wrote:
>>>>> On Mon, Feb 25, 2019, at 12:51 PM, Ben Nemec wrote:
>>>>>
>>>>
>>>> snip
>>>>
>>>>> That said, I wouldn't push too hard in either direction until someone
>>>>> crunched the numbers and figured out how much time it would have saved
>>>>> to not run long tests on patch sets with failing unit tests. I feel
>>>>> like
>>>>> it's probably possible to figure that out, and if so then we should do
>>>>> it before making any big decisions on this.
>>>>
>>>> For numbers the elastic-recheck tool [0] gives us fairly accurate
>>>> tracking of which issues in the system cause tests to fail. You can
>>>> use this as a starting point to potentially figure out how expensive
>>>> indentation errors caught by the pep8 jobs end up being or how
>>>> often unittests fail. You probably need to tweak the queries there
>>>> to get that specific though.
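
For reference, an elastic-recheck style query for counting such failures
could look roughly like the snippet below. The job names and field values
are just my guess rather than an existing query in the elastic-recheck
repo, and would need tuning in Kibana first:

  # hypothetical sketch of queries/<bug-number>.yaml in elastic-recheck:
  # count check-pipeline failures of the fast tox jobs
  query: >-
    build_status:"FAILURE" AND
    build_queue:"check" AND
    (build_name:"openstack-tox-pep8" OR build_name:"openstack-tox-py36")
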
>>>>
>>>> Periodically I also dump node resource utilization by project, repo,
>>>> and job [1]. I haven't automated this because Tobiash has written a
>>>> much better thing that has Zuul inject this into graphite and we
>>>> should be able to set up a grafana dashboard for that in the future
>>>> instead.
>>>>
>>>> These numbers won't tell the whole story, but should paint a fairly
>>>> accurate high level picture of the types of things we should look at
>>>> to be more node efficient and "time in gate" efficient. Looking at
>>>> these two really quickly myself it seems that job timeouts are a big
>>>> cost (anyone looking into why our jobs time out?).
>>>>
>>>> [0] http://status.openstack.org/elastic-recheck/index.html
>>>> [1] http://paste.openstack.org/show/746083/
>>>>
>>>> Hope this helps,
>>>> Clark
>>>
>>> Here are some numbers [0] extracted via elastic-recheck console
>>> queries. They show that 6% of failures are wasted on tox issues in
>>> general, and 3% for tripleo projects in particular.
>>
>> Are these wasted failures? The queries appear to be tracking valid
>> failures of those jobs. These valid failures are then actionable
>> feedback for developers to fix their changes.
>
> I'll need more time to get through the ideas below and wrap my head
> around them, but for this particular statement: I consider the CI pool
> resources "wasted" if they are spent on either of the failed tox jobs.
> The reason why a job fails is secondary from that standpoint...
ugh... I meant the CI resources spent on running the other integration
jobs while the patch is still unmergeable here. As I noted above, *some*
of the integration jobs may still run to give early results, just
probably not *all* of them. That is what I called the "middle-ground".
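
To make that middle-ground more concrete, here is a rough sketch of the
Zuul project-pipeline config I have in mind. The tripleo job names are
only placeholders for a fast-feedback job and the remaining long-running
integration jobs:

  # sketch only: keep one integration job for early feedback and hold
  # the long-running one back until the fast tox jobs have succeeded
  - project:
      check:
        jobs:
          - openstack-tox-pep8
          - openstack-tox-py36
          - tripleo-ci-early-feedback-job     # placeholder, starts right away
          - tripleo-ci-long-integration-job:  # placeholder, waits for tox
              dependencies:
                - openstack-tox-pep8
                - openstack-tox-py36
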
>
>>
>> If we expect these failures to go away we'll need to be much more
>> forceful about getting developers to run tox locally before they push.
>>
>> We need to compare (and this is a rough example) the resource usage
>> and developer time of complete batch results (a pep8 issue and an
>> integration job issue both show up in patchset one, both get fixed and
>> all tests pass in patchset two) against serialized results (pep8
>> failure in patchset one, integration failure in patchset two, all
>> tests pass in patchset three).
>>
>> Today:
>> patchset one:
>> pep8 FAILURE
>> unittest SUCCESS
>> integration FAILURE
>> patchset two:
>> pep8 SUCCESS
>> unittest SUCCESS
>> integration SUCCESS
>>
>> Proposed Future:
>> patchset one:
>> pep8 FAILURE
>> unittest SUCCESS
>> patchset two:
>> pep8 SUCCESS
>> unittest SUCCESS
>> integration FAILURE
>> patchset three:
>> pep8 SUCCESS
>> unittest SUCCESS
>> integration SUCCESS
>>
>> There are strictly more patchsets (developer roundtrips) and tests run
>> in my contrived example. Reality will depend on how many iterations we
>> actually see in the real world (e.g., are we fixing bugs reliably
>> based on test feedback, and how often do unittest and integration
>> tests fail for different reasons?).
>>
>> Someone with a better background in statistics will probably tell me
>> this approach is wrong, but using the elasticsearch tooling one
>> approach may be to pick, say, 1k changes, then for each change
>> identify which tests failed on subsequent patchsets. Then we'd be
>> able to infer
>> with some confidence the behaviors we have in the test suites around
>> catching failures, whether they are independent across jobs and
>> whether or not we fix them in batches.
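
As a possible starting point for that kind of analysis, the failed
builds of a single change across its patch sets could be pulled with a
logstash query along these lines (the change number is just a
placeholder and the exact field names would need checking in Kibana),
then grouped by patch set and job name offline:

  # hypothetical sketch: list failed builds of one change across all of
  # its patch sets, then group hits by build_patchset and build_name
  query: >-
    build_change:"654321" AND
    build_status:"FAILURE"
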
>>
>> Finally, I think we might also want to consider what the ideal is
>> here. If we find we can optimize the current system for current
>> behaviors, we also need to consider if that is worthwhile given an
>> ideal. Should developers be expected to run unittests and linters
>> locally before pushing? If so then optimizing for when they don't
>> might be effort better spent on making it easier to run the unittests
>> and linters locally and educating developers on how to do so. I think
>> we'd also ideally expect our tests to pass once they run in the gate
>> pipeline. Unfortunately I think our elastic-recheck data shows they
>> often don't and more effort in fixing those failures would provide a
>> dramatic increase in throughput (due to the compounding costs of
>> subsequent gate resets).
>>
>>>
>>> My final take is that, given some middle-ground solution like the
>>> one I illustrated earlier in this sub-thread, it might be worth it;
>>> an effort that boosts the total throughput of the OpenStack CI
>>> system by 6% is not a bad idea.
>>>
>>> [0] http://paste.openstack.org/show/746503/
>>>
>>>
>>> --
>>> Best regards,
>>> Bogdan Dobrelya,
>>> Irc #bogdando
>>>
>>>
>>
>
>
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando