[openstack-dev] [ci][infra][tripleo] Multi-staged check pipelines for Zuul v3 proposal
Sagi Shnaidman
sshnaidm at redhat.com
Tue May 15 15:08:01 UTC 2018
Bogdan,
I think before making final decisions we need to know exactly what price we
need to pay. Without exact numbers it will be difficult to discuss.
If we need to wait 80 minutes for the undercloud-containers job to finish
before starting all other jobs, it will take about 4.5 hours to get a result
(+ 4.5 hours in gate), which is too high a price IMHO and isn't worth the
effort.
What are the exact numbers we are talking about?
Thanks
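
To make the arithmetic concrete, here is a minimal Python sketch of how
the time to a full check result grows once jobs wait on a first stage. Only
the 80 minutes for undercloud-containers comes from this message; the other
job names and the 190 and 60 minute durations are hypothetical, picked only
so that the staged total lands near the 4.5 hours mentioned above. Queue
waiting time is ignored.

```python
def wall_clock_minutes(durations, depends_on):
    """Minutes until the last job finishes, assuming every job starts as soon
    as all the jobs it depends on have finished (no queue wait)."""
    finish = {}

    def finish_time(job):
        if job not in finish:
            deps = depends_on.get(job, [])
            start = max((finish_time(dep) for dep in deps), default=0)
            finish[job] = start + durations[job]
        return finish[job]

    return max(finish_time(job) for job in durations)


# Hypothetical durations, except the 80 minutes quoted in this thread.
durations = {
    "undercloud-containers": 80,
    "longest-other-job": 190,
    "short-job": 60,
}

# Today: every job starts at once.
parallel = wall_clock_minutes(durations, {})

# Staged: everything else waits for undercloud-containers.
staged = wall_clock_minutes(
    durations,
    {
        "longest-other-job": ["undercloud-containers"],
        "short-job": ["undercloud-containers"],
    },
)

print(parallel, staged)
```

With these assumed numbers the staged run takes the first-stage time plus the
longest downstream job, so real durations are what decide whether the price is
acceptable.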
On Tue, May 15, 2018 at 3:07 PM, Bogdan Dobrelya <bdobreli at redhat.com>
wrote:
> Let me clarify the problem I want to solve with pipelines.
>
> It is getting *hard* to develop things and move patches to the Happy End
> (merged):
> - Patches wait too long for CI jobs to start. It should be minutes and not
> hours of waiting.
> - If a patch fails a job w/o a good reason, the subsequent recheck
> repeats the waiting all over again.
>
> How may pipelines help solve it?
> Pipelines only alleviate, not solve, the problem of waiting. We only want
> to build pipelines for the main zuul check process, omitting gating and RDO
> CI (for now).
>
> There are two cases to consider:
> - A patch succeeds all checks
> - A patch fails a check with dependencies
>
> The latter case benefits us the most when pipelines are designed as
> proposed here, so that any jobs expected to fail when a dependency fails
> are omitted from execution. This saves a lot of HW resources and zuul
> queue slots, making them available for other patches and allowing
> those to have CI jobs started faster (less waiting!). When we have "recheck
> storms", e.g. because of some known intermittent side issue, that outcome
> is multiplied by the recheck storm um... level, and delivers even better
> and absolutely amazing results :) The Zuul queue will not grow insanely,
> overwhelmed by multiple clones of rechecked jobs that are highly likely
> doomed to fail and that block other patches which might have a chance to
> pass checks because they are not affected by that intermittent issue.
>
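
The saving described in the paragraph above can be sketched in a few lines
of Python. This is only a toy model, not how Zuul itself schedules jobs: a
job is marked SKIPPED when anything it depends on did not succeed, and the
cost is counted in job-minutes. The two job names are taken from later in
this thread; the durations and the recheck count are hypothetical, and a
failing job is charged its full duration for simplicity.

```python
def simulate_check(durations, depends_on, failing):
    """Return (status per job, job-minutes consumed) for one check run."""
    status = {}

    def resolve(job):
        if job not in status:
            dep_results = [resolve(dep) for dep in depends_on.get(job, [])]
            if any(result != "SUCCESS" for result in dep_results):
                status[job] = "SKIPPED"  # never started, costs nothing
            elif job in failing:
                status[job] = "FAILURE"
            else:
                status[job] = "SUCCESS"
        return status[job]

    for job in durations:
        resolve(job)
    job_minutes = sum(durations[j] for j, s in status.items() if s != "SKIPPED")
    return status, job_minutes


CONTAINERS = "tripleo-ci-centos-7-containers-multinode"
THREE_NODES = "tripleo-ci-centos-7-3nodes-multinode"

durations = {CONTAINERS: 100, THREE_NODES: 150}  # hypothetical minutes
failing = {CONTAINERS, THREE_NODES}  # an intermittent issue breaks both
rechecks = 5  # hypothetical size of a recheck storm

_, flat = simulate_check(durations, {}, failing)
status, staged = simulate_check(durations, {THREE_NODES: [CONTAINERS]}, failing)

print(status[THREE_NODES], flat * rechecks, staged * rechecks)
```

In this model the dependent job is skipped on every recheck, so the saving
scales linearly with the number of rechecks.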
> And for the first case, when a patch succeeds, it takes some extended
> time, and that is the price to pay. How much time it takes to finish in a
> pipeline fully depends on implementation.
>
> The effectiveness could only be measured with numbers extracted from
> Elasticsearch data, like average time to wait for a job to start, success
> vs fail execution time percentiles for a job, average number of rechecks,
> recheck storm history et al. I don't have that data and don't know how to
> get it. Any help with that is much appreciated and could really help to
> move the proposed patches forward or decline them. And we could then
> compare "before" and "after" as well.
>
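
Once such records are available, the metrics listed above are cheap to
compute. The following Python sketch assumes the data has already been
exported as a list of per-job records; the field names (`result`,
`wait_min`, `run_min`) and all sample values are hypothetical and do not
reflect the real Elasticsearch schema or real CI results.

```python
from statistics import mean, quantiles


def summarize(records):
    """Average wait before start, plus p50/p90 run time grouped by result."""
    runs_by_result = {}
    for record in records:
        runs_by_result.setdefault(record["result"], []).append(record["run_min"])

    summary = {"avg_wait_min": mean(r["wait_min"] for r in records)}
    for result, runs in runs_by_result.items():
        if len(runs) > 1:
            # n=10 yields nine cut points: index 4 is p50, index 8 is p90.
            cuts = quantiles(runs, n=10, method="inclusive")
        else:
            cuts = [runs[0]] * 9  # quantiles() needs at least two values
        summary[result] = {"p50": cuts[4], "p90": cuts[8]}
    return summary


# Made-up sample records, for illustration only.
records = [
    {"result": "SUCCESS", "wait_min": 30, "run_min": 100},
    {"result": "SUCCESS", "wait_min": 90, "run_min": 120},
    {"result": "SUCCESS", "wait_min": 60, "run_min": 110},
    {"result": "FAILURE", "wait_min": 120, "run_min": 40},
    {"result": "FAILURE", "wait_min": 150, "run_min": 60},
]

summary = summarize(records)
print(summary)
```

Running the same summary on records from before and after a pipeline change
would give the "before" and "after" comparison.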
> I hope that explains the problem scope and the methodology to address that.
>
>
> On 5/14/18 6:15 PM, Bogdan Dobrelya wrote:
>
>> An update for your review please folks
>>
>> Bogdan Dobrelya <bdobreli at redhat.com> writes:
>>>
>>> Hello.
>>>> As Zuul documentation [0] explains, the names "check", "gate", and
>>>> "post" may be altered for more advanced pipelines. Is it doable to
>>>> introduce, for particular openstack projects, multiple check
>>>> stages/steps as check-1, check-2 and so on? And is it possible to make
>>>> the subsequent steps reuse the environments that the previous steps
>>>> finished with?
>>>>
>>>> Narrowing down to tripleo CI scope, the problem I'd want us to solve
>>>> with this "virtual RFE", and using such multi-staged check pipelines,
>>>> is reducing (ideally, de-duplicating) some of the common steps for
>>>> existing CI jobs.
>>>>
>>>
>>> What you're describing sounds more like a job graph within a pipeline.
>>> See: https://docs.openstack.org/infra/zuul/user/config.html#attr-job.dependencies
>>> for how to configure a job to run only after another job has completed.
>>> There is also a facility to pass data between such jobs.
>>>
>>> ... (skipped) ...
>>>
>>> Creating a job graph to have one job use the results of the previous job
>>> can make sense in a lot of cases. It doesn't always save *time*
>>> however.
>>>
>>> It's worth noting that in OpenStack's Zuul, we have made an explicit
>>> choice not to have long-running integration jobs depend on shorter pep8
>>> or tox jobs, and that's because we value developer time more than CPU
>>> time. We would rather run all of the tests and return all of the
>>> results so a developer can fix all of the errors as quickly as possible,
>>> rather than forcing an iterative workflow where they have to fix all the
>>> whitespace issues before the CI system will tell them which actual tests
>>> broke.
>>>
>>> -Jim
>>>
>>
>> I proposed a few zuul dependencies [0], [1] to tripleo CI pipelines for
>> undercloud deployments vs upgrades testing (and some more). Given that
>> those undercloud jobs do not have such high failure rates, though, I think
>> Emilien is right in his comments and those would buy us nothing.
>>
>> On the other hand, folks, what do you think of making
>> tripleo-ci-centos-7-3nodes-multinode depend on
>> tripleo-ci-centos-7-containers-multinode [2]? The former seems quite
>> failure-prone and long running, and is non-voting. It deploys (see
>> featuresets configs [3]*) 3 nodes in HA fashion. And it seems to almost
>> never pass when containers-multinode fails - see the CI stats page [4].
>> I've found only 2 cases there of the opposite situation, where
>> containers-multinode fails but 3nodes-multinode passes. So cutting off
>> those future failures via the added dependency *would* buy us something
>> and allow other jobs to wait less to commence, at the reasonable price of
>> a somewhat extended run time for the main zuul pipeline. I think it makes
>> sense, and that extended CI time will not add so much overhead to the RDO
>> CI execution times as to become a problem.
>> WDYT?
>>
>> [0] https://review.openstack.org/#/c/568275/
>> [1] https://review.openstack.org/#/c/568278/
>> [2] https://review.openstack.org/#/c/568326/
>> [3] https://docs.openstack.org/tripleo-quickstart/latest/feature-configuration.html
>> [4] http://tripleo.org/cistatus.html
>>
>> * ignore column 1, it's obsolete; all CI jobs are now using configs
>> download AFAICT...
>>
>>
>
> --
> Best regards,
> Bogdan Dobrelya,
> Irc #bogdando
>
--
Best regards
Sagi Shnaidman