[openstack-dev] [ci][infra][tripleo] Multi-staged check pipelines for Zuul v3 proposal
Bogdan Dobrelya
bdobreli at redhat.com
Tue May 15 15:54:42 UTC 2018
On 5/15/18 5:08 PM, Sagi Shnaidman wrote:
> Bogdan,
>
> I think before final decisions we need to know exactly what price we
> need to pay. Without exact numbers it will be difficult to discuss.
> If we need to wait 80 mins for the undercloud-containers job to finish
> before starting all other jobs, it will be about 4.5 hours to wait for a
> result (+ 4.5 hours in gate), which is too big a price imho and isn't
> worth the effort.
>
> What are exact numbers we are talking about?
I fully agree, but I don't have those numbers, sorry! As I noted earlier,
they are definitely sitting in openstack-infra's elastic search DB and
just need to be extracted with some assistance from folks who know more
about that!
>
> Thanks
>
>
> On Tue, May 15, 2018 at 3:07 PM, Bogdan Dobrelya <bdobreli at redhat.com> wrote:
>
> Let me clarify the problem I want to solve with pipelines.
>
> It is getting *hard* to develop things and move patches to the Happy
> End (merged):
> - Patches wait too long for CI jobs to start. It should be minutes
> and not hours of waiting.
> - If a patch fails a job w/o a good reason, the subsequent recheck
> operation repeats the waiting all over again.
>
> How may pipelines help solve it?
> Pipelines only alleviate, not solve, the problem of waiting. We only
> want to build pipelines for the main zuul check process, omitting
> gating and RDO CI (for now).
>
> There are two cases to consider:
> - A patch succeeds all checks
> - A patch fails a check with dependencies
>
> The latter case benefits us the most when pipelines are designed
> as proposed here, so that any jobs expected to fail when a
> dependency fails are omitted from execution. This saves a lot of HW
> resources and zuul queue slots, making them available for other
> patches and allowing those to have CI jobs started faster (less
> waiting!). When we have "recheck storms", for example because of some
> known intermittent side issue, that saving is multiplied by the recheck
> storm um... level, and delivers even better and absolutely amazing
> results :) The zuul queue will not grow insanely, overwhelmed by
> multiple clones of rechecked jobs that are highly likely to fail and
> that block other patches which might have a chance to pass checks
> because they are not affected by that intermittent issue.
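To illustrate the skipping behaviour described above, here is a minimal
Python sketch. It is not Zuul code: the function name plan_run and the
two-job graph are assumptions made for illustration, and only the job names
come from this thread. It also assumes the graph has no cycles.

```python
from typing import Dict, List, Set


def plan_run(deps: Dict[str, List[str]], failing: Set[str]) -> Dict[str, str]:
    """Return the outcome of each job: 'passed', 'failed' or 'skipped'.

    deps maps a job name to the jobs it depends on. A job is skipped
    when any of its dependencies did not pass.
    """
    outcome: Dict[str, str] = {}

    def resolve(job: str) -> str:
        if job in outcome:
            return outcome[job]
        if any(resolve(parent) != "passed" for parent in deps.get(job, [])):
            outcome[job] = "skipped"
        elif job in failing:
            outcome[job] = "failed"
        else:
            outcome[job] = "passed"
        return outcome[job]

    for job in deps:
        resolve(job)
    return outcome


# Job names are taken from this thread; the graph itself is illustrative.
graph = {
    "tripleo-ci-centos-7-containers-multinode": [],
    "tripleo-ci-centos-7-3nodes-multinode": [
        "tripleo-ci-centos-7-containers-multinode"
    ],
}
result = plan_run(graph, failing={"tripleo-ci-centos-7-containers-multinode"})

# Every skipped job is a set of nodes left free for other patches.
jobs_skipped = sum(1 for state in result.values() if state == "skipped")
```

In a recheck storm the same saving would repeat for every recheck of every
affected patch, which is the multiplication effect mentioned above.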
>
> And for the first case, when a patch succeeds, it takes some
> extended time, and that is the price to pay. How much time it takes
> to finish in a pipeline fully depends on implementation.
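The price for the successful case can be sketched as the longest chain of
dependent jobs. The Python below is only a rough model: it assumes every job
starts the moment its dependencies finish (no queue wait), the 80 minutes is
the undercloud-containers figure quoted in this thread, and the second job
with its 120 minutes is purely hypothetical.

```python
from functools import lru_cache
from typing import Dict, List


def wall_clock_minutes(durations: Dict[str, int], deps: Dict[str, List[str]]) -> int:
    """Minutes until the last job finishes, assuming each job starts
    as soon as all of its dependencies have finished."""

    @lru_cache(maxsize=None)
    def finish(job: str) -> int:
        # A job with no dependencies starts at minute 0.
        start = max((finish(parent) for parent in deps.get(job, [])), default=0)
        return start + durations[job]

    return max(finish(job) for job in durations)


durations = {
    "undercloud-containers": 80,   # figure quoted in this thread
    "hypothetical-long-job": 120,  # assumed value, for illustration only
}

# Today: all jobs run in parallel.
flat = wall_clock_minutes(durations, {})

# Proposed: the long job waits for undercloud-containers.
staged = wall_clock_minutes(
    durations, {"hypothetical-long-job": ["undercloud-containers"]}
)

extra_minutes = staged - flat
```

With these assumed inputs the staged layout adds the full duration of the
parent job to a successful run, which is why real job durations are needed
before deciding anything.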
>
> The effectiveness could only be measured with numbers extracted from
> elastic search data, like average time to wait for a job to start,
> success vs fail execution time percentiles for a job, average number
> of rechecks, recheck storm history et al. I don't have that data
> and don't know how to get it. Any help with that is much appreciated
> and could really help to move the proposed patches forward or
> decline them. And we could then compare "before" and "after" as well.
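Once such data is exported, the metrics listed above are cheap to compute.
A minimal Python sketch follows, assuming the export yields one record per
job run; the field names (result, wait_min, run_min) and all sample values
are hypothetical and do not reflect the real elastic search schema.

```python
from statistics import mean, quantiles
from typing import Dict, List

# Hypothetical records standing in for an export from the elastic search DB.
records = [
    {"result": "SUCCESS", "wait_min": 30, "run_min": 90},
    {"result": "SUCCESS", "wait_min": 60, "run_min": 100},
    {"result": "SUCCESS", "wait_min": 90, "run_min": 110},
    {"result": "FAILURE", "wait_min": 30, "run_min": 20},
    {"result": "FAILURE", "wait_min": 60, "run_min": 40},
    {"result": "FAILURE", "wait_min": 90, "run_min": 60},
]


def summarize(rows: List[Dict]) -> Dict:
    """Average wait before a job starts, plus run-time quartiles per result."""
    by_result: Dict[str, List[int]] = {}
    for row in rows:
        by_result.setdefault(row["result"], []).append(row["run_min"])
    return {
        "avg_wait_min": mean(row["wait_min"] for row in rows),
        # quantiles() needs at least two samples per group.
        "run_quartiles": {
            result: quantiles(values, n=4)
            for result, values in by_result.items()
            if len(values) >= 2
        },
    }


summary = summarize(records)
```

Running the same summary on data from before and after a change would give
the "before" and "after" comparison.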
>
> I hope that explains the problem scope and the methodology to
> address that.
>
>
> On 5/14/18 6:15 PM, Bogdan Dobrelya wrote:
>
> An update for your review please folks
>
> Bogdan Dobrelya <bdobreli at redhat.com> writes:
>
> Hello.
> As Zuul documentation [0] explains, the names "check", "gate", and
> "post" may be altered for more advanced pipelines. Is it doable to
> introduce, for particular openstack projects, multiple check
> stages/steps as check-1, check-2 and so on? And is it possible to make
> the subsequent steps reuse the environments that the previous steps
> finished with?
>
> Narrowing down to tripleo CI scope, the problem I'd want us to solve
> with this "virtual RFE", and using such multi-staged check pipelines,
> is reducing (ideally, de-duplicating) some of the common steps for
> existing CI jobs.
>
>
> What you're describing sounds more like a job graph within a pipeline.
> See:
> https://docs.openstack.org/infra/zuul/user/config.html#attr-job.dependencies
> for how to configure a job to run only after another job has completed.
> There is also a facility to pass data between such jobs.
>
> ... (skipped) ...
>
> Creating a job graph to have one job use the results of the previous job
> can make sense in a lot of cases. It doesn't always save *time* however.
>
> It's worth noting that in OpenStack's Zuul, we have made an explicit
> choice not to have long-running integration jobs depend on shorter pep8
> or tox jobs, and that's because we value developer time more than CPU
> time. We would rather run all of the tests and return all of the
> results so a developer can fix all of the errors as quickly as possible,
> rather than forcing an iterative workflow where they have to fix all the
> whitespace issues before the CI system will tell them which actual tests
> broke.
>
> -Jim
>
>
> I proposed a few zuul dependencies [0], [1] to tripleo CI
> pipelines for undercloud deployments vs upgrades testing (and
> some more). Given that those undercloud jobs do not have high
> failure rates, though, I think Emilien is right in his comments and
> those would buy us nothing.
>
> On the other hand, folks, what do you think of making
> tripleo-ci-centos-7-3nodes-multinode depend on
> tripleo-ci-centos-7-containers-multinode [2]? The former seems
> quite failure-prone and long running, and is non-voting. It deploys (see
> featuresets configs [3]*) 3 nodes in HA fashion. And it seems to
> almost never pass when containers-multinode fails - see
> the CI stats page [4]. I've found only 2 cases there of the
> opposite situation, when containers-multinode fails but
> 3nodes-multinode passes. So cutting off those future failures
> via the added dependency *would* buy us something and allow
> other jobs to wait less to commence, at the reasonable price of
> a somewhat extended time for the main zuul pipeline. I think it
> makes sense, and that extended CI time will not add so much overhead
> to the RDO CI execution times as to become a problem. WDYT?
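The argument above rests on how often the dependent job passes when its
parent fails. Here is a minimal Python sketch of that tally, assuming the run
history is available as (parent_passed, child_passed) pairs. The history
below is hypothetical; only the count of 2 for "parent fails, child passes"
comes from the message above.

```python
from collections import Counter
from typing import Iterable, Tuple


def tally(history: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count (parent_passed, child_passed) pairs across past check runs."""
    return Counter(history)


# Hypothetical history. parent = containers-multinode, child = 3nodes-multinode.
history = (
    [(True, True)] * 10
    + [(True, False)] * 6
    + [(False, False)] * 12
    + [(False, True)] * 2  # the "only 2 cases" mentioned above
)
counts = tally(history)

# Child runs the dependency would have skipped that were going to fail anyway.
child_runs_saved = counts[(False, False)]

# Child runs the dependency would have skipped that would have passed.
child_passes_lost = counts[(False, True)]
```

The larger the first number is compared to the second, the stronger the case
for the dependency.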
>
> [0] https://review.openstack.org/#/c/568275/
> [1] https://review.openstack.org/#/c/568278/
> [2] https://review.openstack.org/#/c/568326/
> [3] https://docs.openstack.org/tripleo-quickstart/latest/feature-configuration.html
> [4] http://tripleo.org/cistatus.html
>
> * ignore column 1, it's obsolete, all CI jobs are now using
> configs download AFAICT...
>
>
>
> --
> Best regards,
> Bogdan Dobrelya,
> Irc #bogdando
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
>
>
> --
> Best regards
> Sagi Shnaidman
>
>
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando