[openstack-dev] [tripleo] critical situation with CI / upgrade jobs

Wesley Hayutin whayutin at redhat.com
Wed Aug 16 03:06:20 UTC 2017


On Tue, Aug 15, 2017 at 9:33 PM, Emilien Macchi <emilien at redhat.com> wrote:

> So far we have three critical issues that we all need to address as
> soon as we can.
>
> Problem #1: Upgrade jobs timeout from Newton to Ocata
> https://bugs.launchpad.net/tripleo/+bug/1702955
> Today I spent an hour looking at it, and here's what I've found so far:
> depending on which public cloud the TripleO CI jobs run on, they either
> time out or they don't.
> Here's an example of Heat resources that run in our CI:
> https://www.diffchecker.com/VTXkNFuk
> On the left are the resources from a job that failed (running on
> internap); on the right, a job that succeeded (running on citycloud).
> I've been through all the upgrade steps and haven't seen any specific
> task that takes much longer on one cloud than the other, just many
> small differences that add up to a big difference at the end (which
> makes this hard to debug).
> Note: both jobs use AFS mirrors.
> Help on that front would be very welcome.
>
>
> Problem #2: from Ocata to Pike (containerized): missing container
> upload step
> https://bugs.launchpad.net/tripleo/+bug/1710938
> Wes has a patch (thanks!) that is currently in the gate:
> https://review.openstack.org/#/c/493972
> Thanks to that work, we managed to find problem #3.
>
>
> Problem #3: from Ocata to Pike: all container images are
> uploaded/specified, even for services not deployed
> https://bugs.launchpad.net/tripleo/+bug/1710992
> The CI jobs are timing out during the upgrade process because
> downloading and uploading _all_ containers into the local cache takes
> more than 20 minutes.
> So this is where we are now: upgrade jobs time out on that. Steve Baker
> is currently looking at it, but we'll probably offer some help.
>
>
> Solutions:
> - for stable/ocata: make upgrade jobs non-voting
> - for pike: keep upgrade jobs non-voting and release without upgrade
> testing
>
> Risks:
> - for stable/ocata: it's very likely that regressions will be
> introduced if the jobs are no longer voting.
> - for pike: the release won't have the same level of CI coverage as
> Ocata.
>
> Mitigations:
> - for stable/ocata: make the jobs non-voting and ask our core reviewers
> to pay double attention to what lands. This should be temporary, until
> we manage to fix the CI jobs.
> - for master: release RC1 without upgrade jobs and keep making progress.
> - run TripleO upgrade scenarios as third-party CI in RDO Cloud, or
> somewhere else with enough resources and no timeout constraints.
>
> I would like some feedback on the proposal so we can move forward this
> week,
> Thanks.
> --
> Emilien Macchi
>

I think that, given the limitations on run times upstream, we may need to
rethink the workflow for upgrade tests upstream. It's not very clear to me
what can be done with the multinode nodepool jobs beyond what is already
being done, but I think we do have some choices with the OVB jobs. I'm not
going to try to solve this in this email, but rethinking how we CI upgrades
in the upstream infrastructure should be a focus for the Queens PTG. We
will need to focus on bringing run times down significantly, as it's
incredibly difficult to run two installs in 175 minutes across all the
upstream cloud providers.
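
To make problem #3 more concrete, here is a minimal sketch of the kind of
filtering the fix needs: pull only the container images whose services are
actually part of the deployment, instead of mirroring the whole image list
into the local registry. The names and data shapes below (ALL_IMAGES,
filter_images, the service keys) are hypothetical and only illustrate the
idea; they are not the real TripleO image-prepare code.

    # Hypothetical sketch of the idea behind bug 1710992: only handle the
    # images for services that are actually deployed, instead of all of
    # them. Names and data shapes are illustrative, not real TripleO code.
    ALL_IMAGES = [
        {"imagename": "tripleoupstream/centos-binary-nova-compute:latest",
         "service": "OS::TripleO::Services::NovaCompute"},
        {"imagename": "tripleoupstream/centos-binary-ironic-conductor:latest",
         "service": "OS::TripleO::Services::IronicConductor"},
        {"imagename": "tripleoupstream/centos-binary-keystone:latest",
         "service": "OS::TripleO::Services::Keystone"},
    ]

    def filter_images(images, enabled_services):
        """Keep only the image entries whose service is enabled in the plan."""
        return [img for img in images if img["service"] in enabled_services]

    if __name__ == "__main__":
        # Services enabled by the (hypothetical) deployment plan.
        enabled = {
            "OS::TripleO::Services::NovaCompute",
            "OS::TripleO::Services::Keystone",
        }
        for img in filter_images(ALL_IMAGES, enabled):
            # Only these images would be downloaded/uploaded in CI.
            print(img["imagename"])

In a real job the enabled service list would come from the deployment plan,
and skipping images for services that are never deployed is what should
bring the 20+ minute download/upload step back under control.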

Thanks Emilien for all the work you have done around upgrades!


