[openstack-dev] [tripleo] critical situation with CI / upgrade jobs

Marios Andreou mandreou at redhat.com
Wed Aug 16 07:37:32 UTC 2017


On Wed, Aug 16, 2017 at 4:33 AM, Emilien Macchi <emilien at redhat.com> wrote:

> So far, we have 3 critical issues that we all need to address as
> soon as we can.
>
> Problem #1: Upgrade jobs timeout from Newton to Ocata
> https://bugs.launchpad.net/tripleo/+bug/1702955
> Today I spent an hour looking at it and here's what I've found so far:
> depending on which public cloud the TripleO CI jobs run on, they either
> time out or they don't.
> Here's an example of Heat resources that run in our CI:
> https://www.diffchecker.com/VTXkNFuk
> On the left are the resources from a job that failed (running on
> internap), and on the right a job that worked (running on citycloud).
> I've been through all the upgrade steps and haven't seen any specific
> task that takes much longer in one place, just lots of small
> differences that add up to a big one at the end (so it's hard to debug).
> Note: both jobs use AFS mirrors.
> Help on that front would be very welcome.
>
>
> Problem #2: from Ocata to Pike (containerized): missing container
> upload step
> https://bugs.launchpad.net/tripleo/+bug/1710938
> Wes has a patch (thanks!) that is currently in the gate:
> https://review.openstack.org/#/c/493972
> Thanks to that work, we managed to find problem #3.
>
>
> Problem #3: from Ocata to Pike: all container images are
> uploaded/specified, even for services not deployed
> https://bugs.launchpad.net/tripleo/+bug/1710992
> The CI jobs are timing out during the upgrade process because
> downloading + uploading _all_ containers into the local cache takes
> more than 20 minutes.
> So this is where we are now: upgrade jobs time out on that. Steve Baker
> is currently looking at it, but we'll probably offer some help.
>
>
> Solutions:
> - for stable/ocata: make upgrade jobs non-voting
> - for pike: keep upgrade jobs non-voting and release without upgrade
> testing
>
>
+1, but for Ocata to Pike it sounds like the container/image related
problems #2 and #3 above are both in progress or being looked at
(weshay/sbaker ++), in which case we might be able to fix the O...P jobs
at least?
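
Just to illustrate the direction for #3 (a rough sketch only, not the
actual patch or the real tripleo image-prepare code; the service-to-image
mapping and names below are invented): the idea is to build the upload
list from the services actually enabled in the deployment, instead of
pulling and pushing every image into the local cache.

    # Hypothetical sketch only -- the mapping and names are invented for
    # illustration, not taken from the real tripleo-common code.
    SERVICE_IMAGES = {
        'OS::TripleO::Services::NovaApi': 'centos-binary-nova-api',
        'OS::TripleO::Services::GlanceApi': 'centos-binary-glance-api',
        'OS::TripleO::Services::OpenDaylightApi': 'centos-binary-opendaylight',
    }

    def images_to_upload(enabled_services):
        """Return only the images for services enabled in the deployment."""
        return sorted(image for service, image in SERVICE_IMAGES.items()
                      if service in enabled_services)

    # e.g. OpenDaylight is not in the deployed roles, so its image is
    # skipped instead of being downloaded and uploaded for nothing.
    print(images_to_upload({'OS::TripleO::Services::NovaApi',
                            'OS::TripleO::Services::GlanceApi'}))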

For Newton to Ocata, is it consistent which clouds we are timing out on?
I've looked at https://bugs.launchpad.net/tripleo/+bug/1702955 before,
and I know other folks from upgrades have too, but we couldn't find a
root cause, or any upgrade operation taking too long/timing out/erroring
etc. If it is consistent which clouds time out, we can use that info to
guide us if we do make the jobs non-voting for N...O (e.g. a known list
of 'timing out clouds' to decide whether we should inspect the CI logs
more closely before merging a patch). Obviously only until/unless we
actually root cause that one (I will also find some time to check again).
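
To make that 'timing out clouds' list concrete, something as simple as
the sketch below would do. It assumes we export recent N...O upgrade job
results (cloud provider + final status) to a CSV, e.g. pulled together
from logstash; the file name and column names here are made up:

    # Hypothetical sketch: tally upgrade job timeouts per cloud provider.
    # 'upgrade-job-results.csv' and its 'cloud'/'status' columns are
    # assumptions for illustration only.
    import csv
    from collections import Counter

    timeouts = Counter()
    totals = Counter()
    with open('upgrade-job-results.csv') as f:
        for row in csv.DictReader(f):
            totals[row['cloud']] += 1
            if row['status'] == 'TIMED_OUT':
                timeouts[row['cloud']] += 1

    for cloud in sorted(totals, key=lambda c: timeouts[c], reverse=True):
        print('%-12s %d/%d timed out' % (cloud, timeouts[cloud],
                                         totals[cloud]))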



> Risks:
> - for stable/ocata: it's quite possible we'll introduce regressions if
> the jobs aren't voting anymore.
> - for pike: the quality of the release won't be good enough in terms of
> CI coverage compared to Ocata.
>
> Mitigations:
> - for stable/ocata: make jobs non-voting and ask our core reviewers to
> pay double attention to what is landed. This should be temporary, until
> we manage to fix the CI jobs.
> - for master: release RC1 without upgrade jobs and make progress
>

For master, +1. I think this is essentially what I am saying above for
O...P: problem #2 is well in progress from weshay and the other
container/image related problem #3 is the main outstanding item. Since
RC1 is this week, I think what you are proposing as mitigation is fair,
so we re-evaluate making these jobs voting before the final RCs at the
end of August.


> - Run TripleO upgrade scenarios as third party CI in RDO Cloud or
> somewhere with resources and without timeout constraints.


> I would like some feedback on the proposal so we can move forward this
> week,
> Thanks.
>


Thanks for putting this together. If we really had to pick one, the
O...P CI obviously has priority this week (!). I think the
container/image related issues for O...P are both expected teething
problems from the huge amount of work done by the containerization team,
and can hopefully be resolved quickly.

marios



> --
> Emilien Macchi
>