[openstack-dev] [tripleo] critical situation with CI / upgrade jobs

Paul Belanger pabelanger at redhat.com
Wed Aug 16 14:00:09 UTC 2017


On Tue, Aug 15, 2017 at 11:06:20PM -0400, Wesley Hayutin wrote:
> On Tue, Aug 15, 2017 at 9:33 PM, Emilien Macchi <emilien at redhat.com> wrote:
> 
> > So far, we have 3 critical issues that we all need to address as
> > soon as we can.
> >
> > Problem #1: Upgrade jobs timeout from Newton to Ocata
> > https://bugs.launchpad.net/tripleo/+bug/1702955
> > Today I spent an hour looking at it and here's what I've found so far:
> > depending on which public cloud the TripleO CI jobs run on, they time
> > out or not.
> > Here's an example of Heat resources that run in our CI:
> > https://www.diffchecker.com/VTXkNFuk
> > On the left are the resources from a job that failed (running on
> > internap) and on the right from one that worked (running on citycloud).
> > I've been through all the upgrade steps and I haven't seen specific
> > tasks that take much longer on one cloud than the other, just many
> > small differences that add up to a big one at the end (so it's hard
> > to debug).
> > Note: both jobs use AFS mirrors.
> > Help on that front would be very welcome.
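
One way to narrow this down might be to dump the Heat events from a failing
and a passing run and diff the per-resource durations, rather than eyeballing
the resource lists. A rough, untested sketch in Python; the exact
'openstack stack event list' invocation, the JSON key names and the timestamp
format are assumptions that may need adjusting:

  # compare_heat_timings.py -- rough sketch, untested
  # Inputs are JSON dumps from each cloud, produced with something like:
  #   openstack stack event list overcloud --nested-depth 2 -f json > events.json
  import json
  import sys
  from datetime import datetime

  def durations(path):
      """Return seconds spent per resource, from IN_PROGRESS to COMPLETE."""
      starts, ends = {}, {}
      with open(path) as f:
          events = json.load(f)
      for ev in events:
          # key names and timestamp format assumed; adjust to your client output
          t = datetime.strptime(ev["event_time"], "%Y-%m-%dT%H:%M:%SZ")
          name, status = ev["resource_name"], ev["resource_status"]
          if status.endswith("IN_PROGRESS"):
              starts.setdefault(name, t)
          elif status.endswith("COMPLETE"):
              ends[name] = t
      return {n: (ends[n] - starts[n]).total_seconds()
              for n in ends if n in starts}

  if __name__ == "__main__":
      slow, fast = durations(sys.argv[1]), durations(sys.argv[2])
      for name in sorted(slow, key=lambda n: slow[n] - fast.get(n, 0),
                         reverse=True):
          print("%-55s %7.0fs vs %7.0fs" % (name, slow[name],
                                            fast.get(name, 0)))

That would at least point at which resources are eating the extra time on the
slower clouds.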
> >
> >
> > Problem #2: from Ocata to Pike (containerized) missing container upload
> > step
> > https://bugs.launchpad.net/tripleo/+bug/1710938
> > Wes has a patch (thanks!) that is currently in the gate:
> > https://review.openstack.org/#/c/493972
> > Thanks to that work, we managed to find problem #3.
> >
> >
> > Problem #3: from Ocata to Pike: all container images are
> > uploaded/specified, even for services not deployed
> > https://bugs.launchpad.net/tripleo/+bug/1710992
> > The CI jobs are timing out during the upgrade process because
> > downloading + uploading _all_ containers into the local cache takes
> > more than 20 minutes.
> > So this is where we are now: upgrade jobs time out on that. Steve Baker
> > is currently looking at it, but we'll probably offer some help.
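
For illustration, the shape of that fix is presumably to derive the image
list from the services that are actually enabled instead of pulling
everything. A toy sketch of the filtering (the overcloud_containers.yaml
layout and the service list below are assumptions for the example, not the
real tripleo-common logic):

  # filter_containers.py -- illustrative only, not the actual fix
  import sys
  import yaml

  # Hypothetical set of deployed services, just for the example
  DEPLOYED = {"nova", "neutron", "keystone", "glance", "heat"}

  def wanted(entry):
      # Assumes entries look like {"imagename": "tripleo/centos-binary-nova-api:..."}
      name = entry.get("imagename", "")
      return any(svc in name for svc in DEPLOYED)

  with open(sys.argv[1]) as f:
      data = yaml.safe_load(f)

  data["container_images"] = [e for e in data.get("container_images", [])
                              if wanted(e)]
  yaml.safe_dump(data, sys.stdout, default_flow_style=False)

Cutting the list down along those lines could shave a good chunk of the 20+
minutes of download/upload off jobs that only deploy a handful of services.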
> >
> >
> > Solutions:
> > - for stable/ocata: make upgrade jobs non-voting
> > - for pike: keep upgrade jobs non-voting and release without upgrade
> > testing
> >
> > Risks:
> > - for stable/ocata: it's highly possible to inject regressions if the
> > jobs aren't voting anymore.
> > - for pike: the quality of the release won't be good enough in terms of
> > CI coverage compared to Ocata.
> >
> > Mitigations:
> > - for stable/ocata: make the jobs non-voting and ask our core
> > reviewers to pay double attention to what is landed. This should be
> > temporary, until we manage to fix the CI jobs.
> > - for master: release RC1 without upgrade jobs and make progress
> > - Run TripleO upgrade scenarios as third party CI in RDO Cloud or
> > somewhere with resources and without timeout constraints.
> >
> > I would like some feedback on the proposal so we can move forward this
> > week,
> > Thanks.
> > --
> > Emilien Macchi
> >
> 
> I think that, due to some of the limitations on run times upstream, we may
> need to rethink the workflow for upgrade tests upstream. It's not very
> clear to me what can be done with the multinode nodepool jobs beyond what
> is already being done.  I think we do have some choices with OVB jobs.  I'm
> not going to try to solve this in this email, but rethinking how we CI
> upgrades in the upstream infrastructure should be a focus for the Queens
> PTG.  We will need to focus on bringing run times down significantly, as
> it's incredibly difficult to run two installs in 175 minutes across all
> the upstream cloud providers.
> 
Can you explain in more detail where the bottlenecks are for the 175 minutes?
That's just shy of 3 hours, which seems like more than enough time.

Not that it can be solved now, but maybe it is time to look at these jobs the
other way around: how can we make them faster, and what optimizations need to
be made?

One example: we spend a lot of time rebuilding RPM packages with DLRN.  It is
possible that in zuulv3 we'll be able to change the CI workflow so that only
one node builds the packages and all the other jobs download them from that
node.
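
The consuming side could be as simple as dropping a .repo file that points at
the repo published by the builder node instead of running DLRN locally. A
sketch of that idea (the URL layout and file names are invented for the
example; the real wiring would come from the zuulv3 job definitions):

  # use_prebuilt_repo.py -- sketch of the "build once, consume everywhere" idea
  import sys

  def write_repo(url, path="/etc/yum.repos.d/gating.repo"):
      # Point yum at the repo the designated builder node published,
      # instead of rebuilding the same packages with DLRN on every node.
      lines = [
          "[gating-repo]",
          "name=Gating repo from the designated builder node",
          "baseurl=" + url,
          "enabled=1",
          "gpgcheck=0",
          "priority=1",
      ]
      with open(path, "w") as f:
          f.write("\n".join(lines) + "\n")

  if __name__ == "__main__":
      # e.g. http://<builder-node>/dlrn/current/ (hypothetical)
      write_repo(sys.argv[1])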

Another thing we can look at is more parallel testing in place of serial.  I
can't point to anything specific, but it would be helpful to sit down with
somebody to better understand all the back and forth between undercloud /
overcloud / multinode / etc.

> Thanks Emilien for all the work you have done around upgrades!