[openstack-dev] [tripleo] container jobs are unstable

Dan Prince dprince at redhat.com
Thu Mar 30 14:11:04 UTC 2017


On Wed, 2017-03-29 at 22:07 -0400, Paul Belanger wrote:
> On Thu, Mar 30, 2017 at 09:56:59AM +1300, Steve Baker wrote:
> > > On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi <emilien at redhat.com> wrote:
> > 
> > > > On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco <flavio at redhat.com> wrote:
> > > > On 23/03/17 16:24 +0100, Martin André wrote:
> > > > > 
> > > > > > On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince <dprince at redhat.com> wrote:
> > > > > > 
> > > > > > On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> > > > > > > 
> > > > > > > On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> > > > > > > > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> > > > > > > > > Hey,
> > > > > > > > > 
> > > > > > > > > I've noticed that container jobs look pretty unstable
> > > > > > > > > lately; to me, it sounds like a timeout:
> > > > > > > > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-22_00_08_55_358973
> > > > > > > > 
> > > > > > > > There are different hypotheses on what is going on here.
> > > > > > > > Some patches have landed to improve the write performance
> > > > > > > > of containers by using hostpath mounts, but we think the
> > > > > > > > real slowness is coming from the image downloads.
> > > > > > > > 
> > > > > > > > That said, this is still under investigation and the
> > > > > > > > containers squad will report back as soon as there are new
> > > > > > > > findings.
> > > > > > > 
> > > > > > > Also, to be more precise, Martin André is looking into this.
> > > > > > > He also fixed the gate in the last 2 weeks.
> > > > > > 
> > > > > > 
> > > > > > I spoke w/ Martin on IRC. He seems to think this is the cause
> > > > > > of some of the failures:
> > > > > > 
> > > > > > http://logs.openstack.org/32/446432/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-0/var/log/extra/docker/containers/heat_engine/log/heat/heat-engine.log.txt.gz#_2017-03-21_20_26_29_697
> > > > > > 
> > > > > > 
> > > > > > Looks like Heat isn't able to create Nova instances in the
> > > > > > overcloud due to "Host 'overcloud-novacompute-0' is not mapped
> > > > > > to any cell". This means our cells initialization code for
> > > > > > containers may not be quite right... or there is a race
> > > > > > somewhere.
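For what it's worth, the usual remediation for that error in Ocata is to (re)run cell_v2 host discovery so newly registered compute services get mapped to a cell. A hedged sketch, guarded so it is a safe no-op on machines without nova installed:

```shell
# Sketch: map unmapped compute hosts (e.g. overcloud-novacompute-0) into a
# cell by re-running cell_v2 host discovery. Run this on a node whose
# nova.conf has access to the nova API database.
if command -v nova-manage >/dev/null 2>&1; then
    # --verbose prints each host as it gets mapped
    nova-manage cell_v2 discover_hosts --verbose
else
    echo "would run: nova-manage cell_v2 discover_hosts --verbose"
fi
```

If discovery succeeds but the error still shows up intermittently, a race between the compute service's first registration and a one-shot discovery run would fit; if I recall correctly, nova also grew a `[scheduler]discover_hosts_in_cells_interval` option to make discovery periodic.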
> > > > > 
> > > > > 
> > > > > Here are some findings. I've looked at time measures from CI for
> > > > > https://review.openstack.org/#/c/448533/ which provided the most
> > > > > recent results (times in minutes):
> > > > > 
> > > > > * gate-tripleo-ci-centos-7-ovb-ha [1]
> > > > >    undercloud install: 23
> > > > >    overcloud deploy: 72
> > > > >    total time: 125
> > > > > * gate-tripleo-ci-centos-7-ovb-nonha [2]
> > > > >    undercloud install: 25
> > > > >    overcloud deploy: 48
> > > > >    total time: 122
> > > > > * gate-tripleo-ci-centos-7-ovb-updates [3]
> > > > >    undercloud install: 24
> > > > >    overcloud deploy: 57
> > > > >    total time: 152
> > > > > * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
> > > > >    undercloud install: 28
> > > > >    overcloud deploy: 48
> > > > >    total time: 165 (timeout)
> > > > > 
> > > > > Looking at the undercloud & overcloud install times, the most
> > > > > time-consuming tasks, the containers job isn't doing that badly
> > > > > compared to other OVB jobs. But looking closer I could see that:
> > > > > - the containers job pulls docker images from dockerhub, and
> > > > > this process takes roughly 18 min.
> > > > 
> > > > 
> > > > I think we can optimize this a bit by having the script that
> > > > populates the local registry in the overcloud job run in parallel.
> > > > The docker daemon can do multiple pulls w/o problems.
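Something along these lines, perhaps (a minimal sketch: the image names are hypothetical, and `echo` stands in for the real `docker pull`, so this runs as a dry run):

```shell
# Sketch: feed the image list to xargs -P so several pulls run concurrently
# instead of one after another. Drop the 'echo' to pull for real.
IMAGES="tripleoupstream/centos-binary-heat-engine
tripleoupstream/centos-binary-nova-compute
tripleoupstream/centos-binary-neutron-server"

# -n 1 passes one image per invocation, -P 4 keeps up to four pulls in
# flight; the docker daemon deduplicates layers shared between images.
printf '%s\n' $IMAGES | xargs -r -n 1 -P 4 echo docker pull
```

This should mostly help the network-bound download phase; the daemon still does layer extraction on its own schedule.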
> > > > 
> > > > > - the overcloud validate task takes 10 min more than it should
> > > > > because of the bug Dan mentioned (a fix is in the queue at
> > > > > https://review.openstack.org/#/c/448575/)
> > > > 
> > > > 
> > > > +A
> > > > 
> > > > > - the postci takes a long time with quickstart, 13 min (4 min
> > > > > alone spent on docker log collection), whereas it takes only
> > > > > 3 min when using tripleo.sh
> > > > 
> > > > 
> > > > mmh, does this have anything to do with ansible being in between?
> > > > Or is that time specifically for the part that gets the logs?
> > > > 
> > > > > 
> > > > > Adding all these numbers, we're at about 40 min of additional
> > > > > time for the oooq containers job, which is enough to cross the
> > > > > CI job limit.
> > > > > 
> > > > > There is certainly a lot of room for optimization here and
> > > > > there, and I'll explore how we can speed up the containers CI
> > > > > job over the next
> > > > 
> > > > Thanks a lot for the update. The time breakdown is fantastic,
> > > > Flavio
> > > 
> > > TBH the problem is far from being solved:
> > > 
> > > 1. Click on https://status-tripleoci.rhcloud.com/
> > > 2. Select gate-tripleo-ci-centos-7-ovb-containers-oooq-nv
> > > 
> > > The container job has been failing more than 55% of the time.
> > > 
> > > As a reference:
> > > gate-tripleo-ci-centos-7-ovb-nonha has a 90% success rate.
> > > gate-tripleo-ci-centos-7-ovb-ha has a 64% success rate.
> > > 
> > > It clearly means the ovb-containers job was not, and is not, ready
> > > to run in the check pipeline; it's not reliable enough.
> > > 
> > > The current queue time in TripleO OVB is 11 hours. This is not
> > > acceptable for TripleO developers and we need a short-term solution,
> > > which is removing this job from the check pipeline:
> > > https://review.openstack.org/#/c/451546/
> > > 
> > > 
> > 
> > Yes, given resource constraints I don't see an alternative in the
> > short term.
> > 
> > 
> > > On the long-term, we need to:
> > > 
> > > - Stabilize ovb-containers, which is AFAIK already WIP by Martin
> > > (kudos to him). My hope is Martin gets enough help from the
> > > Container squad to work on this topic.
> > > - Remove the ovb-nonha scenario from the check pipeline - and
> > > probably keep it periodic. Dan Prince started some work on it:
> > > https://review.openstack.org/#/c/449791/ and
> > > https://review.openstack.org/#/c/449785/ - but not much progress on
> > > it in recent days.
> > > - Engage some work on getting multinode-scenario(001,002,003,004)
> > > jobs for containers, so we don't need as many OVB jobs (probably
> > > only one) for container scenarios.
> > > 
> > > 
> > 
> > Another work item in progress which should help with the stability
> > of the ovb containers job: Dan has set up a docker-distribution based
> > registry on a node in rhcloud. Once jobs are pulling images from
> > this, there should be fewer timeouts due to image pull speed.
> > 
> 
> Before we go and stand up private infrastructure for tripleo to depend
> on, can we please work on solving this for all openstack projects
> upstream? We do want to run regional mirrors for docker things,
> however we need to address how to integrate this with AFS.
> 
> We are trying to break the cycle of tripleo standing up private
> infrastructure and have it consume more community-based
> infrastructure. So far we are making good progress, however I would
> see this effort as a step backwards, not forward.

I would propose that we do both. Let's set up resources in-rack that
help us efficiently cache containers from dockerhub. And let's also do
the same within infra so that jobs running there benefit as well.

IMO a local, in-rack proxy/mirror that requires little to no
maintenance (which is all we are setting up here really) is a very good
pattern.

Are there other ideas that will allow us to avoid the overhead of
continually pulling images into our rack from dockerhub?
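For reference, docker-distribution (registry:2) already supports running as a pull-through cache, which is presumably what such an in-rack mirror would look like. A minimal config sketch (the port and storage path are illustrative, not what was actually deployed):

```yaml
# config.yml for a docker-distribution pull-through cache of dockerhub.
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
# proxy mode turns the registry into a read-through cache of the remote;
# layers are fetched from dockerhub once and served locally afterwards.
proxy:
  remoteurl: https://registry-1.docker.io
```

Clients would then point at the mirror via `registry-mirrors` in `/etc/docker/daemon.json`, e.g. `{"registry-mirrors": ["http://in-rack-mirror.example.com:5000"]}` (hostname hypothetical).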

Dan

> > 
> > > I know everyone is busy working on container support in composable
> > > services, but we might assign more resources to CI work here,
> > > otherwise I'm not sure how we're going to stabilize the CI.
> > > 
> > > Any feedback is very welcome.
> > > 
> > > > 
> > > > > weeks.
> > > > > 
> > > > > Martin
> > > > > 
> > > > > [1] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/d2c1b16/
> > > > > [2] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/d6df760/
> > > > > [3] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/3b1f795/
> > > > > [4] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/b816f20/
> > > > > 
> > > > > > Dan
> > > > > > 
> > > > > > > 
> > > > > > > Flavio
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > __________________________________________________________________
> > > > > > > OpenStack Development Mailing List (not for usage questions)
> > > > > > > Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> > > > > > > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > --
> > > > @flaper87
> > > > Flavio Percoco
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > > --
> > > Emilien Macchi
> > > 
> > > 
> 
> 



More information about the OpenStack-dev mailing list