[openstack-dev] [tripleo] container jobs are unstable

Paul Belanger pabelanger at redhat.com
Thu Mar 30 02:07:24 UTC 2017


On Thu, Mar 30, 2017 at 09:56:59AM +1300, Steve Baker wrote:
> On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi <emilien at redhat.com> wrote:
> 
> > On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco <flavio at redhat.com> wrote:
> > > On 23/03/17 16:24 +0100, Martin André wrote:
> > >>
> > >> On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince <dprince at redhat.com> wrote:
> > >>>
> > >>> On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> > >>>>
> > >>>> On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> > >>>> > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> > >>>> > > Hey,
> > >>>> > >
> > >>>> > > I've noticed that container jobs look pretty unstable lately; to
> > >>>> > > me, it sounds like a timeout:
> > >>>> > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-22_00_08_55_358973
> > >>>> >
> > >>>> > There are different hypotheses on what is going on here. Some
> > >>>> > patches have landed to improve the write performance of containers
> > >>>> > by using hostpath mounts, but we think the real slowness comes from
> > >>>> > downloading the images.
> > >>>> >
> > >>>> > That said, this is still under investigation and the containers
> > >>>> > squad will report back as soon as there are new findings.
> > >>>>
> > >>>> Also, to be more precise, Martin André is looking into this. He also
> > >>>> fixed the gate in the last 2 weeks.
> > >>>
> > >>>
> > >>> I spoke w/ Martin on IRC. He seems to think this is the cause of some
> > >>> of the failures:
> > >>>
> > >>> http://logs.openstack.org/32/446432/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-0/var/log/extra/docker/containers/heat_engine/log/heat/heat-engine.log.txt.gz#_2017-03-21_20_26_29_697
> > >>>
> > >>>
> > >>> Looks like Heat isn't able to create Nova instances in the overcloud
> > >>> due to "Host 'overcloud-novacompute-0' is not mapped to any cell". This
> > >>> means our cells initialization code for containers may not be quite
> > >>> right... or there is a race somewhere.
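> > >>>
> > >>> For illustration only, here is a minimal sketch (not the actual tripleo
> > >>> code; the hostname, retry count, and delay are assumptions) of the kind
> > >>> of retry around 'nova-manage cell_v2 discover_hosts' that would paper
> > >>> over an ordering race between compute registration and cell discovery:
> > >>>
> > >>> import subprocess
> > >>> import time
> > >>>
> > >>> for attempt in range(10):
> > >>>     # discover_hosts maps any not-yet-mapped compute hosts to a cell
> > >>>     # and is safe to re-run; --verbose prints newly created mappings.
> > >>>     out = subprocess.check_output(
> > >>>         ['nova-manage', 'cell_v2', 'discover_hosts', '--verbose'],
> > >>>         universal_newlines=True)
> > >>>     if 'overcloud-novacompute-0' in out:
> > >>>         # the host was just mapped on this pass
> > >>>         break
> > >>>     time.sleep(30)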
> > >>
> > >>
> > >> Here are some findings. I've looked at the timing measurements from CI
> > >> for https://review.openstack.org/#/c/448533/ which provided the most
> > >> recent results:
> > >>
> > >> * gate-tripleo-ci-centos-7-ovb-ha [1]
> > >>    undercloud install: 23 min
> > >>    overcloud deploy: 72 min
> > >>    total time: 125 min
> > >> * gate-tripleo-ci-centos-7-ovb-nonha [2]
> > >>    undercloud install: 25 min
> > >>    overcloud deploy: 48 min
> > >>    total time: 122 min
> > >> * gate-tripleo-ci-centos-7-ovb-updates [3]
> > >>    undercloud install: 24 min
> > >>    overcloud deploy: 57 min
> > >>    total time: 152 min
> > >> * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
> > >>    undercloud install: 28 min
> > >>    overcloud deploy: 48 min
> > >>    total time: 165 min (timeout)
> > >>
> > >> Looking at the undercloud & overcloud install times, the most
> > >> time-consuming tasks, the containers job isn't doing that badly compared
> > >> to the other OVB jobs. But looking closer I could see that:
> > >> - the containers job pulls docker images from dockerhub, and this
> > >> process takes roughly 18 min.
> > >
> > >
> > > I think we can optimize this a bit by having the script that populates
> > > the local registry in the overcloud job run in parallel. The docker
> > > daemon can handle multiple pulls without problems.
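> > >
> > > As a rough sketch of the idea (the image names and worker count here are
> > > placeholders, not the real script), something like a small thread pool
> > > around 'docker pull' would do:
> > >
> > > import subprocess
> > > from concurrent.futures import ThreadPoolExecutor
> > >
> > > # placeholder list; the real job builds this from the tripleo image set
> > > IMAGES = [
> > >     'tripleoupstream/centos-binary-heat-engine:latest',
> > >     'tripleoupstream/centos-binary-nova-compute:latest',
> > > ]
> > >
> > > def pull(image):
> > >     # each worker just shells out to the docker CLI; the daemon copes
> > >     # fine with several concurrent pulls
> > >     subprocess.check_call(['docker', 'pull', image])
> > >     return image
> > >
> > > with ThreadPoolExecutor(max_workers=4) as pool:
> > >     for image in pool.map(pull, IMAGES):
> > >         print('pulled', image)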
> > >
> > >> - the overcloud validate task takes 10 min more than it should because
> > >> of the bug Dan mentioned (a fix is in the queue at
> > >> https://review.openstack.org/#/c/448575/)
> > >
> > >
> > > +A
> > >
> > >> - the postci takes a long time with quickstart: 13 min (4 min of which
> > >> are spent on docker log collection alone), whereas it takes only 3 min
> > >> when using tripleo.sh
> > >
> > >
> > > mmh, does this have anything to do with ansible being in between? Or is
> > > that time specifically for the part that gets the logs?
> > >
> > >>
> > >> Adding all these numbers (roughly 18 min of image pulls, 10 min of
> > >> extra validation time, and 10 min of extra postci time), we're at about
> > >> 40 min of additional time for the oooq containers job, which is enough
> > >> to cross the CI job time limit.
> > >>
> > >> There is certainly a lot of room for optimization here and there and
> > >> I'll explore how we can speed up the containers CI job over the next
> > >
> > >
> > > Thanks a lot for the update. The time breakdown is fantastic,
> > > Flavio
> >
> > TBH the problem is far from being solved:
> >
> > 1. Click on https://status-tripleoci.rhcloud.com/
> > 2. Select gate-tripleo-ci-centos-7-ovb-containers-oooq-nv
> >
> > The container job has been failing more than 55% of the time.
> >
> > As a reference,
> > gate-tripleo-ci-centos-7-ovb-nonha has a 90% success rate and
> > gate-tripleo-ci-centos-7-ovb-ha has a 64% success rate.
> >
> > It clearly means the ovb-containers job was not and is not ready to be run
> > in the check pipeline; it's not reliable enough.
> >
> > The current queue time in TripleO OVB is 11 hours. This is not
> > acceptable for TripleO developers, and we need a short-term solution,
> > which is removing this job from the check pipeline:
> > https://review.openstack.org/#/c/451546/
> >
> >
> Yes, given resource constraints I don't see an alternative in the short
> term.
> 
> 
> > In the long term, we need to:
> >
> > - Stabilize ovb-containers, which is AFAIK already WIP by Martin (kudos
> > to him). My hope is that Martin gets enough help from the container squad
> > to work on this topic.
> > - Remove the ovb-nonha scenario from the check pipeline - and probably
> > keep it as a periodic job. Dan Prince started some work on it:
> > https://review.openstack.org/#/c/449791/ and
> > https://review.openstack.org/#/c/449785/ - but there hasn't been much
> > progress on it in recent days.
> > - Start some work on getting multinode-scenario(001,002,003,004) jobs
> > for containers, so we don't need as many OVB jobs (probably only one)
> > for the container scenarios.
> >
> >
> Another work item in progress which should help with the stability of the
> ovb containers job is that Dan has set up a docker-distribution based
> registry on a node in rhcloud. Once jobs are pulling images from this,
> there should be fewer timeouts due to slow image pulls.
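>
> To make the idea concrete, a minimal sketch of seeding such a registry
> (the endpoint, port, and image names below are assumptions, not the
> actual rhcloud setup):
>
> import subprocess
>
> REGISTRY = '192.168.1.10:8787'  # assumed mirror endpoint
> IMAGES = ['tripleoupstream/centos-binary-heat-engine:latest']
>
> # run docker-distribution once on the mirror node
> subprocess.check_call(['docker', 'run', '-d', '--restart=always',
>                        '-p', '8787:5000', '--name', 'registry',
>                        'registry:2'])
>
> for image in IMAGES:
>     local = '%s/%s' % (REGISTRY, image)
>     subprocess.check_call(['docker', 'pull', image])       # from Docker Hub
>     subprocess.check_call(['docker', 'tag', image, local])  # retag for mirror
>     subprocess.check_call(['docker', 'push', local])        # seed the mirror
>
> CI jobs would then pull the REGISTRY-prefixed image names; note that a
> plain HTTP registry has to be listed under the docker daemon's
> insecure-registries setting for the pushes and pulls to work.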
> 
Before we go and stand up private infrastructure for tripleo to depend on, can
we please work on solving this for all openstack projects upstream? We do
want to run regional mirrors for docker things; however, we need to address
the question of how to integrate this with AFS.

We are trying to break the cycle of tripleo standing up private infrastructure
and to have it consume more community-based infrastructure. So far we are
making good progress; however, I would see this effort as a step backwards,
not forwards.

> 
> > I know everyone is busy working on container support in composable
> > services, but we might need to assign more resources to CI work here;
> > otherwise I'm not sure how we're going to stabilize the CI.
> >
> > Any feedback is very welcome.
> >
> > >
> > >> weeks.
> > >>
> > >> Martin
> > >>
> > >> [1] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/d2c1b16/
> > >> [2] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/d6df760/
> > >> [3] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/3b1f795/
> > >> [4] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/b816f20/
> > >>
> > >>> Dan
> > >>>
> > >>>>
> > >>>> Flavio
> > >
> > > --
> > > @flaper87
> > > Flavio Percoco
> > >
> >
> > --
> > Emilien Macchi