[openstack-dev] [tripleo] container jobs are unstable

Wesley Hayutin whayutin at redhat.com
Thu Apr 6 21:42:42 UTC 2017


On Thu, Mar 30, 2017 at 10:08 AM, Steven Hardy <shardy at redhat.com> wrote:

> On Wed, Mar 29, 2017 at 10:07:24PM -0400, Paul Belanger wrote:
> > On Thu, Mar 30, 2017 at 09:56:59AM +1300, Steve Baker wrote:
> > > On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi <emilien at redhat.com> wrote:
> > >
> > > > On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco <flavio at redhat.com> wrote:
> > > > > On 23/03/17 16:24 +0100, Martin André wrote:
> > > > >>
> > > > >> On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince <dprince at redhat.com> wrote:
> > > > >>>
> > > > >>> On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> > > > >>>>
> > > > >>>> On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> > > > >>>> > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> > > > >>>> > > Hey,
> > > > >>>> > >
> > > > >>>> > > I've noticed that container jobs look pretty unstable lately;
> > > > >>>> > > to me, it sounds like a timeout:
> > > > >>>> > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-22_00_08_55_358973
> > > > >>>> >
> > > > >>>> > There are different hypotheses about what is going on here.
> > > > >>>> > Some patches have landed to improve the write performance on
> > > > >>>> > containers by using hostpath mounts, but we think the real
> > > > >>>> > slowness is coming from the image downloads.
> > > > >>>> >
> > > > >>>> > That said, this is still under investigation and the containers
> > > > >>>> > squad will report back as soon as there are new findings.
> > > > >>>>
> > > > >>>> Also, to be more precise, Martin André is looking into this. He
> > > > >>>> also fixed the gate in the last 2 weeks.
> > > > >>>
> > > > >>>
> > > > >>> I spoke w/ Martin on IRC. He seems to think this is the cause of
> > > > >>> some of the failures:
> > > > >>>
> > > > >>> http://logs.openstack.org/32/446432/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-0/var/log/extra/docker/containers/heat_engine/log/heat/heat-engine.log.txt.gz#_2017-03-21_20_26_29_697
> > > > >>>
> > > > >>>
> > > > >>> Looks like Heat isn't able to create Nova instances in the
> > > > >>> overcloud due to "Host 'overcloud-novacompute-0' is not mapped to
> > > > >>> any cell". This means our cells initialization code for containers
> > > > >>> may not be quite right... or there is a race somewhere.
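For context, that error usually appears when "nova-manage cell_v2
discover_hosts" runs before (or never runs after) the nova-compute service on
overcloud-novacompute-0 has registered itself, so no host-to-cell mapping
exists yet. A minimal sketch of the ordering involved, assuming the standard
nova-manage and openstack CLIs; the host name and retry counts are only
illustrative and this is not the actual TripleO cells initialization code:

    # Hedged sketch only: wait for the compute service record, then map hosts.
    import subprocess
    import time

    def compute_registered(host="overcloud-novacompute-0"):
        """Return True once the nova-compute service record for the host exists."""
        out = subprocess.run(
            ["openstack", "compute", "service", "list",
             "--service", "nova-compute", "-f", "value", "-c", "Host"],
            capture_output=True, text=True, check=True).stdout
        return host in out.split()

    def map_computes_to_cells():
        # Host-to-cell mappings are only created when discover_hosts runs
        # *after* the compute service record exists; running it too early
        # leaves the host unmapped and boot requests fail with
        # "Host ... is not mapped to any cell".
        for _ in range(30):
            if compute_registered():
                break
            time.sleep(10)
        subprocess.run(["nova-manage", "cell_v2", "discover_hosts", "--verbose"],
                       check=True)

    if __name__ == "__main__":
        map_computes_to_cells()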
> > > > >>
> > > > >>
> > > > >> Here are some findings. I've looked at time measures from CI for
> > > > >> https://review.openstack.org/#/c/448533/ which provided the most
> > > > >> recent results (times in minutes):
> > > > >>
> > > > >> * gate-tripleo-ci-centos-7-ovb-ha [1]
> > > > >>    undercloud install: 23
> > > > >>    overcloud deploy: 72
> > > > >>    total time: 125
> > > > >> * gate-tripleo-ci-centos-7-ovb-nonha [2]
> > > > >>    undercloud install: 25
> > > > >>    overcloud deploy: 48
> > > > >>    total time: 122
> > > > >> * gate-tripleo-ci-centos-7-ovb-updates [3]
> > > > >>    undercloud install: 24
> > > > >>    overcloud deploy: 57
> > > > >>    total time: 152
> > > > >> * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
> > > > >>    undercloud install: 28
> > > > >>    overcloud deploy: 48
> > > > >>    total time: 165 (timeout)
> > > > >>
> > > > >> Looking at the undercloud & overcloud install times, the most
> > > > >> time-consuming tasks, the containers job isn't doing that badly
> > > > >> compared to other OVB jobs. But looking closer I could see that:
> > > > >> - the containers job pulls docker images from dockerhub; this
> > > > >> process takes roughly 18 min.
> > > > >
> > > > >
> > > > > I think we can optimize this a bit by having the script that
> > > > > populates the local registry in the overcloud job run in parallel.
> > > > > The docker daemon can do multiple pulls w/o problems.
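A minimal sketch of that suggestion, assuming docker is on PATH; the image
names and worker count are only illustrative, not the actual list the job
builds from the TripleO container image files:

    # Hedged sketch: pull images concurrently instead of one at a time.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    IMAGES = [
        "tripleoupstream/centos-binary-heat-engine:latest",
        "tripleoupstream/centos-binary-nova-compute:latest",
        "tripleoupstream/centos-binary-neutron-server:latest",
    ]

    def pull(image):
        # The docker daemon copes fine with several simultaneous pulls, so
        # wall-clock time approaches that of the largest single image.
        subprocess.run(["docker", "pull", image], check=True)
        return image

    with ThreadPoolExecutor(max_workers=4) as pool:
        for image in pool.map(pull, IMAGES):
            print("pulled", image)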
> > > > >
> > > > >> - the overcloud validate task takes 10 min more than it should
> > > > >> because of the bug Dan mentioned (a fix is in the queue at
> > > > >> https://review.openstack.org/#/c/448575/)
> > > > >
> > > > >
> > > > > +A
> > > > >
> > > > >> - the postci takes a long time with quickstart, 13 min (4 min alone
> > > > >> spent on docker log collection) whereas it takes only 3 min when
> > > > >> using tripleo.sh
> > > > >
> > > > >
> > > > > mmh, does this have anything to do with ansible being in between?
> > > > > Or is that time specifically for the part that gets the logs?
> > > > >
> > > > >>
> > > > >> Adding all these numbers, we're at about 40 min of additional time
> > > > >> for the oooq containers job, which is enough to cross the CI job
> > > > >> limit.
> > > > >>
> > > > >> There is certainly a lot of room for optimization here and there,
> > > > >> and I'll explore how we can speed up the containers CI job over the
> > > > >> next
> > > > >
> > > > >
> > > > > Thanks a lot for the update. The time breakdown is fantastic,
> > > > > Flavio
> > > >
> > > > TBH the problem is far from being solved:
> > > >
> > > > 1. Click on https://status-tripleoci.rhcloud.com/
> > > > 2. Select gate-tripleo-ci-centos-7-ovb-containers-oooq-nv
> > > >
> > > > The container job has been failing more than 55% of the time.
> > > >
> > > > As a reference,
> > > > gate-tripleo-ci-centos-7-ovb-nonha has a 90% success rate.
> > > > gate-tripleo-ci-centos-7-ovb-ha has a 64% success rate.
> > > >
> > > > It clearly means the ovb-containers job was and is not ready to be
> > > > run in the check pipeline; it's not reliable enough.
> > > >
> > > > The current queue time in TripleO OVB is 11 hours. This is not
> > > > acceptable for TripleO developers and we need a short-term solution,
> > > > which is to disable this job in the check pipeline:
> > > > https://review.openstack.org/#/c/451546/
> > > >
> > > >
> > > Yes, given resource constraints I don't see an alternative in the short
> > > term.
> > >
> > >
> > > > In the long term, we need to:
> > > >
> > > > - Stabilize ovb-containers, which is AFAIK already WIP by Martin
> > > > (kudos to him). My hope is that Martin gets enough help from the
> > > > Container squad to work on this topic.
> > > > - Remove the ovb-nonha scenario from the check pipeline - and probably
> > > > keep it periodic. Dan Prince started some work on it:
> > > > https://review.openstack.org/#/c/449791/ and
> > > > https://review.openstack.org/#/c/449785/ - but there has not been much
> > > > progress on it in recent days.
> > > > - Engage some work on getting multinode-scenario(001,002,003,004) jobs
> > > > for containers, so we don't need many OVB jobs (probably only one) for
> > > > container scenarios.
> > > >
> > > >
> > > Another work item in progress which should help with the stability of
> > > the ovb containers job is that Dan has set up a docker-distribution
> > > based registry on a node in rhcloud. Once jobs are pulling images from
> > > this, there should be fewer timeouts due to image pull speed.
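To illustrate the idea (this is not Dan's actual rhcloud setup), the upstream
docker-distribution ("registry:2") image can be run as a pull-through cache of
Docker Hub, so each image is fetched from the hub at most once and served
locally afterwards. A rough sketch, with the port and container name picked
arbitrarily:

    # Hedged sketch: run a registry:2 (docker-distribution) container as a
    # pull-through cache of Docker Hub.
    import subprocess

    def start_registry_cache(port=5000):
        subprocess.run([
            "docker", "run", "-d", "--restart=always",
            "--name", "registry-cache",
            "-p", f"{port}:5000",
            # proxy.remoteurl turns the registry into a read-through cache.
            "-e", "REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io",
            "registry:2",
        ], check=True)

    if __name__ == "__main__":
        start_registry_cache()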
> > >
> > Before we go and stand up private infrastructure for tripleo to depend
> > on, can we please work on solving this for all openstack projects
> > upstream? We do want to run regional mirrors for docker things, however
> > we need to address issues on how to integrate this with AFS.
> >
> > We are trying to break the cycle of tripleo standing up private
> > infrastructure and to consume more community-based infrastructure
> > instead. So far we are making good progress; however, I would see this
> > effort as a step backwards, not forwards.
>
> To be fair, we discussed this on IRC yesterday; everyone agreed an
> infra-supported docker cache/registry was a great idea, but you said there was no
> known timeline for it actually getting done.
>
> So while we all want to see that happen, and potentially help out with the
> effort, we're also trying to mitigate the fact that work isn't done by
> working around it in our OVB environment.
>
> FWIW I think we absolutely need multinode container jobs, e.g. using infra
> resources, as that has worked out great for our puppet-based CI, but we
> really need to work out how to optimize the container download speed in
> that environment before that will work well AFAIK.
>

Gabriele has started working on this:
https://review.openstack.org/#/c/454152/



>
> You referenced https://review.openstack.org/#/c/447524/ in your other
> reply, which AFAICS is a spec about publishing to dockerhub, which sounds
> great, but we have the opposite problem: we need to consume those published
> images during our CI runs, and currently downloading images takes too long.
> So we ideally need some sort of local registry/pull-through-cache that
> speeds up that process.
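On the consuming side, a sketch of what pointing a CI node's docker daemon at
such a pull-through cache might look like; the mirror URL is a placeholder,
and if the mirror misses or is down the daemon still falls back to Docker Hub:

    # Hedged sketch: configure the docker daemon to prefer a nearby
    # pull-through cache for Docker Hub images.
    import json

    MIRROR = "http://registry-cache.example.com:5000"  # placeholder URL

    with open("/etc/docker/daemon.json", "w") as f:
        json.dump({"registry-mirrors": [MIRROR]}, f, indent=2)

    # The daemon has to be restarted to pick this up, e.g.
    # "systemctl restart docker".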
>
> How can we move forward here? Is there anyone on the infra side we can work
> with to discuss further?
>
> Thanks!
>
> Steve
>