[openstack-dev] [tripleo] container jobs are unstable

Steve Baker sbaker at redhat.com
Wed Mar 29 20:56:59 UTC 2017


On Thu, Mar 30, 2017 at 9:39 AM, Emilien Macchi <emilien at redhat.com> wrote:

> On Mon, Mar 27, 2017 at 8:00 AM, Flavio Percoco <flavio at redhat.com> wrote:
> > On 23/03/17 16:24 +0100, Martin André wrote:
> >>
> >> On Wed, Mar 22, 2017 at 2:20 PM, Dan Prince <dprince at redhat.com> wrote:
> >>>
> >>> On Wed, 2017-03-22 at 13:35 +0100, Flavio Percoco wrote:
> >>>>
> >>>> On 22/03/17 13:32 +0100, Flavio Percoco wrote:
> >>>> > On 21/03/17 23:15 -0400, Emilien Macchi wrote:
> >>>> > > Hey,
> >>>> > >
> >>>> > > I've noticed that container jobs look pretty unstable lately; to me,
> >>>> > > it sounds like a timeout:
> >>>> > > http://logs.openstack.org/19/447319/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/bca496a/console.html#_2017-03-22_00_08_55_358973
> >>>> >
> >>>> > There are different hypotheses on what is going on here. Some patches
> >>>> > have landed to improve the write performance of containers by using
> >>>> > hostpath mounts, but we think the real slowness is coming from the
> >>>> > image downloads.
> >>>> >
> >>>> > That said, this is still under investigation and the containers squad
> >>>> > will report back as soon as there are new findings.
> >>>>
> >>>> Also, to be more precise, Martin André is looking into this. He also
> >>>> fixed the gate in the last 2 weeks.
> >>>
> >>>
> >>> I spoke w/ Martin on IRC. He seems to think this is the cause of some
> >>> of the failures:
> >>>
> >>> http://logs.openstack.org/32/446432/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/543bc80/logs/oooq/overcloud-controller-0/var/log/extra/docker/containers/heat_engine/log/heat/heat-engine.log.txt.gz#_2017-03-21_20_26_29_697
> >>>
> >>>
> >>> Looks like Heat isn't able to create Nova instances in the overcloud
> >>> due to "Host 'overcloud-novacompute-0' is not mapped to any cell". This
> >>> means our cells initialization code for containers may not be quite
> >>> right... or there is a race somewhere.
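(As a point of reference, that error normally clears once "nova-manage
cell_v2 discover_hosts" has run after the compute service registers itself,
so a race between registration and discovery would fit. Below is a minimal
sketch of retrying discovery until the host is mapped; the retry count,
interval and the list_hosts check are assumptions for illustration, not the
actual tripleo code.)

    #!/usr/bin/env python
    # Hypothetical retry loop around cell_v2 host discovery, meant to run
    # on the node where nova-manage is available. Retry count and interval
    # are arbitrary.
    import subprocess
    import sys
    import time

    HOST = "overcloud-novacompute-0"

    def discover(retries=10, delay=30):
        for _ in range(retries):
            # discover_hosts maps newly registered compute hosts to a cell
            subprocess.call(["nova-manage", "cell_v2", "discover_hosts",
                             "--verbose"])
            # list_hosts shows which hosts are already mapped
            mapped = subprocess.check_output(
                ["nova-manage", "cell_v2", "list_hosts"]).decode()
            if HOST in mapped:
                return True
            time.sleep(delay)
        return False

    if __name__ == "__main__":
        sys.exit(0 if discover() else 1)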
> >>
> >>
> >> Here are some findings. I've looked at time measures (in minutes) from
> >> CI for https://review.openstack.org/#/c/448533/ which provided the most
> >> recent results:
> >>
> >> * gate-tripleo-ci-centos-7-ovb-ha [1]
> >>    undercloud install: 23
> >>    overcloud deploy: 72
> >>    total time: 125
> >> * gate-tripleo-ci-centos-7-ovb-nonha [2]
> >>    undercloud install: 25
> >>    overcloud deploy: 48
> >>    total time: 122
> >> * gate-tripleo-ci-centos-7-ovb-updates [3]
> >>    undercloud install: 24
> >>    overcloud deploy: 57
> >>    total time: 152
> >> * gate-tripleo-ci-centos-7-ovb-containers-oooq-nv [4]
> >>    undercloud install: 28
> >>    overcloud deploy: 48
> >>    total time: 165 (timeout)
> >>
> >> Looking at the undercloud & overcloud install times, the most
> >> time-consuming tasks, the containers job isn't doing that badly compared
> >> to other OVB jobs. But looking closer I could see that:
> >> - the containers job pulls docker images from dockerhub; this process
> >> takes roughly 18 min.
> >
> >
> > I think we can optimize this a bit by having the script that populates
> > the local registry in the overcloud job run its pulls in parallel. The
> > docker daemon can do multiple pulls w/o problems.
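A minimal sketch of what parallel pulls could look like; the image list and
worker count below are placeholders for illustration, not the actual
registry-population script:

    #!/usr/bin/env python
    # Pull a set of images concurrently instead of one at a time; each
    # worker just shells out to the docker CLI and the daemon handles
    # having several pulls in flight at once.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    IMAGES = [
        "tripleoupstream/centos-binary-heat-engine",
        "tripleoupstream/centos-binary-nova-compute",
        "tripleoupstream/centos-binary-neutron-server",
    ]

    def pull(image):
        subprocess.run(["docker", "pull", image], check=True)
        return image

    with ThreadPoolExecutor(max_workers=4) as pool:
        for image in pool.map(pull, IMAGES):
            print("pulled", image)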
> >
> >> - the overcloud validate task takes 10 min more than it should because
> >> of the bug Dan mentioned (a fix is in the queue at
> >> https://review.openstack.org/#/c/448575/)
> >
> >
> > +A
> >
> >> - the postci takes a long time with quickstart, 13 min (4 min alone
> >> spent on docker log collection) whereas it takes only 3 min when using
> >> tripleo.sh
> >
> >
> > mmh, does this have anything to do with ansible being in between? Or is
> > that time specifically for the part that gets the logs?
> >
> >>
> >> Adding all these numbers, we're at about 40 min of additional time for
> >> the oooq containers job, which is enough to cross the CI job limit.
> >>
> >> There is certainly a lot of room for optimization here and there and
> >> I'll explore how we can speed up the containers CI job over the next
> >
> >
> > Thanks a lot for the update. The time breakdown is fantastic,
> > Flavio
>
> TBH the problem is far from being solved:
>
> 1. Click on https://status-tripleoci.rhcloud.com/
> 2. Select gate-tripleo-ci-centos-7-ovb-containers-oooq-nv
>
> The container job has been failing more than 55% of the time.
>
> As a reference,
> gate-tripleo-ci-centos-7-ovb-nonha has a 90% success rate.
> gate-tripleo-ci-centos-7-ovb-ha has a 64% success rate.
>
> It clearly means the ovb-containers job was not and is not ready to be
> run in the check pipeline; it's not reliable enough.
>
> The current queue time in TripleO OVB is 11 hours. This is not
> acceptable for TripleO developers and we need a short-term solution,
> which is removing this job from the check pipeline:
> https://review.openstack.org/#/c/451546/
>
>
Yes, given resource constraints I don't see an alternative in the short
term.


> In the long term, we need to:
>
> - Stabilize ovb-containers, which is AFAIK already a WIP by Martin
> (kudos to him). My hope is that Martin gets enough help from the
> Container squad to work on this topic.
> - Remove the ovb-nonha scenario from the check pipeline - and probably
> keep it as a periodic job. Dan Prince started some work on it:
> https://review.openstack.org/#/c/449791/ and
> https://review.openstack.org/#/c/449785/ - but there has not been much
> progress on it in recent days.
> - Start some work on getting multinode scenario (001, 002, 003, 004)
> jobs for containers, so we don't need as many OVB jobs (probably only
> one) for container scenarios.
>
>
Another work item in progress that should help with the stability of the
ovb containers job: Dan has set up a docker-distribution based registry
on a node in rhcloud. Once jobs are pulling images from it, there should
be fewer timeouts due to image pull speed.
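To give an idea of what that buys us, here is a rough sketch of how a job
could consume such a registry; the registry address and image names below
are placeholders, not the actual setup:

    #!/usr/bin/env python
    # Pull images from a nearby docker-distribution registry and retag
    # them to the names the deployment expects, instead of pulling from
    # dockerhub over the WAN.
    import subprocess

    LOCAL_REGISTRY = "registry.example.com:8787"  # placeholder address

    IMAGES = [
        "tripleoupstream/centos-binary-heat-engine:latest",
        "tripleoupstream/centos-binary-nova-compute:latest",
    ]

    for image in IMAGES:
        mirrored = "%s/%s" % (LOCAL_REGISTRY, image)
        subprocess.run(["docker", "pull", mirrored], check=True)
        # keep the original name so the templates don't need to change
        subprocess.run(["docker", "tag", mirrored, image], check=True)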


> I know everyone is busy working on container support in composable
> services, but we might need to assign more resources to CI work here;
> otherwise I'm not sure how we're going to stabilize the CI.
>
> Any feedback is very welcome.
>
> >
> >> weeks.
> >>
> >> Martin
> >>
> >> [1] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha/d2c1b16/
> >> [2] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/d6df760/
> >> [3] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/3b1f795/
> >> [4] http://logs.openstack.org/33/448533/2/check-tripleo/gate-tripleo-ci-centos-7-ovb-containers-oooq-nv/b816f20/
> >>
> >>> Dan
> >>>
> >>>>
> >>>> Flavio
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >>
> >
> >
> > --
> > @flaper87
> > Flavio Percoco
> >
> >
>
>
>
> --
> Emilien Macchi
>
>