[openstack-dev] [heat][infra] Help needed! high gate failure rate
pabelanger at redhat.com
Thu Aug 10 16:04:25 UTC 2017
On Thu, Aug 10, 2017 at 07:22:42PM +0530, Rabi Mishra wrote:
> On Thu, Aug 10, 2017 at 4:34 PM, Rabi Mishra <ramishra at redhat.com> wrote:
> > On Thu, Aug 10, 2017 at 2:51 PM, Ian Wienand <iwienand at redhat.com> wrote:
> >> On 08/10/2017 06:18 PM, Rico Lin wrote:
> >> > We're facing a high failure rate in Heat's gates: four of our gate jobs
> >> > have had failure rates from 6% to nearly 20% over the last 14 days, which
> >> > leaves most of our patches stuck in the gate.
> >> There has been a confluence of things causing some problems recently.
> >> The loss of OSIC has distributed more load over everything else, and
> >> we have seen an increase in job timeouts and intermittent networking
> >> issues (especially if you're downloading large things from remote
> >> sites). There have also been some issues with the mirror in rax-ord
> >> (see the first link below).
> >> 
> >> > gate-heat-dsvm-functional-convg-mysql-lbaasv2-ubuntu-xenial (19.67%)
> >> > gate-heat-dsvm-functional-convg-mysql-lbaasv2-non-apache-ubuntu-xenial (9.09%)
> >> > gate-heat-dsvm-functional-orig-mysql-lbaasv2-ubuntu-xenial (8.47%)
> >> > gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial (6.00%)
> >> > We are still trying to find the cause, but (IMO) it seems something might
> >> > be wrong with our infra. We need some help from the infra team: is there
> >> > any clue about this failure rate?
> >> The reality is you're just going to have to triage this and be a *lot*
> >> more specific with issues.
> > One of the issues we have seen recently is that many jobs are killed midway
> > through the tests because the job times out (120 mins). Jobs often seem to
> > be scheduled on very slow nodes, where setting up devstack takes more
> > than 80 mins.
> > http://logs.openstack.org/49/492149/2/check/gate-heat-dsvm-functional-orig-mysql-lbaasv2-ubuntu-xenial/03b05dd/console.html#_2017-08-10_05_55_49_035693
> > We download an image from a fedora mirror and it seems to take more than
> Probably an issue with the specific mirror or some infra network bandwidth
> issue. I've submitted a patch to change the mirror to see if that helps.
Today we mirror both fedora-26 and fedora-25 (to be removed shortly). So if
you want to consider bumping your image for testing, you can fetch it from our
You can source /etc/ci/mirror_info.sh to get information about things we mirror.
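
A minimal sketch of inspecting that file from a job script, assuming a POSIX
`sh` is available on the node. The helper names `shell_env` and `sourced_env`
are hypothetical, and no particular variable names are assumed, since the
contents of the file are not shown here; the sketch only reports whatever the
file adds to or changes in the shell environment.

```python
import subprocess


def shell_env(script, *args):
    """Run `script` under sh and parse the NAME=value lines it prints."""
    result = subprocess.run(
        ["sh", "-c", script, "sh", *args],
        check=True,
        capture_output=True,
        text=True,
    )
    env = {}
    for line in result.stdout.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # values containing newlines are not handled in this sketch
            env[name] = value
    return env


def sourced_env(path):
    """Return the variables that sourcing `path` adds or changes."""
    before = shell_env("env")
    # "$1" is the file path; "." is the POSIX spelling of "source".
    after = shell_env('. "$1" >/dev/null && env', path)
    return {k: v for k, v in after.items() if before.get(k) != v}
```

On a CI node, something like `sourced_env("/etc/ci/mirror_info.sh")` would
then list the mirror settings; pass an absolute path, because `.` searches
PATH for names that contain no slash.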
> >> I find opening an etherpad and going
> >> through the failures one-by-one helpful (e.g. I keep the
> >> centos7-dsvm-triage etherpad linked below for centos jobs I'm
> >> interested in).
> >> Looking at the top of the console.html log you'll have the host and
> >> provider/region stamped in there. If it's timeouts or network issues,
> >> reporting to infra the time, provider and region of failing jobs will
> >> help. Finding patterns is the first step to understanding what needs
> >> fixing.
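
A minimal sketch of that pattern-finding step, assuming the time, provider and
region of each failing job have already been copied out of the console logs by
hand. The record layout, the `tally_failures` name and the sample values
(`provider-a`, `region-1`, the timestamps) are all made up for illustration;
they are not real log fields or real failure data.

```python
from collections import Counter


def tally_failures(records):
    """Count failures per (provider, region), most frequent first."""
    counts = Counter((r["provider"], r["region"]) for r in records)
    return counts.most_common()


# Placeholder records, not real data.
failures = [
    {"time": "2017-08-10T05:55", "provider": "provider-a", "region": "region-1"},
    {"time": "2017-08-10T07:12", "provider": "provider-b", "region": "region-2"},
    {"time": "2017-08-10T09:40", "provider": "provider-a", "region": "region-1"},
]

for (provider, region), count in tally_failures(failures):
    print(f"{provider}/{region}: {count}")
```

If one provider/region pair dominates the tally, that is the kind of specific
detail worth taking to infra.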
> >> If it's due to issues with remote transfers, we can look at either
> >> adding specific things to mirrors (containers, images, packages are
> >> all things we've added recently) or adding a caching reverse-proxy for
> >> them (the two reviews linked below are some examples).
> >> Questions in #openstack-infra will usually get a helpful response too.
> >> Good luck :)
> >> -i
> >>  https://bugs.launchpad.net/openstack-gate/+bug/1708707/
> >>  https://etherpad.openstack.org/p/centos7-dsvm-triage
> >>  https://review.openstack.org/491800
> >>  https://review.openstack.org/491466
> >> __________________________________________________________________________
> >> OpenStack Development Mailing List (not for usage questions)
> >> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> > --
> > Regards,
> > Rabi Misra
> Rabi Mishra