[openstack-dev] [heat][infra] Help needed! high gate failure rate

Rabi Mishra ramishra at redhat.com
Thu Aug 10 13:52:42 UTC 2017


On Thu, Aug 10, 2017 at 4:34 PM, Rabi Mishra <ramishra at redhat.com> wrote:

> On Thu, Aug 10, 2017 at 2:51 PM, Ian Wienand <iwienand at redhat.com> wrote:
>
>> On 08/10/2017 06:18 PM, Rico Lin wrote:
>> > We're facing a high failure rate in Heat's gates [1]: four of our gate
>> > jobs have had failure rates from 6% to nearly 20% over the last 14 days,
>> > which leaves most of our patches stuck in the gate.
>>
>> There has been a confluence of things causing some problems recently.
>> The loss of OSIC has distributed more load over everything else, and
>> we have seen an increase in job timeouts and intermittent networking
>> issues (especially if you're downloading large things from remote
>> sites).  There have also been some issues with the mirror in rax-ord
>> [1].
>>
>> > gate-heat-dsvm-functional-convg-mysql-lbaasv2-ubuntu-xenial(19.67%)
>> > gate-heat-dsvm-functional-convg-mysql-lbaasv2-non-apache-ubuntu-xenia(9.09%)
>> > gate-heat-dsvm-functional-orig-mysql-lbaasv2-ubuntu-xenial(8.47%)
>> > gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial(6.00%)
>>
>> > We're still trying to find the cause, but (IMO) it seems something might
>> > be wrong with our infra. We need some help from the infra team: is there
>> > any clue about this failure rate?
>>
>> The reality is you're just going to have to triage this and be a *lot*
>> more specific with issues.
>
>
> One of the issues we've seen recently is that many jobs are killed midway
> through the tests when the job times out (120 mins).  It seems jobs are
> often scheduled to very slow nodes, where setting up devstack takes more
> than 80 mins [1].
>
> [1] http://logs.openstack.org/49/492149/2/check/gate-heat-dsvm-functional-orig-mysql-lbaasv2-ubuntu-xenial/03b05dd/console.html#_2017-08-10_05_55_49_035693
>
> We download an image from a Fedora mirror and it seems to take more than
> 1hr.

http://logs.openstack.org/41/484741/7/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/a797010/logs/devstacklog.txt.gz#_2017-08-10_04_13_14_400

Probably an issue with the specific mirror or some infra network bandwidth
issue. I've submitted a patch to change the mirror to see if that helps.
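
To spot where the time goes in a log like the one above, a rough sketch
along these lines may help. It assumes each log line starts with a
"YYYY-MM-DD HH:MM:SS.mmm | " prefix (an assumption about the log format,
not something confirmed in this thread); parse_line() and slowest_steps()
are hypothetical helper names.

```python
from datetime import datetime


def parse_line(line):
    """Split one log line into (timestamp, message), or return None.

    Assumed format: "YYYY-MM-DD HH:MM:SS.mmm | message". Adjust this
    function if the real log uses a different prefix.
    """
    stamp, sep, message = line.partition(" | ")
    if not sep:
        return None
    try:
        when = datetime.strptime(stamp.strip(), "%Y-%m-%d %H:%M:%S.%f")
    except ValueError:
        return None
    return when, message.rstrip("\n")


def slowest_steps(lines, top=3):
    """Return (seconds, message) pairs for the longest pauses in the log.

    The pause attributed to a line is the time until the next
    timestamped line, so a long download shows up against the line
    that started it.
    """
    parsed = [p for p in map(parse_line, lines) if p is not None]
    gaps = []
    for (start, message), (end, _next_message) in zip(parsed, parsed[1:]):
        gaps.append(((end - start).total_seconds(), message))
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]
```

Feeding it the lines of an uncompressed devstack log, e.g.
slowest_steps(open(path), top=5), would list the five steps followed by
the longest silences, which should make a slow image fetch stand out.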


>> I find opening an etherpad and going
>> through the failures one-by-one helpful (e.g. I keep [2] for centos
>> jobs I'm interested in).
>>
>> Looking at the top of the console.html log you'll have the host and
>> provider/region stamped in there.  If it's timeouts or network issues,
>> reporting to infra the time, provider and region of the failing jobs
>> will help.  Finding patterns is the first step to understanding what
>> needs fixing.
>>
>> If it's due to issues with remote transfers, we can look at either
>> adding specific things to mirrors (containers, images, packages are
>> all things we've added recently) or adding a caching reverse-proxy for
>> them ([3],[4] some examples).
>>
>> Questions in #openstack-infra will usually get a helpful response too
>>
>> Good luck :)
>>
>> -i
>>
>> [1] https://bugs.launchpad.net/openstack-gate/+bug/1708707/
>> [2] https://etherpad.openstack.org/p/centos7-dsvm-triage
>> [3] https://review.openstack.org/491800
>> [4] https://review.openstack.org/491466
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
>
> --
> Regards,
> Rabi Mishra
>
>
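
Following the suggestion above to report the time, provider and region
of failing jobs, a minimal sketch for tallying failures once they have
been collected by hand (for instance in an etherpad) could look like
this. The record fields "provider" and "region" and the function name
tally_failures() are hypothetical, and the values would have to be
copied from the top of each console.html.

```python
from collections import Counter


def tally_failures(failures):
    """Count failures per (provider, region) pair, most frequent first.

    Each item in `failures` is a dict with hypothetical "provider" and
    "region" keys, filled in by hand from the job's console log.
    """
    counts = Counter((f["provider"], f["region"]) for f in failures)
    return counts.most_common()
```

If one provider/region pair dominates the result, that is the kind of
pattern worth bringing to #openstack-infra together with the job times.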


-- 
Regards,
Rabi Mishra