[openstack-dev] [heat][infra] Help needed! high gate failure rate
iwienand at redhat.com
Thu Aug 10 09:21:19 UTC 2017
On 08/10/2017 06:18 PM, Rico Lin wrote:
> We're facing a high failure rate in Heat's gates: four of our gates have
> had failure rates from 6% to nearly 20% over the past 14 days, which leaves
> most of our patches stuck in the gate.
There has been a confluence of things causing some problems recently.
The loss of OSIC has distributed more load over everything else, and
we have seen an increase in job timeouts and intermittent networking
issues (especially if you're downloading large things from remote
sites). There have also been some issues with the mirror in rax-ord.
> We are still trying to find the cause but (IMO) it seems something might
> be wrong with our infra. We need some help from the infra team: is there
> any clue about this failure rate?
The reality is you're just going to have to triage this and be a *lot*
more specific with issues. I find opening an etherpad and going
through the failures one-by-one helpful (e.g. I keep one for centos
jobs I'm interested in).
Looking at the top of the console.html log you'll have the host and
provider/region stamped in there. If it's timeouts or network issues,
reporting to infra the time, provider and region of the failing jobs
will help. Finding patterns is the first step to understanding what
needs fixing.
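As a rough illustration of that kind of pattern-finding, here is a minimal
sketch that tallies failures by provider, region and failure type. It
assumes you have copied the details out of each console.html by hand while
triaging; the record fields, the "kind" labels and the sample entries are
all made up for the example and do not reflect any real log format or
real failures.

```python
from collections import Counter
from typing import NamedTuple


class Failure(NamedTuple):
    job: str       # job name (hypothetical)
    when: str      # UTC time of the failure, copied from the log
    provider: str  # provider stamped at the top of console.html
    region: str    # region stamped at the top of console.html
    kind: str      # your own triage label, e.g. "timeout" or "network"


def tally(failures):
    """Count failures per (provider, region, kind), most common first."""
    counts = Counter((f.provider, f.region, f.kind) for f in failures)
    return counts.most_common()


# Hypothetical entries, typed in by hand while going through the failures.
failures = [
    Failure("job-a", "2017-08-09 10:02", "provider-x", "region-1", "timeout"),
    Failure("job-b", "2017-08-09 10:40", "provider-x", "region-1", "timeout"),
    Failure("job-a", "2017-08-09 13:15", "provider-y", "region-2", "network"),
]

for (provider, region, kind), n in tally(failures):
    print(f"{n:3d}  {provider}/{region}  {kind}")
```

If one provider/region pair dominates the tally, that (together with the
timestamps) is the sort of specific detail worth reporting to infra.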
If it's due to issues with remote transfers, we can look at either
adding specific things to mirrors (containers, images, packages are
all things we've added recently) or adding a caching reverse-proxy for
them.
Questions in #openstack-infra will usually get a helpful response too.
Good luck :)