[openstack-dev] [tripleo] Gate is broken - Do not approve any patch until further notice

Paul Belanger pabelanger at redhat.com
Wed Aug 30 12:33:09 UTC 2017


On Wed, Aug 30, 2017 at 11:31:14AM +0200, Bogdan Dobrelya wrote:
> On 30.08.2017 6:54, Emilien Macchi wrote:
> > On Tue, Aug 29, 2017 at 4:17 PM, Emilien Macchi <emilien at redhat.com> wrote:
> >> We are currently dealing with 4 issues and until they are fixed, please
> >> do not approve any patch. We want to keep the gate clear to merge the
> >> fixes for the 4 problems first.
> >>
> >> 1) devstack-gate broke us because we use it as a library (bad)
> >> https://bugs.launchpad.net/tripleo/+bug/1713868
> >>
> >> 2) https://review.openstack.org/#/c/474578/ broke us and we're
> >> reverting it https://bugs.launchpad.net/tripleo/+bug/1713832
> >>
> >> 3) We shouldn't build images on multinode jobs
> >> https://bugs.launchpad.net/tripleo/+bug/1713167
> >>
> >> 4) We should use pip instead of git for delorean
> >> https://bugs.launchpad.net/tripleo/+bug/1708832
> >>
> >>
> >> Until further notice from Alex or myself, please do not approve any patch.
> > 
> > The 4 problems have been mitigated.
> > You can now proceed to normal review.
> > 
> > Please do not recheck a patch without an elastic-recheck comment; we
> > need to track all issues related to CI from now on.
> > Paul Belanger has been doing extremely useful work to help us, now
> > let's use elastic-recheck more and stop blind rechecks.
> > All known issues are in http://status.openstack.org/elastic-recheck/
> > If one is missing, you're welcome to contribute by sending a patch to
> > elastic-recheck. Example with https://review.openstack.org/#/c/498954/
> 
> That's a great example! Let me follow up on that and share my beginner's
> experience as well.
> 
> Let's help with improving elastic-recheck queries to identify those
> unknown or new failures, this is really important. This also trains
> domain knowledge for particular areas, either openstack or *-infra, or
> tripleo specific.
> 
> As beginners, we could start with watching for failing tripleo-ci
> periodic [0],[1] (available as RSS feeds) and gate jobs without e-r
> comments, also from that page [2].
> 
> Then fetching the logs locally with tools like getthelogs [3], or
> looking into the logs.openstack.org directly, if advanced beginners wish so.
> 
> Finally, identifying discovered (just do some grep, like I do with my
> tool [4]) errorish patterns and helping with root cause analysis. And,
> ideally, submitting new e-r queries (see also [5]) and corresponding lp
> bugs. And absolutely ideally, help with addressing those as well. This
> might be hard though as we may not be experts in some of the areas. Some
> of the error messages would literally mean nothing to us. For me, the
> most  But as the best effort, we could invite the right persons to
> look into that, or at least ask folks on #tripleo or #openstack-infra.
> 
> [0]
> http://status.openstack.org/openstack-health/#/g/project/openstack-infra~2Ftripleo-ci
> [1]
> http://status.openstack.org/openstack-health/#/g/project/openstack~2Ftripleo-quickstart
> [2] http://status.openstack.org/elastic-recheck/data/others.html
> [3] https://review.openstack.org/#/c/492178/
> [4] https://github.com/bogdando/fuel-log-parse/blob/master/fuel-log-parse.sh
> [5]
> https://docs.openstack.org/infra/elastic-recheck/readme.html#running-queries-locally
> 
> > 
> > I've restored all patches that were killed from the gate and did
> > recheck already, hopefully we can get some merges and finish this
> > release.
> > 
> > Thanks Paul and all Infra for their consistent help!
> > 
> 
Indeed, this looks much better this morning! Thanks to everybody for jumping on
the fixes.

Regarding Bug 1713832 - Object PUT failed for zaqar_subscription[1], for which
the offending change was reverted last night: that is a great example to
showcase elastic-recheck. If you look back at the logstash queries, you can see
the signs pointing to an issue, but unfortunately it wasn't picked up until
yesterday.

The info above from Bogdan is great. The general idea is: if a job fails in the
check pipeline and elastic-recheck doesn't leave a comment, it is likely a new
failure. Moving forward, we need to keep blind rechecks to a minimum, as each
one has the potential to break the gate down the road.
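
As a rough illustration of the triage step described above (grepping a fetched
job log for error-ish patterns before drafting an elastic-recheck query), here
is a minimal Python sketch. It is not part of elastic-recheck or getthelogs;
the pattern list, the timestamp format and the function names are all
assumptions made for the example and will need tuning for real logs.

```python
import re
from collections import Counter

# Assumed, hypothetical pattern list; adjust it for the logs you are reading.
ERROR_PATTERN = re.compile(r"\b(ERROR|CRITICAL|Traceback|FAILED)\b")

# Assumed timestamp prefix ("YYYY-MM-DD HH:MM:SS[.mmm] | "). It is stripped so
# that repeated occurrences of the same message are counted together.
TIMESTAMP = re.compile(
    r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?\s*(?:\|\s*)?"
)


def errorish_lines(lines):
    """Count error-ish messages in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        if ERROR_PATTERN.search(line):
            counts[TIMESTAMP.sub("", line.strip())] += 1
    return counts


def top_candidates(lines, limit=5):
    """Return the most frequent error-ish messages as (message, count) pairs."""
    return errorish_lines(lines).most_common(limit)
```

With a log already downloaded, something like
`top_candidates(open("console.txt", errors="replace"))` gives a short list of
messages to compare against the known bugs before deciding whether the failure
is new. The file name here is only a placeholder.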

This is why you see tripleo jobs running upwards of 16 hours on status.o.o/zuul:
there was a job failure, and we had to rerun all patches again.

Keep up the good work; I look forward to talking more about this at the PTG.

[1] http://status.openstack.org/elastic-recheck/#1713832


