[openstack-dev] [kolla] Stability and reliability of gate jobs

Paul Belanger pabelanger at redhat.com
Thu Jul 7 00:50:22 UTC 2016


On Thu, Jun 16, 2016 at 12:20:06PM +0000, Steven Dake (stdake) wrote:
> David,
> 
> The gates are unreliable for a variety of reasons - some we can fix - some
> we can't directly.
> 
> RDO rabbitmq introduced IPv6 support to erlang, which caused our gate
> reliability to drop dramatically.  Prior to this change, our gate was running
> at 95% reliability or better - assuming the code wasn't busted.
> The gate gear differs between providers - meaning different setups.  We have
> been working with the infra team on debugging the various gate provider
> issues and I think that work is mostly concluded.
> The gate changed to something called bindep, which has been less reliable
> for us.

I would be curious to hear your issues with bindep. A quick look at kolla shows
you are not using an other-requirements.txt yet, so you are using our default
fallback.txt file. I am unsure how that could be impacting you.
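
For reference, an other-requirements.txt is just a plain list of distro
packages with optional profile selectors, so opting out of the fallback is
roughly something like this (package names below are purely illustrative,
not kolla's actual build deps):

    # other-requirements.txt - consumed by bindep in the gate
    gcc
    # Debian/Ubuntu package names
    libffi-dev [platform:dpkg]
    libssl-dev [platform:dpkg]
    # CentOS/RHEL package names
    libffi-devel [platform:rpm]
    openssl-devel [platform:rpm]

bindep then only installs the entries whose profile matches the node, so the
same file covers both your ubuntu and centos jobs.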

> We do not have mirrors of CentOS repos - although it is in the works.
> Mirrors will ensure that images always get built.  At the moment many of
> the gate failures are triggered by build failures (the mirrors are too
> busy).

This is no longer the case: openstack-infra is now mirroring both centos-7 [1]
and epel-7 [2], and just this week we brought the Ubuntu Cloud Archive [3]
online. It would be pretty trivial to update kolla to start using them; a
rough sketch follows the links below.

[1] http://mirror.dfw.rax.openstack.org/centos/7/
[2] http://mirror.dfw.rax.openstack.org/epel/7/
[3] http://mirror.dfw.rax.openstack.org/ubuntu-cloud-archive/
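
Wiring those in is mostly a matter of pointing the repo definitions kolla uses
during image builds at the infra mirror instead of the public ones. A rough
sketch for the CentOS base repo - illustrative only, and note the mirror
hostname varies per provider region, so a gate job should discover it rather
than hard-code dfw.rax:

    [base]
    name=CentOS-7 - Base - openstack-infra mirror
    baseurl=http://mirror.dfw.rax.openstack.org/centos/7/os/$basearch/
    gpgcheck=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7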

> We do not have mirrors of the other 5-10 repos and files we use.  This
> causes more build failures.
> 
We do have the infrastructure in AFS to do this; it would require you to write
the patch and submit it to openstack-infra so we can bring it online.  In fact,
the OpenStack Ansible team was responsible for the UCA mirror above; I simply
did the last 5% to bring it into production.
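
If someone wants to pick one of those up, the mirrors are AFS volumes that a
periodic job syncs from upstream and then publishes. The real scripts live in
openstack-infra's system-config repo and handle locking, verification, etc.,
but the general shape is roughly this (hostnames and volume names below are
made up for illustration):

    rsync -rltz --delete \
        rsync://mirror.example.org/some-repo/ \
        /afs/.openstack.org/mirror/some-repo/
    vos release mirror.some-repo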

> Complicating matters, any of these 5 things above can crater one of the
> roughly 15 gate jobs we run, which would cause the entire gate to fail (if
> the jobs were voting).  I really want a voting gate for kolla's jobs.  I super
> want it.  The reason we can't make the gates voting at this time is
> because of the sheer unreliability of the gate.
> 
> If anyone is up for a thorough analysis of *why* the gates are failing,
> that would help us fix them.
> 
> Regards
> -steve
> 
> On 6/15/16, 3:27 AM, "Paul Bourke" <paul.bourke at oracle.com> wrote:
> 
> >Hi David,
> >
> >I agree with this completely. Gates continue to be a problem for Kolla;
> >the reasons why have been discussed in the past, but at least for me it's
> >not clear what the key issues are.
> >
> >I've added this item to the agenda for today's IRC meeting (16:00 UTC -
> >https://wiki.openstack.org/wiki/Meetings/Kolla). It may help if we can
> >brainstorm a list of the most common problems here beforehand.
> >
> >To kick things off, rabbitmq seems to cause a disproportionate number of
> >issues, and the problems are difficult to diagnose, particularly when
> >the only way to debug is to submit "DO NOT MERGE" patch sets over and
> >over. Here's an example of a failed centos binary gate from a simple
> >patch set I was reviewing this morning:
> >http://logs.openstack.org/06/329506/1/check/gate-kolla-dsvm-deploy-centos-binary/3486d03/console.html#_2016-06-14_15_36_19_425413
> >
> >Cheers,
> >-Paul
> >
> >On 15/06/16 04:26, David Moreau Simard wrote:
> >> Hi Kolla o/
> >>
> >> I'm writing to you because I'm concerned.
> >>
> >> In case you didn't already know, the RDO community collaborates with
> >> upstream deployment and installation projects to test its packaging.
> >>
> >> This relationship is beneficial in a lot of ways for both parties, in
> >>summary:
> >> - RDO has improved test coverage (because it's otherwise hard to test
> >> different ways of installing, configuring and deploying OpenStack by
> >> ourselves)
> >> - The RDO community works with upstream projects (deployment or core
> >> projects) to fix issues that we find
> >> - In return, the collaborating deployment project can feel more
> >> confident that the RDO packages it consumes have already been tested
> >> with its platform and should work
> >>
> >> To make a long story short, we do this with a project called WeIRDO
> >> [1] which essentially runs gate jobs outside of the gate.
> >>
> >> I tried to get Kolla in our testing pipeline during the Mitaka cycle.
> >> I really did.
> >> I contributed the features I needed in Kolla to make this work, such
> >> as the configurable Yum repositories.
> >>
> >> However, in the end, I had to put off the initiative because the gate
> >> jobs were very flappy and unreliable.
> >> We cannot afford to have a job that is *expected* to flap in our
> >> testing pipeline; it leads to a lot of wasted time, effort and
> >> resources.
> >>
> >> I think there have been a lot of improvements since my last attempt,
> >> but to get a sample of data I looked at ~30 recently merged reviews.
> >> Of 260 total build/deploy jobs, 55 (or over 20%) failed -- and I
> >> didn't account for rechecks, just the last known status of the check
> >> jobs.
> >> I put up the results of those jobs here [2].
> >>
> >> In the case that interests me most, CentOS binary jobs, it's 5
> >> failures out of 50 jobs, so 10%. Not as bad but still a concern for
> >> me.
> >>
> >> Other deployment projects like Puppet-OpenStack, OpenStack Ansible,
> >> Packstack and TripleO have quite a few *voting* integration test jobs.
> >> Why are Kolla's jobs non-voting and so unreliable?
> >>
> >> Thanks,
> >>
> >> [1]: https://github.com/rdo-infra/weirdo
> >> [2]: https://docs.google.com/spreadsheets/d/1NYyMIDaUnlOD2wWuioAEOhjeVmZe7Q8_zdFfuLjquG4/edit#gid=0
> >>
> >> David Moreau Simard
> >> Senior Software Engineer | Openstack RDO
> >>
> >> dmsimard = [irc, github, twitter]
> >>