[openstack-dev] [kolla] Stability and reliability of gate jobs

Steven Dake (stdake) stdake at cisco.com
Thu Jul 7 13:23:42 UTC 2016



On 7/6/16, 5:50 PM, "Paul Belanger" <pabelanger at redhat.com> wrote:

>On Thu, Jun 16, 2016 at 12:20:06PM +0000, Steven Dake (stdake) wrote:
>> David,
>> 
>> The gates are unreliable for a variety of reasons - some we can fix,
>> some we can't directly address.
>> 
>> RDO rabbitmq introduced IPv6 support to erlang, which caused our gate
>> reliability to drop dramatically.  Prior to this change, our gate was
>> running at 95% reliability or better - assuming the code wasn't busted.
>> The gate gear is different - meaning different setup.  We have been
>> working on debugging all these various gate provider issues with the
>> infra team, and I think that is mostly concluded.
>> The gate changed to something called bindep, which has been less
>> reliable for us.
>
>I would be curious to hear your issues with bindep. A quick look at kolla
>shows you are not using other-requirements.txt yet, so you are using our
>default fallback.txt file. I am unsure how that could be impacting you.
>
>> We do not have mirrors of CentOS repos - although it is in the works.
>> Mirrors will ensure that images always get built.  At the moment many of
>> the gate failures are triggered by build failures (the upstream mirrors
>> we pull from are too busy).
>
>This is no longer the case; openstack-infra is now mirroring both
>centos-7[1] and epel-7[2]. And just this week we brought Ubuntu Cloud
>Archive[3] online. It would be pretty trivial to update kolla to start
>using them.
>
>[1] http://mirror.dfw.rax.openstack.org/centos/7/
>[2] http://mirror.dfw.rax.openstack.org/epel/7/
>[3] http://mirror.dfw.rax.openstack.org/ubuntu-cloud-archive/

Thanks - I was aware that infra had made mirrors available; I have not had
a chance to personally modify the gate to make use of them yet.
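My understanding is that it mostly comes down to swapping the baseurl in
the repo files we lay down during the image build so they point at the
infra mirror instead of the public CentOS mirrors.  Roughly something like
this (untested, and the exact path layout under the mirror is an assumption
on my part):

  [base]
  name=CentOS-7 - Base - openstack-infra mirror
  baseurl=http://mirror.dfw.rax.openstack.org/centos/7/os/$basearch/
  gpgcheck=0
  enabled=1

We would presumably want to pick the per-region mirror hostname up from the
node rather than hard-coding dfw.rax.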

I am not sure if there is an issue with bindep or not.  A whole lot of
things changed at once and our gate went from pretty stable to super
unstable.  One of those things was bindep, but there were a bunch of other
changes.  I wouldn't pin it all on bindep.
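For what it's worth, if we do add an explicit other-requirements.txt to
rule bindep out, my understanding is that it is just a list of distro
packages with optional platform profiles, along these lines (a sketch only,
not a vetted dependency list for kolla):

  # binary build/deploy dependencies, one per line, with per-platform profiles
  gcc
  python-devel [platform:rpm]
  python-dev [platform:dpkg]
  libffi-devel [platform:rpm]
  libffi-dev [platform:dpkg]
  openssl-devel [platform:rpm]
  libssl-dev [platform:dpkg]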
 
>
>> We do not have mirrors of the other 5-10 repos and files we use.  This
>> causes more build failures.
>> 
>We do have the infrastructure in AFS to do this; it would require you to
>write the patch and submit it to openstack-infra so we can bring it
>online.  In fact, the OpenStack Ansible team was responsible for the UCA
>mirror above; I simply did the last 5% to bring it into production.

Wow, that's huge!  I was not aware of this.  Do you have an example patch
which brings a mirror into service?

Thanks
-steve

>
>> Complicating matters, any of these 5 things above can crater one of the
>> roughly 15 gate jobs we run, which would cause the entire gate to fail
>> (if the jobs were voting).  I really want a voting gate for kolla's
>> jobs.  I super want it.  The reason we can't make the gates voting at
>> this time is the sheer unreliability of the gate.
>> 
>> If anyone is up for a thorough analysis of *why* the gates are failing,
>> that would help us fix them.
>> 
>> Regards
>> -steve
>> 
>> On 6/15/16, 3:27 AM, "Paul Bourke" <paul.bourke at oracle.com> wrote:
>> 
>> >Hi David,
>> >
>> >I agree with this completely. Gates continue to be a problem for Kolla;
>> >the reasons why have been discussed in the past, but at least for me
>> >it's not clear what the key issues are.
>> >
>> >I've added this item to the agenda for today's IRC meeting (16:00 UTC -
>> >https://wiki.openstack.org/wiki/Meetings/Kolla). It may help if we can
>> >brainstorm a list of the most common problems here beforehand.
>> >
>> >To kick things off, rabbitmq seems to cause a disproportionate number
>> >of issues, and the problems are difficult to diagnose, particularly
>> >when the only way to debug is to submit "DO NOT MERGE" patch sets over
>> >and over.  Here's an example of a failed centos binary gate from a
>> >simple patch set I was reviewing this morning:
>> >http://logs.openstack.org/06/329506/1/check/gate-kolla-dsvm-deploy-centos-binary/3486d03/console.html#_2016-06-14_15_36_19_425413
>> >
>> >Cheers,
>> >-Paul
>> >
>> >On 15/06/16 04:26, David Moreau Simard wrote:
>> >> Hi Kolla o/
>> >>
>> >> I'm writing to you because I'm concerned.
>> >>
>> >> In case you didn't already know, the RDO community collaborates with
>> >> upstream deployment and installation projects to test its packaging.
>> >>
>> >> This relationship is beneficial in a lot of ways for both parties, in
>> >>summary:
>> >> - RDO has improved test coverage (because it's otherwise hard to test
>> >> different ways of installing, configuring and deploying OpenStack by
>> >> ourselves)
>> >> - The RDO community works with upstream projects (deployment or core
>> >> projects) to fix issues that we find
>> >> - In return, the collaborating deployment project can feel more
>> >> confident that the RDO packages it consumes have already been tested
>> >> using its platform and should work
>> >>
>> >> To make a long story short, we do this with a project called WeIRDO
>> >> [1] which essentially runs gate jobs outside of the gate.
>> >>
>> >> I tried to get Kolla in our testing pipeline during the Mitaka cycle.
>> >> I really did.
>> >> I contributed the necessary features I needed in Kolla in order to
>> >> make this work, like the configurable Yum repositories for example.
>> >>
>> >> However, in the end, I had to put off the initiative because the gate
>> >> jobs were very flappy and unreliable.
>> >> We cannot afford to have a job that is *expected* to flap in our
>> >> testing pipeline; it leads to a lot of wasted time, effort and
>> >> resources.
>> >>
>> >> I think there have been a lot of improvements since my last attempt,
>> >> but to get a sample of data I looked at ~30 recently merged reviews.
>> >> Of 260 total build/deploy jobs, 55 (or over 20%) failed -- and I
>> >> didn't account for rechecks, just the last known status of the check
>> >> jobs.
>> >> I put up the results of those jobs here [2].
>> >>
>> >> In the case that interests me most, CentOS binary jobs, it's 5
>> >> failures out of 50 jobs, so 10%. Not as bad but still a concern for
>> >> me.
>> >>
>> >> Other deployment projects like Puppet-OpenStack, OpenStack Ansible,
>> >> Packstack and TripleO have quite a few *voting* integration testing
>> >> jobs.
>> >> Why are Kolla's jobs non-voting and so unreliable?
>> >>
>> >> Thanks,
>> >>
>> >> [1]: https://github.com/rdo-infra/weirdo
>> >> [2]: https://docs.google.com/spreadsheets/d/1NYyMIDaUnlOD2wWuioAEOhjeVmZe7Q8_zdFfuLjquG4/edit#gid=0
>> >>
>> >> David Moreau Simard
>> >> Senior Software Engineer | Openstack RDO
>> >>
>> >> dmsimard = [irc, github, twitter]
>> >>