[openstack-dev] [kolla] Stability and reliability of gate jobs

Steven Dake (stdake) stdake at cisco.com
Thu Jun 16 12:20:06 UTC 2016


David,

The gates are unreliable for a variety of reasons - some we can fix, some
we can't directly:

1. RDO's rabbitmq package introduced IPv6 support to erlang, which caused
   our gate reliability to drop dramatically.  Prior to this change, our
   gate was running at 95% reliability or better - assuming the code
   wasn't busted.
2. The gate gear differs between providers - meaning a different setup on
   each.  We have been working on debugging these various gate provider
   issues with the infra team, and I think that work is mostly concluded.
3. The gate changed to something called bindep, which has been less
   reliable for us.
4. We do not have mirrors of the CentOS repos - although they are in the
   works.  Mirrors will ensure that images always get built.  At the
   moment many of the gate failures are triggered by build failures (the
   upstream mirrors are too busy).
5. We do not have mirrors of the other 5-10 repos and files we use.  This
   causes more build failures.

Complicating matters, any of these 5 things above can crater a single gate
job - and we run about 15 jobs - which would cause the entire gate to fail
if the jobs were voting.  I really want a voting gate for kolla's jobs.  I
super want it.  The reason we can't make the gates voting at this time is
the sheer unreliability of the gate.
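
To put rough numbers on that, here is a back-of-the-envelope sketch
(assuming, purely for illustration, that every job fails independently
with the same probability - real failures are correlated, so treat it as a
ballpark only):

    # Sketch: how per-job flakiness compounds across a set of gate jobs.
    # Assumes each job fails independently with the same probability,
    # which is a simplification for illustration only.

    def gate_pass_probability(per_job_reliability, num_jobs=15):
        """Probability that all num_jobs jobs pass on a single run."""
        return per_job_reliability ** num_jobs

    for reliability in (0.99, 0.95, 0.90, 0.80):
        print("per-job reliability {:.0%} -> full-gate pass rate {:.0%}"
              .format(reliability, gate_pass_probability(reliability)))

Even at 95% per-job reliability, fewer than half of the runs would pass a
15-job voting gate cleanly, which is why voting stays off until the
individual jobs are solid.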

If anyone is up for a thorough analysis of *why* the gates are failing,
that would help us fix them.
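
If anyone picks this up, a rough starting point might look like the sketch
below.  The failure signatures and the list of console log URLs are
hypothetical placeholders - this is not an existing Kolla tool, just one
possible shape for the analysis:

    # Sketch: classify failed gate runs by scanning their console logs for
    # known failure signatures.  The signatures and URLs below are
    # hypothetical examples, not an authoritative list of root causes.
    from collections import Counter
    from urllib.request import urlopen

    FAILED_CONSOLE_LOGS = [
        # e.g. "http://logs.openstack.org/.../console.html",
    ]

    SIGNATURES = {
        "rabbitmq": "unable to connect to node",     # hypothetical
        "yum mirror": "Cannot retrieve metalink",    # hypothetical
        "image build": "Error building image",       # hypothetical
    }

    def classify(url):
        text = urlopen(url).read().decode("utf-8", "replace")
        for cause, needle in SIGNATURES.items():
            if needle in text:
                return cause
        return "unknown"

    counts = Counter(classify(url) for url in FAILED_CONSOLE_LOGS)
    for cause, count in counts.most_common():
        print("{}: {}".format(cause, count))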

Regards
-steve

On 6/15/16, 3:27 AM, "Paul Bourke" <paul.bourke at oracle.com> wrote:

>Hi David,
>
>I agree with this completely. Gates continue to be a problem for Kolla;
>the reasons have been discussed in the past, but at least for me it's not
>clear what the key issues are.
>
>I've added this item to the agenda for today's IRC meeting (16:00 UTC -
>https://wiki.openstack.org/wiki/Meetings/Kolla). It may help if we can
>brainstorm a list of the most common problems here beforehand.
>
>To kick things off, rabbitmq seems to cause a disproportionate amount of
>issues, and the problems are difficult to diagnose, particularly when
>the only way to debug is to submit "DO NOT MERGE" patch sets over and
>over. Here's an example of a failed centos binary gate from a simple
>patch set I was reviewing this morning:
>http://logs.openstack.org/06/329506/1/check/gate-kolla-dsvm-deploy-centos-binary/3486d03/console.html#_2016-06-14_15_36_19_425413
>
>Cheers,
>-Paul
>
>On 15/06/16 04:26, David Moreau Simard wrote:
>> Hi Kolla o/
>>
>> I'm writing to you because I'm concerned.
>>
>> In case you didn't already know, the RDO community collaborates with
>> upstream deployment and installation projects to test its packaging.
>>
>> This relationship is beneficial in a lot of ways for both parties, in
>>summary:
>> - RDO has improved test coverage (because it's otherwise hard to test
>> different ways of installing, configuring and deploying OpenStack by
>> ourselves)
>> - The RDO community works with upstream projects (deployment or core
>> projects) to fix issues that we find
>> - In return, the collaborating deployment project can feel more
>> confident that the RDO packages it consumes have already been tested
>> using its platform and should work
>>
>> To make a long story short, we do this with a project called WeIRDO
>> [1] which essentially runs gate jobs outside of the gate.
>>
>> I tried to get Kolla in our testing pipeline during the Mitaka cycle.
>> I really did.
>> I contributed the necessary features I needed in Kolla in order to
>> make this work, like the configurable Yum repositories for example.
>>
>> However, in the end, I had to put off the initiative because the gate
>> jobs were very flappy and unreliable.
>> We cannot afford to have a job that is *expected* to flap in our
>> testing pipeline; it leads to a lot of wasted time, effort and
>> resources.
>>
>> I think there have been a lot of improvements since my last attempt,
>> but to get a sample of data, I looked at ~30 recently merged reviews.
>> Of 260 total build/deploy jobs, 55 (or over 20%) failed -- and I
>> didn't account for rechecks, just the last known status of the check
>> jobs.
>> I put up the results of those jobs here [2].
>>
>> In the case that interests me most, CentOS binary jobs, it's 5
>> failures out of 50 jobs, so 10%. Not as bad but still a concern for
>> me.
>>
>> Other deployment projects like Puppet-OpenStack, OpenStack Ansible,
>> Packstack and TripleO have quite a few *voting* integration testing
>> jobs.
>> Why are Kolla's jobs non-voting and so unreliable?
>>
>> Thanks,
>>
>> [1]: https://github.com/rdo-infra/weirdo
>> [2]: https://docs.google.com/spreadsheets/d/1NYyMIDaUnlOD2wWuioAEOhjeVmZe7Q8_zdFfuLjquG4/edit#gid=0
>>
>> David Moreau Simard
>> Senior Software Engineer | Openstack RDO
>>
>> dmsimard = [irc, github, twitter]
>>