[openstack-dev] [kolla] Stability and reliability of gate jobs

Paul Bourke paul.bourke at oracle.com
Wed Jun 15 10:27:09 UTC 2016

Hi David,

I agree with this completely. Gates continue to be a problem for Kolla. 
The reasons why have been discussed in the past, but at least for me it's 
not clear what the key issues are.

I've added this item to the agenda for today's IRC meeting (16:00 UTC - 
https://wiki.openstack.org/wiki/Meetings/Kolla). It may help if we 
brainstorm a list of the most common problems here beforehand.

To kick things off, rabbitmq seems to cause a disproportionate number of 
issues, and the problems are difficult to diagnose, particularly when 
the only way to debug is to submit "DO NOT MERGE" patch sets over and 
over. Here's an example of a failed centos binary gate from a simple 
patch set I was reviewing this morning: 


On 15/06/16 04:26, David Moreau Simard wrote:
> Hi Kolla o/
> I'm writing to you because I'm concerned.
> In case you didn't already know, the RDO community collaborates with
> upstream deployment and installation projects to test its packaging.
> This relationship is beneficial in a lot of ways for both parties, in summary:
> - RDO has improved test coverage (because it's otherwise hard to test
> different ways of installing, configuring and deploying OpenStack by
> ourselves)
> - The RDO community works with upstream projects (deployment or core
> projects) to fix issues that we find
> - In return, the collaborating deployment project can feel more
> confident that the RDO packages it consumes have already been tested
> using its platform and should work
> To make a long story short, we do this with a project called WeIRDO
> [1] which essentially runs gate jobs outside of the gate.
> I tried to get Kolla in our testing pipeline during the Mitaka cycle.
> I really did.
> I contributed the necessary features I needed in Kolla in order to
> make this work, like the configurable Yum repositories for example.
> However, in the end, I had to put off the initiative because the gate
> jobs were very flappy and unreliable.
> We cannot afford to have a job that is *expected* to flap in our
> testing pipeline, it leads to a lot of wasted time, effort and
> resources.
> I think there's been a lot of improvements since my last attempt but
> to get a sample of data, I looked at ~30 recently merged reviews.
> Of 260 total build/deploy jobs, 55 (or over 20%) failed -- and I
> didn't account for rechecks, just the last known status of the check
> jobs.
> I put up the results of those jobs here [2].
> In the case that interests me most, CentOS binary jobs, it's 5
> failures out of 50 jobs, so 10%. Not as bad but still a concern for
> me.
> Other deployment projects like Puppet-OpenStack, OpenStack Ansible,
> Packstack and TripleO have quite a bit of *voting* integration testing
> jobs.
> Why are Kolla's jobs non-voting and so unreliable?
> Thanks,
> [1]: https://github.com/rdo-infra/weirdo
> [2]: https://docs.google.com/spreadsheets/d/1NYyMIDaUnlOD2wWuioAEOhjeVmZe7Q8_zdFfuLjquG4/edit#gid=0
> David Moreau Simard
> Senior Software Engineer | Openstack RDO
> dmsimard = [irc, github, twitter]
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
