[openstack-dev] [kolla] Stability and reliability of gate jobs

Steven Dake (stdake) stdake at cisco.com
Wed Jul 6 21:09:25 UTC 2016


David,

Thanks for the feedback.  We know we have more work to do on our
integration gate.  It is a matter of finding people that have been trained
on gating development to do gate work.

Regards
-steve

On 7/4/16, 12:39 PM, "David Moreau Simard" <dms at redhat.com> wrote:

>I mentioned this on IRC to some extent but I'm going to post it here
>for posterity.
>
>I think we can all agree that Integration tests are pretty darn
>important and I'm convinced I don't need to remind you why.
>I'm going to re-iterate that I am very concerned about the state of
>the jobs but also their coverage.
>
>Kolla provides an implementation for a lot of the big tent projects
>but they are not properly (if at all) tested in the gate.
>Only the core services are tested in an "all-in-one" fashion and if a
>commit happens to break a project that isn't tested in that all-in-one
>test, no one will know about it.
>
>This is very dangerous territory -- you can't guarantee that what
>Kolla supports really works on every commit.
>Both Packstack [1] and Puppet-OpenStack [2] have an extensive matrix
>of test coverage across different jobs and different operating systems
>to work around the memory constraints of the gate virtual machines.
>They test themselves with their project implementations in different
>ways (e.g., glance with file, glance with swift, cinder with lvm,
>cinder with ceph, neutron with ovs, neutron with linuxbridge, etc.)
>and do so successfully.
>
>I don't see why Kolla should be different if it is to be taken seriously.
>My apologies if it feels I am being harsh - I am being open and honest
>about Kolla's loss of credibility from my perspective.
>
>I've put my attempts to put Kolla in RDO's testing pipeline on hold
>for the Newton cycle.
>I hope we can straighten out all of this -- I care about Kolla and I
>want it to succeed, which is why I started this thread in the first
>place.
>
>While I don't really have the bandwidth to contribute to Kolla, I hope
>you can at least consider my feedback and you can also find me on IRC
>if you have questions.
>
>[1]: https://github.com/openstack/packstack#packstack-integration-tests
>[2]: https://github.com/openstack/puppet-openstack-integration#description
>
>David Moreau Simard
>Senior Software Engineer | Openstack RDO
>
>dmsimard = [irc, github, twitter]
>
>
>On Thu, Jun 16, 2016 at 8:20 AM, Steven Dake (stdake) <stdake at cisco.com>
>wrote:
>> David,
>>
>> The gates are unreliable for a variety of reasons - some we can fix, some
>> we can't directly.
>>
>> 1. RDO rabbitmq introduced IPv6 support to erlang, which caused our gate
>>    reliability to drop dramatically.  Prior to this change, our gate was
>>    running at 95% reliability or better - assuming the code wasn't busted.
>> 2. The gate gear is different - meaning different setup.  We have been
>>    working on debugging all these various gate provider issues with the
>>    infra team and I think that is mostly concluded.
>> 3. The gate changed to something called bindep, which has been less
>>    reliable for us.
>> 4. We do not have mirrors of CentOS repos - although it is in the works.
>>    Mirrors will ensure that images always get built.  At the moment many of
>>    the gate failures are triggered by build failures (the mirrors are too
>>    busy).
>> 5. We do not have mirrors of the other 5-10 repos and files we use.  This
>>    causes more build failures.
>>
>> Complicating matters, any of these 5 things above can crater one gate job,
>> of which we run about 15, which causes the entire gate to fail (if
>> they were voting).  I really want a voting gate for kolla's jobs.  I super
>> want it.  The reason we can't make the gates voting at this time is
>> because of the sheer unreliability of the gate.
>>
>> If anyone is up for a thorough analysis of *why* the gates are failing,
>> that would help us fix them.
>>
>> Regards
>> -steve
>>
>> On 6/15/16, 3:27 AM, "Paul Bourke" <paul.bourke at oracle.com> wrote:
>>
>>>Hi David,
>>>
>>>I agree with this completely. Gates continue to be a problem for Kolla.
>>>The reasons have been discussed in the past, but at least for me it's not
>>>clear what the key issues are.
>>>
>>>I've added this item to the agenda for today's IRC meeting (16:00 UTC -
>>>https://wiki.openstack.org/wiki/Meetings/Kolla). It may help if we can
>>>brainstorm a list of the most common problems here beforehand.
>>>
>>>To kick things off, rabbitmq seems to cause a disproportionate number of
>>>issues, and the problems are difficult to diagnose, particularly when
>>>the only way to debug is to submit "DO NOT MERGE" patch sets over and
>>>over. Here's an example of a failed centos binary gate from a simple
>>>patch set I was reviewing this morning:
>>>http://logs.openstack.org/06/329506/1/check/gate-kolla-dsvm-deploy-centos-binary/3486d03/console.html#_2016-06-14_15_36_19_425413
>>>
>>>Cheers,
>>>-Paul
>>>
>>>On 15/06/16 04:26, David Moreau Simard wrote:
>>>> Hi Kolla o/
>>>>
>>>> I'm writing to you because I'm concerned.
>>>>
>>>> In case you didn't already know, the RDO community collaborates with
>>>> upstream deployment and installation projects to test its packaging.
>>>>
>>>> This relationship is beneficial in a lot of ways for both parties, in
>>>> summary:
>>>> - RDO has improved test coverage (because it's otherwise hard to test
>>>> different ways of installing, configuring and deploying OpenStack by
>>>> ourselves)
>>>> - The RDO community works with upstream projects (deployment or core
>>>> projects) to fix issues that we find
>>>> - In return, the collaborating deployment project can feel more
>>>> confident that the RDO packages it consumes have already been tested
>>>> using its platform and should work
>>>>
>>>> To make a long story short, we do this with a project called WeIRDO
>>>> [1] which essentially runs gate jobs outside of the gate.
>>>>
>>>> I tried to get Kolla in our testing pipeline during the Mitaka cycle.
>>>> I really did.
>>>> I contributed the necessary features I needed in Kolla in order to
>>>> make this work, like the configurable Yum repositories for example.
>>>>
>>>> However, in the end, I had to put off the initiative because the gate
>>>> jobs were very flappy and unreliable.
>>>> We cannot afford to have a job that is *expected* to flap in our
>>>> testing pipeline, it leads to a lot of wasted time, effort and
>>>> resources.
>>>>
>>>> I think there have been a lot of improvements since my last attempt but
>>>> to get a sample of data, I looked at ~30 recently merged reviews.
>>>> Of 260 total build/deploy jobs, 55 (or over 20%) failed -- and I
>>>> didn't account for rechecks, just the last known status of the check
>>>> jobs.
>>>> I put up the results of those jobs here [2].
>>>>
>>>> In the case that interests me most, CentOS binary jobs, it's 5
>>>> failures out of 50 jobs, so 10%. Not as bad but still a concern for
>>>> me.
>>>>
>>>> Other deployment projects like Puppet-OpenStack, OpenStack Ansible,
>>>> Packstack and TripleO have quite a few *voting* integration testing
>>>> jobs.
>>>> Why are Kolla's jobs non-voting and so unreliable?
>>>>
>>>> Thanks,
>>>>
>>>> [1]: https://github.com/rdo-infra/weirdo
>>>> [2]: https://docs.google.com/spreadsheets/d/1NYyMIDaUnlOD2wWuioAEOhjeVmZe7Q8_zdFfuLjquG4/edit#gid=0
>>>>
>>>> David Moreau Simard
>>>> Senior Software Engineer | Openstack RDO
>>>>
>>>> dmsimard = [irc, github, twitter]
>>>>
>>>>
>>>> __________________________________________________________________________
>>>> OpenStack Development Mailing List (not for usage questions)
>>>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>>
>>>
>>
>>
>> 
>
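
As a rough illustration of the figures quoted in this thread (this is an
added sketch, not something any of the posters wrote), the snippet below
recomputes the failure rates from David's sample and shows how per-job
reliability compounds when about 15 jobs must all pass.  Treating the 95%
figure as a per-job pass rate and treating job failures as independent are
both assumptions made purely for illustration; the function names are
hypothetical.

```python
def failure_rate(failed: int, total: int) -> float:
    """Return the fraction of jobs that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


def gate_pass_probability(per_job_pass: float, jobs: int) -> float:
    """Probability that every job passes.

    ASSUMPTION: job failures are independent of each other, which the
    thread does not claim and which is unlikely to hold exactly.
    """
    return per_job_pass ** jobs


if __name__ == "__main__":
    # Numbers quoted by David: 55 of 260 jobs overall, 5 of 50 CentOS binary.
    overall = failure_rate(55, 260)
    centos = failure_rate(5, 50)
    print(f"all build/deploy jobs: {overall:.1%} failed")
    print(f"CentOS binary jobs:    {centos:.1%} failed")

    # ASSUMPTION: 95% is read as a per-job pass rate across ~15 jobs.
    all_pass = gate_pass_probability(0.95, 15)
    print(f"15 jobs at 95% each:   {all_pass:.1%} chance all pass")
```

Under these assumptions, a per-job pass rate that looks healthy on its own
still leaves the combined gate failing more often than it passes, which is
consistent with the concern about making all of the jobs voting.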


