[openstack-dev] [kolla] Stability and reliability of gate jobs

David Moreau Simard dms at redhat.com
Mon Jul 4 19:39:09 UTC 2016


I mentioned this on IRC to some extent but I'm going to post it here
for posterity.

I think we can all agree that integration tests are pretty darn
important and I'm convinced I don't need to remind you why.
I'm going to reiterate that I am very concerned not only about the state
of the gate jobs but also about their coverage.

Kolla provides an implementation for a lot of the big tent projects,
but they are not properly (if at all) tested in the gate.
Only the core services are tested in an "all-in-one" fashion and if a
commit happens to break a project that isn't tested in that all-in-one
test, no one will know about it.

This is very dangerous territory -- you can't guarantee that what
Kolla supports really works on every commit.
Both Packstack [1] and Puppet-OpenStack [2] have an extensive matrix
of test coverage across different jobs and different operating systems
to work around the memory constraints of the gate virtual machines.
They test themselves with their project implementations in different
ways (e.g., glance with file, glance with swift, cinder with lvm,
cinder with ceph, neutron with ovs, neutron with linuxbridge, etc.)
and do so successfully.
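
To make that concrete, here is a rough sketch in Python of what such a
scenario matrix looks like. It is purely illustrative -- the scenario
names, backend groupings and OS list below are made up by me, not
Packstack's or Puppet-OpenStack's actual job definitions:

    import itertools

    # Hypothetical scenario matrix: each scenario exercises a different
    # subset of backends so no single gate VM has to host everything at once.
    SCENARIOS = {
        "scenario001": {"glance": "file", "cinder": "lvm", "neutron": "ovs"},
        "scenario002": {"glance": "swift", "cinder": "ceph", "neutron": "ovs"},
        "scenario003": {"glance": "file", "cinder": "ceph", "neutron": "linuxbridge"},
    }
    DISTROS = ["centos-7", "ubuntu-xenial"]  # hypothetical OS coverage

    # One gate job per (scenario, distro) pair.
    for (name, backends), distro in itertools.product(sorted(SCENARIOS.items()), DISTROS):
        print("gate-%s-%s: %s" % (name, distro, backends))

Splitting coverage across jobs like this keeps each job small enough to
fit in the gate VMs' memory while still exercising every supported
backend on every commit.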

I don't see why Kolla should be different if it is to be taken seriously.
My apologies if it feels like I am being harsh - I am being open and honest
about Kolla's loss of credibility from my perspective.

I've put my efforts to get Kolla into RDO's testing pipeline on hold
for the Newton cycle.
I hope we can straighten out all of this -- I care about Kolla and I
want it to succeed, which is why I started this thread in the first
place.

While I don't really have the bandwidth to contribute to Kolla, I hope
you can at least consider my feedback; you can also find me on IRC
if you have questions.

[1]: https://github.com/openstack/packstack#packstack-integration-tests
[2]: https://github.com/openstack/puppet-openstack-integration#description

David Moreau Simard
Senior Software Engineer | Openstack RDO

dmsimard = [irc, github, twitter]


On Thu, Jun 16, 2016 at 8:20 AM, Steven Dake (stdake) <stdake at cisco.com> wrote:
> David,
>
> The gates are unreliable for a variety of reasons - some we can fix - some
> we can't directly.
>
> RDO rabbitmq introduced IPv6 support to erlang, which caused our gate
> reliability to drop dramatically.  Prior to this change, our gate was running
> at 95% reliability or better - assuming the code wasn't busted.
> The gate gear differs between providers - meaning a different setup on each.
> We have been working on debugging all these various gate provider issues
> with the infra team and I think that is mostly concluded.
> The gate changed to something called bindep, which has been less reliable
> for us.
> We do not have mirrors of CentOS repos - although it is in the works.
> Mirrors will ensure that images always get built.  At the moment many of
> the gate failures are triggered by build failures (the mirrors are too
> busy).
> We do not have mirrors of the other 5-10 repos and files we use.  This
> causes more build failures.
>
> Complicating matters, any of these 5 things above can crater one gate job
> out of the roughly 15 jobs we run, which would cause the entire gate to fail
> (if the jobs were voting).  I really want a voting gate for Kolla's jobs.  I
> super want it.  The reason we can't make the gates voting at this time is
> the sheer unreliability of the gate.
>
> If anyone is up for a thorough analysis of *why* the gates are failing,
> that would help us fix them.
>
> Regards
> -steve
>
> On 6/15/16, 3:27 AM, "Paul Bourke" <paul.bourke at oracle.com> wrote:
>
>>Hi David,
>>
>>I agree with this completely. Gates continue to be a problem for Kolla;
>>the reasons why have been discussed in the past, but at least for me it's not
>>clear what the key issues are.
>>
>>I've added this item to the agenda for today's IRC meeting (16:00 UTC -
>>https://wiki.openstack.org/wiki/Meetings/Kolla). It may help if we can
>>brainstorm a list of the most common problems here beforehand.
>>
>>To kick things off, rabbitmq seems to cause a disproportionate number of
>>issues, and the problems are difficult to diagnose, particularly when
>>the only way to debug is to submit "DO NOT MERGE" patch sets over and
>>over. Here's an example of a failed centos binary gate from a simple
>>patch set I was reviewing this morning:
>>http://logs.openstack.org/06/329506/1/check/gate-kolla-dsvm-deploy-centos-
>>binary/3486d03/console.html#_2016-06-14_15_36_19_425413
>>
>>Cheers,
>>-Paul
>>
>>On 15/06/16 04:26, David Moreau Simard wrote:
>>> Hi Kolla o/
>>>
>>> I'm writing to you because I'm concerned.
>>>
>>> In case you didn't already know, the RDO community collaborates with
>>> upstream deployment and installation projects to test its packaging.
>>>
>>> This relationship is beneficial in a lot of ways for both parties, in
>>>summary:
>>> - RDO has improved test coverage (because it's otherwise hard to test
>>> different ways of installing, configuring and deploying OpenStack by
>>> ourselves)
>>> - The RDO community works with upstream projects (deployment or core
>>> projects) to fix issues that we find
>>> - In return, the collaborating deployment project can feel more
>>> confident that the RDO packages it consumes have already been tested
>>> using its platform and should work
>>>
>>> To make a long story short, we do this with a project called WeIRDO
>>> [1] which essentially runs gate jobs outside of the gate.
>>>
>>> I tried to get Kolla in our testing pipeline during the Mitaka cycle.
>>> I really did.
>>> I contributed the features I needed in Kolla in order to
>>> make this work, such as the configurable Yum repositories.
>>>
>>> However, in the end, I had to put off the initiative because the gate
>>> jobs were very flappy and unreliable.
>>> We cannot afford to have a job that is *expected* to flap in our
>>> testing pipeline; it leads to a lot of wasted time, effort and
>>> resources.
>>>
>>> I think there have been a lot of improvements since my last attempt, but
>>> to get a sample of data, I looked at ~30 recently merged reviews.
>>> Of 260 total build/deploy jobs, 55 (or over 20%) failed -- and I
>>> didn't account for rechecks, just the last known status of the check
>>> jobs.
>>> I put up the results of those jobs here [2].
>>>
>>> In the case that interests me most, CentOS binary jobs, it's 5
>>> failures out of 50 jobs, so 10%. Not as bad but still a concern for
>>> me.
>>>
>>> Other deployment projects like Puppet-OpenStack, OpenStack Ansible,
>>> Packstack and TripleO have quite a few *voting* integration testing
>>> jobs.
>>> Why are Kolla's jobs non-voting and so unreliable?
>>>
>>> Thanks,
>>>
>>> [1]: https://github.com/rdo-infra/weirdo
>>> [2]:
>>>https://docs.google.com/spreadsheets/d/1NYyMIDaUnlOD2wWuioAEOhjeVmZe7Q8_z
>>>dFfuLjquG4/edit#gid=0
>>>
>>> David Moreau Simard
>>> Senior Software Engineer | Openstack RDO
>>>
>>> dmsimard = [irc, github, twitter]
>>>
>>>
>>>
>>
>
>


