[openstack-dev] [tripleo] CI jobs failures

Dmitry Tantsur dtantsur at redhat.com
Mon Mar 7 10:04:58 UTC 2016


On 03/06/2016 05:58 PM, James Slagle wrote:
> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi <emilien at redhat.com> wrote:
>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>> technical improvements to stop having so much CI failures.
>>
>>
>> 1/ Stop creating swap files. We don't have SSD, this is IMHO a terrible
>> mistake to swap on files because we don't have enough RAM. In my
>> experience, swaping on non-SSD disks is even worst that not having
>> enough RAM. We should stop doing that I think.
>
> We have been relying on swap in tripleo-ci for a little while. While
> not ideal, it has been an effective way to at least be able to test
> what we've been testing given the amount of physical RAM that is
> available.
>
> The recent change to add swap to the overcloud nodes has proved to be
> unstable. But that has more to do with it being racey with the
> validation deployment afaict. There are some patches currently up to
> address those issues.
>
>>
>>
>> 2/ Split CI jobs in scenarios.
>>
>> Currently we have CI jobs for ceph, HA, non-ha, containers and the
>> current situation is that jobs fail randomly, due to performances issues.
>>
>> Puppet OpenStack CI had the same issue where we had one integration job
>> and we never stopped adding more services until all becomes *very*
>> unstable. We solved that issue by splitting the jobs and creating scenarios:
>>
>> https://github.com/openstack/puppet-openstack-integration#description
>>
>> What I propose is to split TripleO jobs in more jobs, but with less
>> services.
>>
>> The benefit of that:
>>
>> * more services coverage
>> * jobs will run faster
>> * less random issues due to bad performances
>>
>> The cost is of course it will consume more resources.
>> That's why I suggest 3/.
>>
>> We could have:
>>
>> * HA job with ceph and a full compute scenario (glance, nova, cinder,
>> ceilometer, aodh & gnocchi).
>> * Same with IPv6 & SSL.
>> * HA job without ceph and full compute scenario too
>> * HA job without ceph and basic compute (glance and nova), with extra
>> services like Trove, Sahara, etc.
>> * ...
>> (note: all jobs would have network isolation, which is to me a
>> requirement when testing an installer like TripleO).
>
> Each of those jobs would at least require as much memory as our
> current HA job. I don't see how this gets us to using less memory. The
> HA job we have now already deploys the minimal amount of services that
> is possible given our current architecture. Without the composable
> service roles work, we can't deploy less services than we already are.
>
>
>
>>
>> 3/ Drop non-ha job.
>> I'm not sure why we have it, and the benefit of testing that comparing
>> to HA.
>
> In my opinion, I actually think that we could drop the ceph and non-ha
> job from the check-tripleo queue.
>
> non-ha doesn't test anything realistic, and it doesn't really provide
> any faster feedback on patches. It seems at most it might run 15-20
> minutes faster than the HA job on average. Sometimes it even runs
> slower than the HA job.

The non-HA job is the only job with introspection. So you'll have to 
enable introspection on the HA job, bumping its run time.

>
> The ceph job we could move to the experimental queue to run on demand
> on patches that might affect ceph, and it could also be a daily
> periodic job.
>
> The same could be done for the containers job, an IPv6 job, and an
> upgrades job. Ideally with a way to run an individual job as needed.
> Would we need different experimental queues to do that?
>
> That would leave only the HA job in the check queue, which we should
> run with SSL and network isolation. We could deploy less testenv's
> since we'd have less jobs running, but give the ones we do deploy more
> RAM. I think this would really alleviate a lot of the transient
> intermittent failures we get in CI currently. It would also likely run
> faster.
>
> It's probably worth seeking out some exact evidence from the RDO
> centos-ci, because I think they are testing with virtual environments
> that have a lot more RAM than tripleo-ci does. It'd be good to
> understand if they have some of the transient failures that tripleo-ci
> does as well.
>
> We really are deploying on the absolute minimum cpu/ram requirements
> that is even possible. I think it's unrealistic to expect a lot of
> stability in that scenario. And I think that's a big reason why we get
> so many transient failures.
>
> In summary: give the testenv's more ram, have one job in the
> check-tripleo queue, as many jobs as needed in the experimental queue,
> and as many periodic jobs as necessary.
>
>
>>
>>
>> Any comment / feedback is welcome,
>> --
>> Emilien Macchi
>>
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
>




More information about the OpenStack-dev mailing list