[openstack-dev] [tripleo] CI jobs failures
Derek Higgins
derekh at redhat.com
Mon Mar 7 18:00:40 UTC 2016
On 7 March 2016 at 12:11, John Trowbridge <trown at redhat.com> wrote:
>
>
> On 03/06/2016 11:58 AM, James Slagle wrote:
>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi <emilien at redhat.com> wrote:
>>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>>> technical improvements to stop having so much CI failures.
>>>
>>>
>>> 1/ Stop creating swap files. We don't have SSDs, and IMHO it is a terrible
>>> mistake to swap to files because we don't have enough RAM. In my
>>> experience, swapping on non-SSD disks is even worse than not having
>>> enough RAM. We should stop doing that, I think.
>>
>> We have been relying on swap in tripleo-ci for a little while. While
>> not ideal, it has been an effective way to at least be able to test
>> what we've been testing given the amount of physical RAM that is
>> available.
>>
>> The recent change to add swap to the overcloud nodes has proved to be
>> unstable, but that has more to do with it being racy with the
>> validation deployment, afaict. There are some patches currently up to
>> address those issues.
>>
>>>
>>>
>>> 2/ Split CI jobs in scenarios.
>>>
>>> Currently we have CI jobs for ceph, HA, non-HA, and containers, and the
>>> current situation is that jobs fail randomly due to performance issues.
>>>
>>> Puppet OpenStack CI had the same issue: we had one integration job and
>>> we never stopped adding more services until it all became *very*
>>> unstable. We solved that issue by splitting the jobs and creating scenarios:
>>>
>>> https://github.com/openstack/puppet-openstack-integration#description
>>>
>>> What I propose is to split TripleO jobs in more jobs, but with less
>>> services.
>>>
>>> The benefits of that:
>>>
>>> * more service coverage
>>> * jobs will run faster
>>> * fewer random failures due to poor performance
>>>
>>> The cost, of course, is that it will consume more resources.
>>> That's why I suggest 3/.
>>>
>>> We could have:
>>>
>>> * HA job with ceph and a full compute scenario (glance, nova, cinder,
>>> ceilometer, aodh & gnocchi).
>>> * Same with IPv6 & SSL.
>>> * HA job without ceph and full compute scenario too
>>> * HA job without ceph and basic compute (glance and nova), with extra
>>> services like Trove, Sahara, etc.
>>> * ...
>>> (note: all jobs would have network isolation, which is to me a
>>> requirement when testing an installer like TripleO).
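
Purely for illustration: a split like that could be captured as a small
scenario/service matrix, similar in spirit to the puppet-openstack-integration
scenarios. The job names and service groupings below are made up to show the
shape of the idea, not a concrete proposal.

# Hypothetical scenario matrix; names and groupings are illustrative only.
SCENARIOS = {
    "ha-ceph-full-compute": {
        "ceph": True, "ipv6": False, "ssl": False,
        "services": ["glance", "nova", "cinder", "ceilometer", "aodh", "gnocchi"],
    },
    "ha-ceph-ipv6-ssl": {
        "ceph": True, "ipv6": True, "ssl": True,
        "services": ["glance", "nova", "cinder", "ceilometer", "aodh", "gnocchi"],
    },
    "ha-full-compute": {
        "ceph": False, "ipv6": False, "ssl": False,
        "services": ["glance", "nova", "cinder", "ceilometer", "aodh", "gnocchi"],
    },
    "ha-basic-compute-extras": {
        "ceph": False, "ipv6": False, "ssl": False,
        "services": ["glance", "nova", "trove", "sahara"],
    },
}

def total_coverage(scenarios):
    """Union of all services exercised across the scenario jobs."""
    covered = set()
    for config in scenarios.values():
        covered.update(config["services"])
    return covered

print(sorted(total_coverage(SCENARIOS)))

The point being that each individual job stays small while the union of
services covered across the jobs grows.
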
>>
>> Each of those jobs would require at least as much memory as our
>> current HA job, so I don't see how this gets us to using less memory. The
>> HA job we have now already deploys the minimal set of services that
>> is possible given our current architecture. Without the composable
>> service roles work, we can't deploy fewer services than we already do.
>>
>>
>>
>>>
>>> 3/ Drop the non-HA job.
>>> I'm not sure why we have it, or what the benefit of testing it is
>>> compared to HA.
>>
>> In my opinion, we could drop the ceph and non-HA
>> jobs from the check-tripleo queue.
>>
>> The non-HA job doesn't test anything realistic, and it doesn't really
>> provide any faster feedback on patches. At most it might run 15-20
>> minutes faster than the HA job on average, and sometimes it even runs
>> slower than the HA job.
>>
>> The ceph job we could move to the experimental queue to run on demand
>> on patches that might affect ceph, and it could also be a daily
>> periodic job.
>>
>> The same could be done for the containers job, an IPv6 job, and an
>> upgrades job. Ideally with a way to run an individual job as needed.
>> Would we need different experimental queues to do that?
>>
>> That would leave only the HA job in the check queue, which we should
>> run with SSL and network isolation. We could deploy fewer testenvs
>> since we'd have fewer jobs running, but give the ones we do deploy more
>> RAM. I think this would really alleviate a lot of the transient,
>> intermittent failures we get in CI currently. It would also likely run
>> faster.
>>
>> It's probably worth seeking out some exact evidence from the RDO
>> centos-ci, because I think they are testing with virtual environments
>> that have a lot more RAM than tripleo-ci does. It'd be good to
>> understand if they have some of the transient failures that tripleo-ci
>> does as well.
>>
>
> The HA job in RDO CI is also more unstable than non-HA, although that is
> usually not due to memory contention. Most of the time that I see
> the HA job fail spuriously in RDO CI, it is because of the Nova
> scheduler race. I would bet that this race is also the cause of the
> fluctuating job times, because the recovery mechanism for it is just to
> retry, and each retry can add 15 min. to the deploy. In RDO CI there is
> a 60 min. timeout for deploy as well. If we can't deploy to virtual
> machines in under an hour, to me that is a bug. (Note: I am speaking of
> `openstack overcloud deploy` when I say deploy, though start to finish
> can take less than an hour with decent CPUs.)
>
> RDO CI uses the following layout:
> Undercloud: 12G RAM, 4 CPUs
> 3x Control Nodes: 4G RAM, 1 CPU
> Compute Node: 4G RAM, 1 CPU
We're currently using 4G overcloud nodes as well; if we ever bump this,
you'll probably have to do the same.
>
> Is there any ability in our current CI setup to auto-identify the cause
> of a failure? The Nova scheduler race has some telltale log snippets we
> could search for, and we could even auto-recheck jobs that hit known
We attempted this in the past. IIRC we had some rules in elastic-recheck
to catch some of the error patterns we were seeing at the time, but
eventually that work stalled here:
https://review.openstack.org/#/c/98154/
Somebody at the time (I don't remember who) then agreed to do some of
the dashboard changes needed, but they mustn't have gotten the time to
do it. Maybe we could revisit it; things might have changed enough since
then that the concerns raised no longer apply.
> issues. That, combined with some record of how often we hit these known
> issues, would be really helpful.
We can currently use logstash to find specific error patterns for
errors that make their way to the console log, so for a subset of bugs
we can already see how often we hit them. This could be improved further
by stashing more into logstash.
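
To make that concrete, something like the rough sketch below (not what
elastic-recheck actually does; the signature patterns and their names are
placeholders) could classify failed console logs against known failure
signatures and tally how often each one is hit:

#!/usr/bin/env python
# Rough sketch: scan downloaded console logs for known failure signatures
# and count how often each one appears. The patterns below are placeholders,
# not real tripleo-ci signatures.
import re
import sys
from collections import Counter

KNOWN_FAILURES = {
    "nova-scheduler-race": re.compile(r"No valid host was found"),
    "deploy-timeout": re.compile(r"deployment exceeded .* timeout", re.IGNORECASE),
}

def classify(log_text):
    """Return the names of all known failure signatures found in one log."""
    return [name for name, pattern in KNOWN_FAILURES.items()
            if pattern.search(log_text)]

def main(paths):
    counts = Counter()
    for path in paths:
        with open(path) as f:
            for name in classify(f.read()):
                counts[name] += 1
    for name, count in counts.most_common():
        print("%-25s %s" % (name, count))

if __name__ == "__main__":
    main(sys.argv[1:])

Fed from logstash instead of local files, something along those lines would
give us the record of how often we hit each known issue, and could feed an
auto-recheck decision for failures that match a known transient bug.
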
>
>> We really are deploying with the absolute minimum CPU/RAM
>> that is even possible. I think it's unrealistic to expect a lot of
>> stability in that scenario, and I think that's a big reason why we get
>> so many transient failures.
>>
>> In summary: give the testenvs more RAM, have one job in the
>> check-tripleo queue, as many jobs as needed in the experimental queue,
>> and as many periodic jobs as necessary.
>>
> +1 I like this idea.
>>
>>>
>>>
>>> Any comment / feedback is welcome,
>>> --
>>> Emilien Macchi
>>>
>>>
>>
>>
>>
>