[openstack-dev] [tripleo] CI jobs failures
Ben Nemec
openstack at nemebean.com
Mon Mar 7 18:12:40 UTC 2016
On 03/07/2016 12:00 PM, Derek Higgins wrote:
> On 7 March 2016 at 12:11, John Trowbridge <trown at redhat.com> wrote:
>>
>>
>> On 03/06/2016 11:58 AM, James Slagle wrote:
>>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi <emilien at redhat.com> wrote:
>>>> I'm kind of hijacking Dan's e-mail but I would like to propose some
>>>> technical improvements to stop having so much CI failures.
>>>>
>>>>
>>>> 1/ Stop creating swap files. We don't have SSDs, and swapping to files
>>>> because we don't have enough RAM is IMHO a terrible mistake. In my
>>>> experience, swapping on non-SSD disks is even worse than not having
>>>> enough RAM. We should stop doing that, I think.
>>>
>>> We have been relying on swap in tripleo-ci for a little while. While
>>> not ideal, it has been an effective way to at least be able to test
>>> what we've been testing given the amount of physical RAM that is
>>> available.
>>>
>>> The recent change to add swap to the overcloud nodes has proved to be
>>> unstable, but that has more to do with it being racy with the
>>> validation deployment, AFAICT. There are some patches currently up to
>>> address those issues.
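
(For illustration, setting up a swap file amounts to something like the
following. This is only a rough Python sketch, not the actual tripleo-ci
code, and the path and size below are made up.)

import subprocess

SWAP_PATH = "/swapfile"      # hypothetical path
SWAP_SIZE_MB = 2048          # hypothetical size

def enable_swap_file(path=SWAP_PATH, size_mb=SWAP_SIZE_MB):
    """Create and enable a swap file on disk (slow on non-SSD, see above)."""
    # dd rather than fallocate, since some filesystems don't support the latter
    subprocess.check_call(["dd", "if=/dev/zero", "of=%s" % path,
                           "bs=1M", "count=%d" % size_mb])
    subprocess.check_call(["chmod", "0600", path])
    subprocess.check_call(["mkswap", path])
    subprocess.check_call(["swapon", path])

if __name__ == "__main__":
    enable_swap_file()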
>>>
>>>>
>>>>
>>>> 2/ Split CI jobs in scenarios.
>>>>
>>>> Currently we have CI jobs for ceph, HA, non-HA and containers, and the
>>>> jobs fail randomly due to performance issues.
>>>>
>>>> Puppet OpenStack CI had the same issue: we had one integration job and
>>>> we kept adding more services until everything became *very* unstable.
>>>> We solved that by splitting the job into scenarios:
>>>>
>>>> https://github.com/openstack/puppet-openstack-integration#description
>>>>
>>>> What I propose is to split the TripleO jobs into more jobs, each with
>>>> fewer services.
>>>>
>>>> The benefit of that:
>>>>
>>>> * more service coverage
>>>> * jobs will run faster
>>>> * fewer random failures due to poor performance
>>>>
>>>> The cost, of course, is that it will consume more resources.
>>>> That's why I suggest 3/.
>>>>
>>>> We could have:
>>>>
>>>> * HA job with ceph and a full compute scenario (glance, nova, cinder,
>>>> ceilometer, aodh & gnocchi).
>>>> * Same with IPv6 & SSL.
>>>> * HA job without ceph and full compute scenario too
>>>> * HA job without ceph and basic compute (glance and nova), with extra
>>>> services like Trove, Sahara, etc.
>>>> * ...
>>>> (note: all jobs would have network isolation, which is to me a
>>>> requirement when testing an installer like TripleO).
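
To make the split a bit more concrete, here is a rough sketch of what such
a scenario matrix could look like. The groupings are illustrative, just
lifted from Emilien's list above; the real definitions would live in
tripleo-ci/project-config rather than in Python.

# Illustrative scenario matrix, loosely modelled on the
# puppet-openstack-integration approach linked above.
SCENARIOS = {
    "scenario001-ha-ceph": {
        "ha": True, "ceph": True,
        "services": ["glance", "nova", "cinder", "ceilometer", "aodh",
                     "gnocchi"],
    },
    "scenario002-ha-ceph-ipv6-ssl": {
        "ha": True, "ceph": True, "ipv6": True, "ssl": True,
        "services": ["glance", "nova", "cinder", "ceilometer", "aodh",
                     "gnocchi"],
    },
    "scenario003-ha": {
        "ha": True, "ceph": False,
        "services": ["glance", "nova", "cinder", "ceilometer", "aodh",
                     "gnocchi"],
    },
    "scenario004-ha-extras": {
        "ha": True, "ceph": False,
        "services": ["glance", "nova", "trove", "sahara"],
    },
}

# Per the note above, network isolation would be common to every scenario.
COMMON_OPTIONS = ["network-isolation"]

def services_covered(scenarios=SCENARIOS):
    """Union of services exercised across all of the split jobs."""
    return sorted({svc for cfg in scenarios.values()
                   for svc in cfg["services"]})

if __name__ == "__main__":
    print(services_covered())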
>>>
>>> Each of those jobs would require at least as much memory as our
>>> current HA job, so I don't see how this gets us to using less memory.
>>> The HA job we have now already deploys the minimal set of services
>>> possible given our current architecture. Without the composable
>>> service roles work, we can't deploy fewer services than we already do.
>>>
>>>
>>>
>>>>
>>>> 3/ Drop the non-HA job.
>>>> I'm not sure why we have it, or what benefit it gives us compared to
>>>> HA.
>>>
>>> I actually think we could drop both the ceph and non-HA jobs from the
>>> check-tripleo queue.
>>>
>>> non-HA doesn't test anything realistic, and it doesn't really provide
>>> any faster feedback on patches: at most it runs 15-20 minutes faster
>>> than the HA job on average, and sometimes it even runs slower.
>>>
>>> We could move the ceph job to the experimental queue, to run on demand
>>> on patches that might affect ceph, and it could also be a daily
>>> periodic job.
>>>
>>> The same could be done for the containers job, an IPv6 job, and an
>>> upgrades job. Ideally with a way to run an individual job as needed.
>>> Would we need different experimental queues to do that?
>>>
>>> That would leave only the HA job in the check queue, which we should
>>> run with SSL and network isolation. We could deploy fewer testenvs
>>> since we'd have fewer jobs running, but give the ones we do deploy
>>> more RAM. I think this would really alleviate a lot of the transient
>>> failures we get in CI currently. It would also likely run
>>> faster.
>>>
>>> It's probably worth seeking out some exact evidence from the RDO
>>> centos-ci, because I think they are testing with virtual environments
>>> that have a lot more RAM than tripleo-ci does. It'd be good to
>>> understand whether they hit the same sort of transient failures that
>>> tripleo-ci does.
>>>
>>
>> The HA job in RDO CI is also more unstable than the non-HA one, although
>> that is usually not due to memory contention. Most of the time when I
>> see the HA job fail spuriously in RDO CI, it is because of the Nova
>> scheduler race. I would bet that this race is also the cause of the
>> fluctuating job times, because the recovery mechanism is simply to
>> retry, and each retry can add 15 minutes to the deploy. RDO CI also has
>> a 60-minute deploy timeout. If we can't deploy to virtual machines in
>> under an hour, that is a bug in my view. (Note that by deploy I mean
>> `openstack overcloud deploy`, though start to finish can take less than
>> an hour with decent CPUs.)
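
(As a strawman for that kind of hard limit, a deploy wrapper with a
60-minute timeout could look roughly like this. The command is plain
`openstack overcloud deploy`, but the wrapper itself and any extra
arguments are purely illustrative, not what RDO CI actually runs.)

import subprocess

DEPLOY_TIMEOUT = 60 * 60  # 60 minutes, the RDO CI limit mentioned above

def deploy_overcloud(extra_args=()):
    """Run the overcloud deploy and fail hard if it exceeds the timeout."""
    cmd = ["openstack", "overcloud", "deploy", "--templates"]
    cmd.extend(extra_args)
    try:
        subprocess.check_call(cmd, timeout=DEPLOY_TIMEOUT)
    except subprocess.TimeoutExpired:
        # Per the point above: more than an hour on virt should be a bug.
        raise SystemExit("overcloud deploy exceeded %ds" % DEPLOY_TIMEOUT)

if __name__ == "__main__":
    deploy_overcloud()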
>>
>> RDO CI uses the following layout:
>> Undercloud: 12G RAM, 4 CPUs
>> 3x Control Nodes: 4G RAM, 1 CPU
>> Compute Node: 4G RAM, 1 CPU
> We're currently using 4G overcloud nodes too; if we ever bump this,
> you'll probably have to as well.
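
(Just for comparison, that layout works out to the following per-testenv
RAM footprint; trivial arithmetic on the numbers quoted above.)

# RAM per RDO CI virt environment, in GB, from the layout above.
UNDERCLOUD = 12
CONTROLLERS = 3 * 4
COMPUTE = 4
print("RAM per testenv: %dG" % (UNDERCLOUD + CONTROLLERS + COMPUTE))  # 28G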
>
>>
>> Is there any ability in our current CI setup to auto-identify the cause
>> of a failure? The nova scheduler race has some telltale log snippets we
>> could search for, and we could even auto-recheck jobs that hit known
> We attempted this in the past; IIRC we had some rules in elastic-recheck
> to catch some of the error patterns we were seeing at the time, but
> eventually that work stalled here:
> https://review.openstack.org/#/c/98154/
> Somebody at the time (I don't remember who) agreed to do some of the
> dashboard changes needed, but they must not have gotten the time to do
> it. Maybe we could revisit it; things might have changed enough since
> then that the concerns raised no longer apply.
>
>> issues. That, combined with some record of how often we hit these known
>> issues, would be really helpful.
>
> We can currently use logstash to find specific error patterns for
> errors that make their way into the console log, so for a subset of bugs
> we can see how often we hit them. This could be improved further by
> pushing more of the logs into logstash.
I've been collecting queries here:
https://etherpad.openstack.org/p/tripleo-ci-logstash-queries
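
As a strawman for the auto-identification idea, something along these
lines could scan a job's console log for known signatures and flag it for
recheck. The patterns below are placeholders, not real elastic-recheck
queries; the real ones belong in elastic-recheck or the etherpad above.

import re
import sys

KNOWN_FAILURES = [
    # (human-readable name, regex to look for in the console log)
    ("nova scheduler race (placeholder pattern)",
     re.compile(r"No valid host was found")),
    ("deploy timeout (placeholder pattern)",
     re.compile(r"overcloud deploy.*timed out", re.IGNORECASE)),
]

def classify(console_log_text):
    """Return the names of any known failure signatures found in the log."""
    return [name for name, pattern in KNOWN_FAILURES
            if pattern.search(console_log_text)]

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        matches = classify(f.read())
    for name in matches:
        print("matched known issue: %s" % name)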
>
>>
>>> We really are deploying with the absolute minimum CPU/RAM that is even
>>> possible. I think it's unrealistic to expect a lot of stability in
>>> that scenario, and I think that's a big reason why we get so many
>>> transient failures.
>>>
>>> In summary: give the testenvs more RAM, have one job in the
>>> check-tripleo queue, as many jobs as needed in the experimental queue,
>>> and as many periodic jobs as necessary.
>>>
>> +1 I like this idea.
>>>
>>>>
>>>>
>>>> Any comment / feedback is welcome,
>>>> --
>>>> Emilien Macchi
>>>>
>>>>