[openstack-dev] [tripleo] CI jobs failures

Derek Higgins derekh at redhat.com
Mon Mar 7 15:24:50 UTC 2016


On 6 March 2016 at 16:58, James Slagle <james.slagle at gmail.com> wrote:
> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi <emilien at redhat.com> wrote:
>> I'm kind of hijacking Dan's e-mail, but I would like to propose some
>> technical improvements to stop having so many CI failures.
>>
>>
>> 1/ Stop creating swap files. We don't have SSDs; IMHO it is a terrible
>> mistake to swap to files because we don't have enough RAM. In my
>> experience, swapping on non-SSD disks is even worse than not having
>> enough RAM. We should stop doing that, I think.
>
> We have been relying on swap in tripleo-ci for a little while. While
> not ideal, it has been an effective way to at least be able to test
> what we've been testing given the amount of physical RAM that is
> available.

Ok, so I have a few points here; in places where I'm making
assumptions I'll try to point that out.

o Yes, I agree using swap should be avoided if at all possible.

o We are currently looking into adding more RAM to our testenv hosts,
at which point we can afford to be a little more liberal with memory
and this problem should become less of an issue. Having said that:

o Even though using swap is bad, if we have some processes with a
large memory footprint that don't require constant access to part of
that footprint, swapping it out for the duration of the CI test isn't
as expensive as it might sound (assuming it doesn't need to be swapped
back in and the kernel has selected good candidates to swap out). The
first sketch below these points shows one way to check which processes
actually have pages swapped out.

o The hosts that run the undercloud and overcloud nodes have 64G of
RAM each; they each host 4 testenvs, and each testenv running a HA job
can use up to 21G of RAM, so we have overcommitted there (4 x 21G =
84G of potential peak demand against 64G of physical RAM). This is
only a problem if a testenv host gets 4 HA jobs that are started
around the same time (and as a result each has 4 overcloud nodes
running at the same time); to allow this to happen without VMs being
killed by the OOM killer we've also enabled swap there. The majority
of the time this swap isn't in use; it only comes into play if all 4
testenvs are in use simultaneously and they are all running the second
half of a CI test at the same time.

o The overcloud nodes are VMs running with an "unsafe" disk caching
mechanism; this causes sync requests from the guest to be ignored, and
as a result, if the instances hosted on these nodes go into swap, that
swap will be cached on the host for as long as RAM is available. i.e.
swap being used in the undercloud or overcloud isn't synced to the
disk on the host unless it has to be. The second sketch below these
points shows how to confirm the cache mode a given VM is using.

o What I'd like us to avoid is simply bumping up the memory every time
we hit an OOM error without at least:
  1. Explaining why we need more memory all of a sudden
  2. Looking into a way we might be able to avoid simply bumping the
RAM (at peak times we are memory constrained)
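
To put some substance behind the swap point above, something along
these lines would show which processes actually have pages swapped out
and how much. This is just an untested sketch that reads VmSwap from
/proc; run it as root on the node in question:

  # Sketch: list processes with pages swapped out, largest first.
  import os

  def swapped_processes():
      results = []
      for pid in filter(str.isdigit, os.listdir('/proc')):
          try:
              with open('/proc/%s/status' % pid) as f:
                  fields = dict(line.split(':', 1) for line in f if ':' in line)
          except IOError:
              continue  # process exited while we were looking
          swap_kb = int(fields.get('VmSwap', '0 kB').split()[0])
          if swap_kb:
              results.append((swap_kb, fields.get('Name', '?').strip(), pid))
      return sorted(results, reverse=True)

  for swap_kb, name, pid in swapped_processes():
      print('%8d kB  %s (pid %s)' % (swap_kb, name, pid))

If the bulk of what's swapped out belongs to processes that sit idle
for most of the test, the cost is a lot smaller than the raw swap
numbers suggest.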
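
For the "unsafe" caching point, that's the cache= attribute on the
disk driver in the libvirt domain XML. A quick way to confirm what a
given VM is actually using would be something like the following
sketch (assumes the libvirt python bindings are installed;
"baremetal_0" is a made-up domain name, substitute whatever the
testenv calls its VMs):

  # Sketch: print the disk cache mode for each disk of a libvirt guest.
  import libvirt
  import xml.etree.ElementTree as ET

  conn = libvirt.open('qemu:///system')
  dom = conn.lookupByName('baremetal_0')  # hypothetical domain name
  root = ET.fromstring(dom.XMLDesc(0))
  for disk in root.findall('./devices/disk'):
      driver = disk.find('driver')
      target = disk.find('target')
      if driver is not None and target is not None:
          print('%s: cache=%s' % (target.get('dev'),
                                  driver.get('cache', 'default')))
  conn.close()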

As an example, let's take a look at the swap usage on the undercloud
of a recent CI nonha job[1][2]. These instances have 5G of RAM with 2G
of swap enabled via a swapfile; the overcloud deploy started at
22:07:46 and finished at 22:28:06.
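
(For anyone not familiar with the setup, a 2G swapfile like that is
normally created and enabled roughly as below; this is the generic
recipe, not necessarily the exact commands tripleo-ci runs:)

  # Rough sketch of setting up a 2G swapfile.
  import subprocess

  def enable_swapfile(path='/swapfile', size_mb=2048):
      # allocate the file, lock down permissions, format and enable it
      subprocess.check_call(['dd', 'if=/dev/zero', 'of=%s' % path,
                             'bs=1M', 'count=%d' % size_mb])
      subprocess.check_call(['chmod', '600', path])
      subprocess.check_call(['mkswap', path])
      subprocess.check_call(['swapon', path])

  enable_swapfile()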

In the graph you'll see a spike in memory being swapped out around
22:09; this corresponds almost exactly to when the overcloud image is
being downloaded from swift[3]. Looking at the top output at the end
of the test, you'll see that swift-proxy is using over 500M of
memory[4].

I'd much prefer we spend time looking into why the swift proxy is
using this much memory rather than blindly bumping the memory
allocated to the VM; perhaps we have something configured incorrectly
or we've hit a bug in swift.
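
If someone wants to pick that up, a simple starting point might be to
sample the resident memory of the swift processes over the course of a
run and line it up against the CI timeline, along these lines (a
sketch; the "swift-proxy" name filter is an assumption, top truncates
the name to swift-prox+):

  # Sketch: sample swift-proxy resident memory every 30s.
  import subprocess
  import time

  def swift_proxy_rss_kb():
      out = subprocess.check_output(['ps', 'axo', 'rss=,comm='])
      total = 0
      for line in out.decode().splitlines():
          if not line.strip():
              continue
          rss, comm = line.split(None, 1)
          if 'swift-proxy' in comm:
              total += int(rss)
      return total

  while True:
      print('%s swift-proxy rss=%d kB' % (time.strftime('%H:%M:%S'),
                                          swift_proxy_rss_kb()))
      time.sleep(30)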

Having said all that, we can bump the memory allocated to each node,
but we have to accept 1 of 2 possible consequences:
1. We'll end up using the swap on the testenv hosts more than we
currently are, or
2. We'll have to reduce the number of testenvs per host from 4 down
to 3, wiping out 25% of our capacity.

[1] - http://logs.openstack.org/85/289085/2/check-tripleo/gate-tripleo-ci-f22-nonha/6fda33c/
[2] - http://goodsquishy.com/downloads/20160307/swap.png
[3] - 22:09:03 21678 INFO [-] Master cache miss for image
b6a96213-7955-4c4d-829e-871350939e03, starting download
      22:09:41 21678 DEBUG [-] Running cmd (subprocess): qemu-img info
/var/lib/ironic/master_images/tmpvjAlCU/b6a96213-7955-4c4d-829e-871350939e03.part
[4] - 17690 swift     20   0  804824 547724   1780 S   0.0 10.8
0:04.82 swift-prox+


>
> The recent change to add swap to the overcloud nodes has proved to be
> unstable. But that has more to do with it being racey with the
> validation deployment afaict. There are some patches currently up to
> address those issues.
>
>>
>>
>> 2/ Split CI jobs in scenarios.
>>
>> Currently we have CI jobs for ceph, HA, non-ha, containers, and the
>> current situation is that jobs fail randomly, due to performance issues.

We don't know that it's due to performance issues. You're probably
correct that we wouldn't see these failures if we were allocating more
resources to the CI tests, but that just means we have timing issues
that are more prevalent when resource constrained. I think the answer
here is for somebody to spend the time to root cause each false
negative we get, fix where appropriate, and then keep doing it; timing
issues will continue to sneak in, and if we're not keeping on top of
them we end up in the recheck hell we're currently in.

>>
>> Puppet OpenStack CI had the same issue where we had one integration job
>> and we never stopped adding more services until it all became *very*
>> unstable. We solved that issue by splitting the jobs and creating scenarios:
>>
>> https://github.com/openstack/puppet-openstack-integration#description
>>
>> What I propose is to split TripleO jobs in more jobs, but with less
>> services.
>>
>> The benefit of that:
>>
>> * more service coverage
>> * jobs will run faster
>> * fewer random issues due to bad performance
>>
>> The cost is of course it will consume more resources.
>> That's why I suggest 3/.
>>
>> We could have:
>>
>> * HA job with ceph and a full compute scenario (glance, nova, cinder,
>> ceilometer, aodh & gnocchi).
>> * Same with IPv6 & SSL.
>> * HA job without ceph and full compute scenario too
>> * HA job without ceph and basic compute (glance and nova), with extra
>> services like Trove, Sahara, etc.
>> * ...
>> (note: all jobs would have network isolation, which is to me a
>> requirement when testing an installer like TripleO).
>
> Each of those jobs would at least require as much memory as our
> current HA job. I don't see how this gets us to using less memory. The
> HA job we have now already deploys the minimal amount of services that
> is possible given our current architecture. Without the composable
> service roles work, we can't deploy fewer services than we already are.

Ya, this seems like an increase in the amount of resource usage. It
may be doable when we have the increased RAM in place, so once we have
the extra capacity I think it would be a worthwhile task to revisit
how many jobs we run and exactly what they're doing, to see if we can
get more coverage like you suggest.

>
>
>
>>
>> 3/ Drop non-ha job.
>> I'm not sure why we have it, or what the benefit of testing it is
>> compared to HA.
>
> In my opinion, I actually think that we could drop the ceph and non-ha
> job from the check-tripleo queue.
>
> non-ha doesn't test anything realistic, and it doesn't really provide
> any faster feedback on patches. It seems at most it might run 15-20
> minutes faster than the HA job on average. Sometimes it even runs
> slower than the HA job.
>
> The ceph job we could move to the experimental queue to run on demand
> on patches that might affect ceph, and it could also be a daily
> periodic job.
>
> The same could be done for the containers job, an IPv6 job, and an
> upgrades job. Ideally with a way to run an individual job as needed.
> Would we need different experimental queues to do that?
>
> That would leave only the HA job in the check queue, which we should
> run with SSL and network isolation. We could deploy fewer testenvs
> since we'd have fewer jobs running, but give the ones we do deploy more
> RAM. I think this would really alleviate a lot of the transient
> intermittent failures we get in CI currently. It would also likely run
> faster.
>
> It's probably worth seeking out some exact evidence from the RDO
> centos-ci, because I think they are testing with virtual environments
> that have a lot more RAM than tripleo-ci does. It'd be good to
> understand if they have some of the transient failures that tripleo-ci
> does as well.
>
> We really are deploying on the absolute minimum cpu/ram requirements
> that are even possible. I think it's unrealistic to expect a lot of
> stability in that scenario. And I think that's a big reason why we get
> so many transient failures.
>
> In summary: give the testenv's more ram, have one job in the
> check-tripleo queue, as many jobs as needed in the experimental queue,
> and as many periodic jobs as necessary.

On the face of it this seems like a good option, until we want to test
multiple mutually exclusive features; then we're back to needing
multiple jobs.
Although we should probably draw up the matrix of what we want to
test; maybe we can eliminate 1 job.

>
>
>>
>>
>> Any comment / feedback is welcome,
>> --
>> Emilien Macchi
>>
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
>
> --
> -- James Slagle
> --
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


