[openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

Deepak Shetty dpkshetty at gmail.com
Wed Feb 25 18:18:33 UTC 2015


On Wed, Feb 25, 2015 at 8:42 PM, Deepak Shetty <dpkshetty at gmail.com> wrote:

>
>
> On Wed, Feb 25, 2015 at 6:34 PM, Jeremy Stanley <fungi at yuggoth.org> wrote:
>
>> On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
>> [...]
>> > Run 2) We removed glusterfs backend, so Cinder was configured with
>> > the default storage backend i.e. LVM. We re-created the OOM here
>> > too
>> >
>> > So that proves that glusterfs doesn't cause it, as it's happening
>> > without glusterfs too.
>>
>> Well, if you re-ran the job on the same VM then the second result is
>> potentially contaminated. Luckily this hypothesis can be confirmed
>> by running the second test on a fresh VM in Rackspace.
>>
>
> Maybe true, but we did the same on the hpcloud provider VM too, and both
> times it ran successfully with glusterfs as the cinder backend. Also,
> before starting the 2nd run, we did unstack and saw that free memory went
> back to 5G+ before re-invoking your script. I believe the contamination
> could result in some additional testcase failures (which we did see) but
> shouldn't be related to whether the system can OOM or not, since that's a
> runtime thing.
>
> I see that the VM is up again. We will execute the 2nd run afresh now and
> update
> here.
>

Ran tempest configured with the default backend (i.e. LVM) and was able to
recreate the OOM issue. So running tempest without gluster against a fresh
VM reliably recreates the OOM.
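(For context, the devstack-side difference between the two runs is just
which Cinder backend local.conf enables; roughly along these lines, where
the glusterfs line is an approximation since the exact settings depend on
the glusterfs devstack integration in use:)

    # Run 1: glusterfs as the Cinder backend (approximate)
    #CINDER_ENABLED_BACKENDS=glusterfs:glusterfs

    # Run 2: devstack's default LVM backend (also the effective default
    # if CINDER_ENABLED_BACKENDS is left unset)
    CINDER_ENABLED_BACKENDS=lvm:lvmdriver-1

Snip below from the rax VM's syslog: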

Feb 25 16:58:37 devstack-centos7-rax-dfw-979654 kernel: glance-api invoked
oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
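
For anyone wanting to poke at the held VM, something along these lines
(standard CentOS 7 paths, nothing specific to our setup) should pull out
the OOM evidence and show what was eating memory at the time:

    # OOM killer activity as recorded in syslog (CentOS 7 logs to
    # /var/log/messages)
    sudo grep -E 'invoked oom-killer|Out of memory|Killed process' /var/log/messages

    # Same info from the kernel ring buffer, with human-readable timestamps
    dmesg -T | grep -i oom

    # Snapshot free memory and the top memory consumers while tempest runs
    free -m
    ps aux --sort=-%mem | head -20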

Had a discussion with clarkb on IRC. Given that F20 is discontinued, F21
has issues with tempest (under debug by ianw), and centos7 also has issues
on rax (as evident from this thread), the only option left is to go with an
ubuntu-based CI job, which BharatK is working on now.

thanx,
deepak


>
>
>>
>> > The VM (104.239.136.99) is now in such a bad shape that existing
>> > ssh sessions have not been responding for a long time now, though
>> > ping works. So we need someone to help reboot/restart the VM so
>> > that we can collect the logs for the record. Couldn't find anyone
>> > during APAC TZ to get it rebooted.
>> [...]
>>
>> According to novaclient that instance was in a "shutoff" state, and
>> so I had to nova reboot --hard to get it running. Looks like it's
>> back up and reachable again now.
>>
>
> Cool, thanks!
>
>
>>
>> > So from the above we can conclude that the tests are running fine
>> > on hpcloud and not on the rax provider. Since the OS (centos7)
>> > inside the VM is the same across providers, this now boils down to
>> > some issue with the rax provider VM + centos7 combination.
>>
>> This certainly seems possible.
>>
>> > Another data point I could gather is:
>> >     The only other centos7 job we have is
>> > check-tempest-dsvm-centos7, and it does not run full tempest;
>> > looking at the job's config it only runs smoke tests (also
>> > confirmed the same with Ian W), which I believe is a subset of
>> > the tests only.
>>
>> Correct, so if we confirm that we can't successfully run full
>> tempest on CentOS 7 in both of our providers yet, we should probably
>> think hard about the implications for yesterday's discussion as to
>> whether to make the smoke version gating on devstack and
>> devstack-gate changes.
>>
>> > So that brings us to the conclusion that the cinder-glusterfs CI
>> > job (check-tempest-dsvm-full-glusterfs-centos7) is probably the
>> > first centos7-based job running full tempest tests in upstream CI
>> > and hence is the first to hit the issue, but on the rax provider
>> > only.
>>
>> Entirely likely. As I mentioned last week, we don't yet have any
>> voting/gating jobs running on the platform as far as I can tell, so
>> it's still very much in an experimental stage.
>>
>
> So is there a way for a job to ask for hpcloud affinity, since that's
> where our job ran well (faster, and only 2 failures, which were
> expected)? I am not sure how easy and time-consuming it would be to root
> cause why the centos7 + rax provider combination is causing the OOM.
>
> Alternatively, do you recommend using some other OS as the base for our
> job: F20, F21, or ubuntu? I assume that there are other jobs in the rax
> provider that run on Fedora or Ubuntu with full tempest and don't OOM;
> would you know?
>
> thanx,
> deepak
>
>
>

