[openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

Deepak Shetty dpkshetty at gmail.com
Wed Feb 25 15:12:09 UTC 2015


On Wed, Feb 25, 2015 at 6:34 PM, Jeremy Stanley <fungi at yuggoth.org> wrote:

> On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:
> [...]
> > Run 2) We removed glusterfs backend, so Cinder was configured with
> > the default storage backend i.e. LVM. We re-created the OOM here
> > too
> >
> > So that proves that glusterfs doesn't cause it, as its happening
> > without glusterfs too.
>
> Well, if you re-ran the job on the same VM then the second result is
> potentially contaminated. Luckily this hypothesis can be confirmed
> by running the second test on a fresh VM in Rackspace.
>

Maybe true, but we did the same on the hpcloud provider VM too, and both
times it ran successfully with glusterfs as the Cinder backend. Also, before
starting the 2nd run we ran unstack and saw that free memory went back to
5G+ before re-invoking your script. I believe the contamination could result
in some additional testcase failures (which we did see), but it shouldn't
affect whether the system can OOM or not, since that's a runtime thing.
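
In case it is useful, a rough sketch of the kind of check we do between runs
before re-invoking stack.sh (a minimal sketch only; the 5G threshold just
mirrors what we observed, and it assumes /proc/meminfo is readable on the
node, falling back to MemFree on older kernels without MemAvailable):

    # check_free_mem.py - sketch: confirm memory was reclaimed after unstack
    def free_mem_gib():
        with open('/proc/meminfo') as f:
            info = dict(line.split(':', 1) for line in f)
        # Values are in kB; prefer MemAvailable, fall back to MemFree
        field = 'MemAvailable' if 'MemAvailable' in info else 'MemFree'
        return int(info[field].split()[0]) / (1024.0 * 1024.0)

    if __name__ == '__main__':
        gib = free_mem_gib()
        print('free memory: %.1f GiB' % gib)
        raise SystemExit(0 if gib >= 5.0 else 1)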

I see that the VM is up again. We will execute the 2nd run afresh now and
update
here.


>
> > The VM (104.239.136.99) is now in such a bad shape that existing
> > ssh sessions are no longer responding for a long long time now,
> > tho' ping works. So need someone to help reboot/restart the VM so
> > that we can collect the logs for records. Couldn't find anyone
> > during apac TZ to get it reboot.
> [...]
>
> According to novaclient that instance was in a "shutoff" state, and
> so I had to nova reboot --hard to get it running. Looks like it's
> back up and reachable again now.
>

Cool, thanks!


>
> > So from the above we can conclude that the tests are running fine
> > on hpcloud and not on rax provider. Since the OS (centos7) inside
> > the VM across provider is same, this now boils down to some issue
> > with rax provider VM + centos7 combination.
>
> This certainly seems possible.
>
> > Another data point I could gather is:
> >     The only other centos7 job we have is
> > check-tempest-dsvm-centos7 and it does not run full tempest
> > looking at the job's config it only runs smoke tests (also
> > confirmed the same with Ian W) which i believe is a subset of
> > tests only.
>
> Correct, so if we confirm that we can't successfully run tempest
> full on CentOS 7 in both of our providers yet, we should probably
> think hard about the implications on yesterday's discussion as to
> whether to set the smoke version gating on devstack and
> devstack-gate changes.
>
> > So that brings to the conclusion that probably cinder-glusterfs CI
> > job (check-tempest-dsvm-full-glusterfs-centos7) is the first
> > centos7 based job running full tempest tests in upstream CI and
> > hence is the first to hit the issue, but on rax provider only
>
> Entirely likely. As I mentioned last week, we don't yet have any
> voting/gating jobs running on the platform as far as I can tell, so
> it's still very much in an experimental stage.
>

So is there a way for a job to ask for hpcloud affinity, since that's where
our job ran well (faster, and with only 2 failures, which were expected)? I
am not sure how easy or time-consuming it would be to root-cause why the
centos7 + rax provider combination is causing the OOM.
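
For anyone repeating this, the quickest way we know to confirm the OOM on a
node is to look for the kernel OOM-killer trace in dmesg; a minimal sketch
of that check (assuming dmesg is runnable by the job user, nothing
job-specific):

    # oom_check.py - sketch: list kernel OOM-killer events from dmesg
    import subprocess

    def oom_events():
        out = subprocess.check_output(['dmesg']).decode('utf-8', 'replace')
        return [l for l in out.splitlines()
                if 'Out of memory' in l or 'oom-killer' in l]

    if __name__ == '__main__':
        for line in oom_events():
            print(line)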

Alternatively, do you recommend using some other OS as the base for our job
(F20, F21, or Ubuntu)? I assume there are other jobs on the rax provider
that run full tempest on Fedora or Ubuntu and don't OOM; would you know?

thanx,
deepak

