<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Feb 25, 2015 at 6:34 PM, Jeremy Stanley <span dir="ltr"><<a href="mailto:fungi@yuggoth.org" target="_blank">fungi@yuggoth.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 2015-02-25 17:02:34 +0530 (+0530), Deepak Shetty wrote:<br>

[...]<br>

<span class="">> Run 2) We removed glusterfs backend, so Cinder was configured with<br>

> the default storage backend i.e. LVM. We re-created the OOM here<br>

> too<br>

><br>

> So that proves that glusterfs doesn't cause it, as it's happening<br>

> without glusterfs too.<br>

<br>

</span>Well, if you re-ran the job on the same VM then the second result is<br>

potentially contaminated. Luckily this hypothesis can be confirmed<br>

by running the second test on a fresh VM in Rackspace.<br></blockquote><div><br></div><div>Maybe true, but we did the same on an hpcloud provider VM too, and both times<br></div><div>it ran successfully with glusterfs as the cinder backend. Also, before starting<br></div><div>the 2nd run, we ran unstack and saw that free memory went back to 5G+,<br>and only then re-invoked your script. I believe the contamination could result in some<br></div><div>additional testcase failures (which we did see) but shouldn't affect<br></div><div>whether the system can OOM or not, since that's a runtime thing.<br><br></div><div>I see that the VM is up again. We will execute the 2nd run afresh now and post an update<br></div><div>here.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

> The VM (104.239.136.99) is now in such a bad shape that existing<br>

> ssh sessions are no longer responding for a long long time now,<br>

> tho' ping works. So need someone to help reboot/restart the VM so<br>

> that we can collect the logs for records. Couldn't find anyone<br>

> during apac TZ to get it rebooted.<br>

</span>[...]<br>

<br>

According to novaclient that instance was in a "shutoff" state, and<br>

so I had to nova reboot --hard to get it running. Looks like it's<br>

back up and reachable again now.<br></blockquote><div><br></div><div>Cool, thanks!<br> <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<span class=""><br>

> So from the above we can conclude that the tests are running fine<br>

> on hpcloud and not on rax provider. Since the OS (centos7) inside<br>

> the VM across providers is the same, this now boils down to some issue<br>

> with rax provider VM + centos7 combination.<br>

<br>

</span>This certainly seems possible.<br>

<span class=""><br>

> Another data point I could gather is:<br>

>     The only other centos7 job we have is<br>

> check-tempest-dsvm-centos7 and it does not run full tempest<br>

> looking at the job's config it only runs smoke tests (also<br>

> confirmed the same with Ian W) which I believe is a subset of<br>

> tests only.<br>

<br>

</span>Correct, so if we confirm that we can't successfully run tempest<br>

full on CentOS 7 in both of our providers yet, we should probably<br>

think hard about the implications on yesterday's discussion as to<br>

whether to set the smoke version gating on devstack and<br>

devstack-gate changes.<br>

<span class=""><br>

> So that brings us to the conclusion that probably cinder-glusterfs CI<br>

> job (check-tempest-dsvm-full-glusterfs-centos7) is the first<br>

> centos7 based job running full tempest tests in upstream CI and<br>

</span>> hence is the first to hit the issue, but on rax provider only<br>

<br>

Entirely likely. As I mentioned last week, we don't yet have any<br>

voting/gating jobs running on the platform as far as I can tell, so<br>

it's still very much in an experimental stage.<br></blockquote><div><br></div><div>So is there a way for a job to ask for hpcloud affinity, since that's where our <br></div><div>job ran well (faster and with only 2 failures, which were expected)? I am not sure<br></div><div>how easy or time-consuming it would be to find the root cause of why the centos7 + rax provider combination<br></div><div>is causing the OOM.<br><br></div><div>Alternatively, do you recommend using some other OS as the base for our job, such as <br></div><div>F20, F21 or Ubuntu? I assume that there are other jobs in the rax provider that<br></div><div>run on Fedora or Ubuntu with full tempest and don't OOM. Would you know?<br></div><div><br></div><div>thanx,<br></div><div>deepak<br></div><div> <br></div></div><br></div></div>