<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Feb 25, 2015 at 6:11 AM, Jeremy Stanley <span dir="ltr">&lt;<a href="mailto:fungi@yuggoth.org" target="_blank">fungi@yuggoth.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 2015-02-25 01:02:07 +0530 (+0530), Bharat Kumar wrote:<br>

[...]<br>

<span class="">&gt; After running 971 test cases VM inaccessible for 569 ticks<br>

</span>[...]<br>

<br>

Glad you're able to reproduce it. For the record that is running<br>

their 8GB performance flavor with a CentOS 7 PVHVM base image. The<br></blockquote><div><br></div><div>We did 2 runs in total on the rax provider VM; the results are below:<br><br></div><div>Run 1) It failed and reproduced the OOM. The setup had glusterfs as the storage<br></div><div>backend for Cinder.<br><br>[deepakcs@deepakcs r6-jeremy-rax-vm]$ grep oom-killer run1-w-gluster/logs/syslog.txt <br>Feb 24 18:41:08 <a href="http://devstack-centos7-rax-dfw-979654.slave.openstack.org">devstack-centos7-rax-dfw-979654.slave.openstack.org</a> kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0<br><br></div><div>Run 2) We <b>removed the glusterfs backend</b>, so Cinder was configured with the default<br></div><div>storage backend, i.e. LVM. <b><u>We reproduced the OOM here too.</u></b><br></div><div><br>So glusterfs is not the cause, since the OOM happens without glusterfs too.<br>The VM (104.239.136.99) is now in such bad shape that existing ssh sessions<br>have not been responding for a long time, though ping works. We need someone to <br></div><div>help reboot/restart the VM so that we can collect the logs for the record. 
We couldn't find anyone<br></div><div>in the APAC time zone to reboot it.<br></div><div><br>After a long wait, we managed to get the grep below to work from another terminal,<br></div><div>which confirms that the OOM did happen in run 2:<br><br>bash-4.2$ sudo cat /var/log/messages| grep oom-killer<br>Feb 25 08:53:16 devstack-centos7-rax-dfw-979654 kernel: ntpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0<br>Feb 25 09:03:35 devstack-centos7-rax-dfw-979654 kernel: beam.smp invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0<br>Feb 25 09:57:28 devstack-centos7-rax-dfw-979654 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0<br>Feb 25 10:40:38 devstack-centos7-rax-dfw-979654 kernel: mysqld invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0</div><div><br></div><br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

steps to recreate are <a href="http://paste.openstack.org/show/181303/" target="_blank">http://paste.openstack.org/show/181303/</a> as<br>

discussed in IRC (for the sake of others following along). I've held<br>

a similar worker in HPCloud (15.126.235.20) which is a 30GB flavor<br></blockquote><div><br></div><div>We did 2 runs in total on the hpcloud provider VM (this time it was set up correctly with 8 GB of RAM, as evident from /proc/meminfo as well as the dstat output).<br><br></div><div>Run 1) It was successful. The setup had glusterfs as the storage<br>backend for Cinder. Only 2 test cases failed, and both failures were expected. No OOM happened.<br><br>[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$ grep oom-killer run1-w-gluster/logs/syslog.txt <br>[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$ <br><br></div><div>Run 2) Since run 1 went fine, we also enabled the tempest volume backup test cases and ran again.<br></div><div>It was successful and no OOM happened.<br><br>[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$ grep oom-killer run2-w-gluster/logs/syslog.txt <br>[deepakcs@deepakcs r7-jeremy-hpcloud-vm]$<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

artificially limited to 8GB through a kernel boot parameter.<br>

Hopefully following the same steps there will help either confirm<br>

the issue isn't specific to running in one particular service<br>

provider, or will yield some useful difference which could help<br>

highlight the cause.<br></blockquote><div><br></div><div>So from the above we can conclude that the tests run fine on hpcloud<br></div><div>but not on the rax provider. Since the OS (centos7) inside the VM is the same across providers, <br>this now boils down to some issue with the rax provider VM + centos7 combination.<br><br></div><div>Another data point I could gather:<br></div><div>    The only other centos7 job we have is check-tempest-dsvm-centos7, and it does not run full tempest.<br></div><div>Looking at the job's config, it only runs the smoke tests (Ian W confirmed this too), which I believe<br></div><div>are only a subset of the tests.<br><br></div><div>So the conclusion is that the cinder-glusterfs CI job (check-tempest-dsvm-full-glusterfs-centos7) is probably the first centos7<br></div><div>based job running the full tempest tests in upstream CI, and hence the first to hit the issue, but on the rax provider only.<br><br></div><div>thanx,<br></div><div>deepak<br></div><div> <br></div></div><br></div></div>