<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty <span dir="ltr"><<a href="mailto:dpkshetty@gmail.com" target="_blank">dpkshetty@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div><div><div><div>Hi Jeremy,<br></div> I couldn't find anything conclusive in the logs to explain the OOM.<br></div>At the time the OOM happens, the mysqld and java processes hold the most RAM, so the OOM killer selects mysqld (4.7G) to be killed.<br><br></div>From a glusterfs backend perspective, I haven't found anything suspicious, and we don't have the glusterfs logs (typically in /var/log/glusterfs), so we can't delve into glusterfs too much :(<br><br></div><div>BharatK (in CC) also tried to re-create the issue in a local VM setup, but hasn't been able to reproduce it yet!<br></div><div><br>Having said that,<u><b> we do know</b></u> that we started seeing this issue after we enabled the nova-assisted-snapshot tests (by changing nova's policy.json to allow non-admin users to create hypervisor-assisted snapshots). We think that enabling online snapshots may have added to the number of tests and the memory load, and that's the only clue we have as of now!<br><br></div></div></div></blockquote><div><br></div><div>It looks like the OOM killer hit while qemu was busy, during a ServerRescueTest. 
Maybe libvirt logs would be useful as well?</div><div><br></div><div>And I don't see any tempest tests calling assisted-volume-snapshots.</div><div><br></div><div>Also this looks odd: Feb 19 18:47:16 <a href="http://devstack-centos7-rax-iad-916633.slave.openstack.org">devstack-centos7-rax-iad-916633.slave.openstack.org</a> libvirtd[3753]: missing __com.redhat_reason in disk io error event</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><div><div>So:<br><br> 1) BharatK has merged the patch ( <a href="https://review.openstack.org/#/c/157707/" target="_blank">https://review.openstack.org/#/c/157707/</a> ) to revert the policy.json change in the glusterfs job, so there are no more nova-assisted-snapshot tests.<br></div><br> 2) We are also increasing the timeout of our job in patch ( <a href="https://review.openstack.org/#/c/157835/1" target="_blank">https://review.openstack.org/#/c/157835/1</a> ) so that we can get a full run without timeouts and do a proper analysis of the logs (logs are not posted if the job times out).<br><br></div>Can you please re-enable our job so that we can confirm whether disabling the online snapshot test cases helps? If it does, that will help us narrow down the issue.<br><br>We also plan to monitor and debug over the weekend, so having the job enabled would help us a lot.<br><br>thanx,<br>deepak<br><div><div><br></div></div></div><div class=""><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley <span dir="ltr"><<a href="mailto:fungi@yuggoth.org" target="_blank">fungi@yuggoth.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:<br>
[...]<br>
<span>> For some reason we are seeing the centos7 glusterfs CI job getting<br>
</span>> aborted/killed either by a Java exception or the build getting<br>
> aborted due to timeout.<br>
[...]<br>
<span>> Hoping to root cause this soon and get the cinder-glusterfs CI job<br>
> back online soon.<br>
<br>
</span>I manually reran the same commands this job runs on an identical<br>
virtual machine and was able to reproduce some substantial<br>
weirdness.<br>
<br>
I temporarily lost remote access to the VM around 108 minutes into<br>
running the job (~17:50 in the logs) and the out-of-band console<br>
also became unresponsive to carriage returns. The machine's IP<br>
address still responded to ICMP ping, but attempts to open new TCP<br>
sockets to the SSH service never got a protocol version banner back.<br>
After about 10 minutes of that I went out to lunch but left<br>
everything untouched. To my excitement it was up and responding<br>
again when I returned.<br>
<br>
It appears from the logs that it runs well past the 120-minute mark<br>
where devstack-gate tries to kill the gate hook for its configured<br>
timeout. Somewhere around 165 minutes in (18:47) you can see the<br>
kernel out-of-memory killer starts to kick in and kill httpd and<br>
mysqld processes according to the syslog. Hopefully this is enough<br>
additional detail to get you a start at finding the root cause so<br>
that we can re-enable your job. Let me know if there's anything else<br>
you need for this.<br>
<br>
[1] <a href="http://fungi.yuggoth.org/tmp/logs.tar" target="_blank">http://fungi.yuggoth.org/tmp/logs.tar</a><br>
<span><font color="#888888">--<br>
Jeremy Stanley<br>
<br>
__________________________________________________________________________<br>
OpenStack Development Mailing List (not for usage questions)<br>
Unsubscribe: <a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe" target="_blank">OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
</font></span></blockquote></div><br></div>
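<br>For context on the policy.json change discussed above: allowing non-admin users to trigger hypervisor-assisted snapshots amounts to relaxing nova's assisted-volume-snapshots policy rules. A sketch of the relaxed form (the exact rule names are an assumption based on nova policy files of that era; review 157707 restores the admin-only default):<br>

```json
{
    "compute_extension:os-assisted-volume-snapshots:create": "",
    "compute_extension:os-assisted-volume-snapshots:delete": ""
}
```

<br>An empty rule string permits any authenticated user to call the API, whereas the default restricts it with "rule:admin_api".<br>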
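<br>As a side note, the OOM-killer activity Jeremy describes can be isolated from a syslog capture with a short filter. A minimal sketch (the syslog excerpt below is fabricated for illustration, and the kernel's exact message wording varies between versions):<br>

```python
import re

# Patterns the kernel typically logs when the OOM killer fires; exact
# wording varies across kernel versions, so treat these as approximate.
OOM_RE = re.compile(r"invoked oom-killer|Out of memory: Kill process|Killed process")

def oom_lines(text):
    """Return the syslog lines that record OOM-killer activity."""
    return [line for line in text.splitlines() if OOM_RE.search(line)]

# Fabricated syslog excerpt for illustration only.
sample = """\
Feb 19 18:47:10 host kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0
Feb 19 18:47:16 host kernel: Out of memory: Kill process 3861 (mysqld) score 412 or sacrifice child
Feb 19 18:47:16 host libvirtd[3753]: missing __com.redhat_reason in disk io error event
"""

for line in oom_lines(sample):
    print(line)
```

<br>Running this against the syslog from the log bundle should surface which processes invoked the killer and which were killed, without scrolling through the full capture.<br>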
</div></div>
<br></blockquote></div><br></div></div>