<p dir="ltr"><br>

On Feb 21, 2015 12:26 AM, "Joe Gordon" <<a href="mailto:joe.gordon0@gmail.com">joe.gordon0@gmail.com</a>> wrote:<br>

><br>

><br>

><br>

> On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty <<a href="mailto:dpkshetty@gmail.com">dpkshetty@gmail.com</a>> wrote:<br>

>><br>

>> Hi Jeremy,<br>

>>   Couldn't find anything strong in the logs to back the reason for OOM.<br>

>> At the time the OOM happens, the mysqld and java processes hold the most RAM, hence the OOM killer selects mysqld (4.7G) to be killed.<br>

>><br>

>> From a glusterfs backend perspective, I haven't found anything suspicious, and we don't have the glusterfs logs (which are typically in /var/log/glusterfs), so we can't delve into glusterfs too much :(<br>

>><br>

>> BharatK (in CC) also tried to re-create the issue in a local VM setup, but it hasn't reproduced yet!<br>

>><br>

>> Having said that, we do know that we started seeing this issue after we enabled the nova-assisted-snapshot tests (by changing nova's policy.json to allow non-admin users to create hypervisor-assisted snapshots). We think that enabling online snapshots might have added to the number of tests and the memory load, and that's the only clue we have as of now!<br>

>><br>

><br>

> It looks like OOM killer hit while qemu was busy and during a ServerRescueTest. Maybe libvirt logs would be useful as well?</p>

<p dir="ltr">Thanks for the data point; I will look at this test to better understand what's happening.</p>

<p dir="ltr">><br>

> And I don't see any tempest tests calling assisted-volume-snapshots</p>

<p dir="ltr">Maybe the run hasn't reached them yet.</p>

<p dir="ltr">Thanks<br>

Deepak<br></p>

<p dir="ltr">><br>

> Also this looks odd: Feb 19 18:47:16 <a href="http://devstack-centos7-rax-iad-916633.slave.openstack.org">devstack-centos7-rax-iad-916633.slave.openstack.org</a> libvirtd[3753]: missing __com.redhat_reason in disk io error event<br>

><br>

>  <br>

>><br>

>> So :<br>

>><br>

>>   1) BharatK  has merged the patch ( <a href="https://review.openstack.org/#/c/157707/">https://review.openstack.org/#/c/157707/</a> ) to revert the policy.json in the glusterfs job. So no more nova-assisted-snap tests.<br>

>><br>

>>   2) We also are increasing the timeout of our job in patch ( <a href="https://review.openstack.org/#/c/157835/1">https://review.openstack.org/#/c/157835/1</a> ) so that we can get a full run without timeouts to do a good analysis of the logs (logs are not posted if the job times out)<br>

>><br>

>> Can you please re-enable our job so that we can confirm whether disabling the online snapshot test cases helps? If it does, that will help us narrow down the issue.<br>

>><br>

>> We also plan to monitor & debug over the weekend hence having the job enabled can help us a lot.<br>

>><br>

>> thanx,<br>

>> deepak<br>

>><br>

>><br>

>> On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley <<a href="mailto:fungi@yuggoth.org">fungi@yuggoth.org</a>> wrote:<br>

>>><br>

>>> On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:<br>

>>> [...]<br>

>>> > For some reason we are seeing the centos7 glusterfs CI job getting<br>

>>> > aborted/killed either by a Java exception or the build getting<br>

>>> > aborted due to timeout.<br>

>>> [...]<br>

>>> > Hoping to root cause this soon and get the cinder-glusterfs CI job<br>

>>> > back online soon.<br>

>>><br>

>>> I manually reran the same commands this job runs on an identical<br>

>>> virtual machine and was able to reproduce some substantial<br>

>>> weirdness.<br>

>>><br>

>>> I temporarily lost remote access to the VM around 108 minutes into<br>

>>> running the job (~17:50 in the logs) and the out of band console<br>

>>> also became unresponsive to carriage returns. The machine's IP<br>

>>> address still responded to ICMP ping, but attempts to open new TCP<br>

>>> sockets to the SSH service never got a protocol version banner back.<br>

>>> After about 10 minutes of that I went out to lunch but left<br>

>>> everything untouched. To my excitement it was up and responding<br>

>>> again when I returned.<br>

>>><br>

>>> It appears from the logs that it runs well past the 120-minute mark<br>

>>> where devstack-gate tries to kill the gate hook for its configured<br>

>>> timeout. Somewhere around 165 minutes in (18:47) you can see the<br>

>>> kernel out-of-memory killer starts to kick in and kill httpd and<br>

>>> mysqld processes according to the syslog. Hopefully this is enough<br>

>>> additional detail to get you a start at finding the root cause so<br>

>>> that we can reenable your job. Let me know if there's anything else<br>

>>> you need for this.<br>

>>><br>

>>> [1] <a href="http://fungi.yuggoth.org/tmp/logs.tar">http://fungi.yuggoth.org/tmp/logs.tar</a><br>

>>> --<br>

>>> Jeremy Stanley<br>

>>><br>

>>> __________________________________________________________________________<br>

>>> OpenStack Development Mailing List (not for usage questions)<br>

>>> Unsubscribe: <a href="http://OpenStack-dev-request@lists.openstack.org?subject:unsubscribe">OpenStack-dev-request@lists.openstack.org?subject:unsubscribe</a><br>

>>> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>


</p>