[openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

Joe Gordon joe.gordon0 at gmail.com
Fri Feb 20 18:49:29 UTC 2015


On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty <dpkshetty at gmail.com> wrote:

> Hi Jeremy,
>   Couldn't find anything conclusive in the logs to explain the OOM.
> At the time the OOM happens, the mysqld and java processes hold the most RAM,
> hence the OOM killer selects mysqld (4.7G) to be killed.
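>
> (For reference, a minimal sketch, not from the actual run, of how to see
> which processes the OOM killer is most likely to pick, by reading oom_score
> and RSS from /proc; it assumes a standard Linux /proc layout.)
>
>     #!/usr/bin/env python
>     # List the top OOM-killer candidates by oom_score, with their RSS.
>     import os
>
>     rows = []
>     for pid in os.listdir('/proc'):
>         if not pid.isdigit():
>             continue
>         try:
>             with open('/proc/%s/oom_score' % pid) as f:
>                 score = int(f.read().strip())
>             name, rss_kb = '?', 0
>             with open('/proc/%s/status' % pid) as f:
>                 for line in f:
>                     if line.startswith('Name:'):
>                         name = line.split()[1]
>                     elif line.startswith('VmRSS:'):
>                         rss_kb = int(line.split()[1])
>             rows.append((score, rss_kb, name, pid))
>         except (IOError, OSError):
>             continue  # process exited or permission denied
>
>     for score, rss_kb, name, pid in sorted(rows, reverse=True)[:10]:
>         print('%-16s pid=%-6s oom_score=%-5d rss=%d MiB'
>               % (name, pid, score, rss_kb // 1024))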
>
> From a glusterfs backend perspective, I haven't found anything suspicious,
> and we don't have the glusterfs logs (which typically live in
> /var/log/glusterfs), so we can't delve into glusterfs too much :(
>
> BharatK (in CC) also tried to re-create the issue in a local VM setup, but
> hasn't been able to reproduce it yet!
>
> Having said that, *we do know* that we started seeing this issue after we
> enabled the nova-assisted-snapshot tests (by changing nova's policy.json
> to allow non-admin users to create hypervisor-assisted snapshots). We think
> that enabling online snapshots may have added to the number of tests and the
> memory load, and that's the only clue we have as of now!
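>
> (As a rough illustration of the policy.json tweak being described, something
> along these lines; the exact rule name is an assumption based on the nova v2
> extension and may differ by release, so check the policy.json shipped with
> your nova.)
>
>     # Flip the assisted-volume-snapshots rule from admin-only to
>     # unrestricted so tempest's non-admin credentials can use it.
>     import json
>
>     path = '/etc/nova/policy.json'
>     with open(path) as f:
>         policy = json.load(f)
>
>     # "" means no restriction (the default is usually "rule:admin_api").
>     policy['compute_extension:os-assisted-volume-snapshots:create'] = ''
>
>     with open(path, 'w') as f:
>         json.dump(policy, f, indent=4)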
>
>
It looks like the OOM killer hit while qemu was busy, during
a ServerRescueTest. Maybe the libvirt logs would be useful as well?

And I don't see any tempest tests calling assisted-volume-snapshots.

Also this looks odd: Feb 19 18:47:16
devstack-centos7-rax-iad-916633.slave.openstack.org libvirtd[3753]: missing
__com.redhat_reason in disk io error event



> So:
>
>   1) BharatK has merged the patch
> (https://review.openstack.org/#/c/157707/) to revert the policy.json change
> in the glusterfs job, so no more nova-assisted-snapshot tests.
>
>   2) We are also increasing the timeout of our job in patch
> (https://review.openstack.org/#/c/157835/1) so that we can get a full run
> without timeouts and do a proper analysis of the logs (the logs are not
> posted if the job times out).
>
> Can you please re-enable our job so that we can confirm whether disabling
> the online snapshot test cases helps? If it does, that will help us narrow
> down the issue.
>
> We also plan to monitor and debug over the weekend, so having the job
> enabled would help us a lot.
>
> thanx,
> deepak
>
>
> On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley <fungi at yuggoth.org>
> wrote:
>
>> On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
>> [...]
>> > For some reason we are seeing the centos7 glusterfs CI job getting
>> > aborted/killed, either by a Java exception or by the build getting
>> > aborted due to a timeout.
>> [...]
>> > Hoping to root-cause this soon and get the cinder-glusterfs CI job
>> > back online.
>>
>> I manually reran the same commands this job runs on an identical
>> virtual machine and was able to reproduce some substantial
>> weirdness.
>>
>> I temporarily lost remote access to the VM around 108 minutes into
>> running the job (~17:50 in the logs) and the out of band console
>> also became unresponsive to carriage returns. The machine's IP
>> address still responded to ICMP ping, but attempts to open new TCP
>> sockets to the SSH service never got a protocol version banner back.
>> After about 10 minutes of that I went out to lunch but left
>> everything untouched. To my excitement it was up and responding
>> again when I returned.
>>
>> It appears from the logs that it runs well past the 120-minute mark
>> where devstack-gate tries to kill the gate hook for its configured
>> timeout. Somewhere around 165 minutes in (18:47) you can see the
>> kernel out-of-memory killer start to kick in and kill httpd and
>> mysqld processes, according to the syslog. Hopefully this is enough
>> additional detail to get you a start at finding the root cause so
>> that we can re-enable your job. Let me know if there's anything else
>> you need for this.
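>>
>> (In case it helps, a minimal sketch for pulling the OOM events out of that
>> syslog once the tarball below is unpacked; the file name is assumed, so
>> adjust it to the actual layout.)
>>
>>     import re
>>
>>     pattern = re.compile(r'oom-killer|Out of memory|Killed process')
>>     with open('logs/syslog.txt') as f:
>>         for line in f:
>>             if pattern.search(line):
>>                 print(line.rstrip())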
>>
>> [1] http://fungi.yuggoth.org/tmp/logs.tar
>> --
>> Jeremy Stanley
>>
>>
>
>
>
>