[openstack-dev] [devstack] [Cinder-GlusterFS CI] centos7 gate job abrupt failures

Deepak Shetty dpkshetty at gmail.com
Fri Feb 20 15:29:31 UTC 2015


Hi Jeremy,
  I couldn't find anything conclusive in the logs to explain the OOM.
At the time the OOM happens, the mysqld and java processes are holding the
most RAM, hence the OOM killer selects mysqld (4.7G) to be killed.
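
For anyone who wants to double-check the victim selection on a live node,
something along these lines works (a minimal sketch, not part of the job
itself; it just reads the per-process oom_score the kernel exposes under
/proc, and the highest-scoring process is the likeliest victim):

    #!/usr/bin/env python
    # Sketch: list the processes the kernel's OOM killer would pick first,
    # by reading /proc/<pid>/oom_score. A 4.7G mysqld naturally ends up
    # near the top of this list on these nodes.
    import os

    def oom_candidates(top_n=10):
        scores = []
        for pid in filter(str.isdigit, os.listdir('/proc')):
            try:
                with open('/proc/%s/oom_score' % pid) as f:
                    score = int(f.read())
                with open('/proc/%s/comm' % pid) as f:
                    comm = f.read().strip()
            except IOError:
                continue  # process exited while we were scanning
            scores.append((score, pid, comm))
        return sorted(scores, reverse=True)[:top_n]

    if __name__ == '__main__':
        for score, pid, comm in oom_candidates():
            print('%6d  %6s  %s' % (score, pid, comm))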

From a GlusterFS backend perspective, I haven't found anything suspicious,
and we don't have the GlusterFS logs (which typically live under
/var/log/glusterfs), so we can't dig into GlusterFS too much :(

BharatK (in CC) also tried to re-create the issue in a local VM setup, but
hasn't been able to reproduce it yet!

Having said that, *we do know* that we started seeing this issue after we
enabled the nova-assisted-snapshot tests (by changing nova's policy.json to
allow non-admin users to create hypervisor-assisted snapshots). We suspect
that enabling online snapshots increased the number of tests and the memory
load; that's the only clue we have as of now!
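
For context, the policy.json tweak was roughly of this shape (a sketch only;
the exact rule names are from memory and should be checked against nova's
shipped policy.json):

    # Sketch: relax the assisted-snapshot rules from admin-only to any user.
    # Rule names are an assumption here; verify against nova's policy.json.
    # Needs to run as a user that can write /etc/nova/policy.json.
    import json

    POLICY = '/etc/nova/policy.json'

    with open(POLICY) as f:
        policy = json.load(f)

    # An empty rule ("") means "no restriction" in nova's policy engine,
    # i.e. non-admin users may call the API.
    policy['compute_extension:os-assisted-volume-snapshots:create'] = ''
    policy['compute_extension:os-assisted-volume-snapshots:delete'] = ''

    with open(POLICY, 'w') as f:
        json.dump(policy, f, indent=4, sort_keys=True)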

So:

  1) BharatK has merged the patch
(https://review.openstack.org/#/c/157707/) to revert the policy.json change
in the glusterfs job, so no more nova-assisted-snapshot tests will run.

  2) We are also increasing the timeout of our job in patch
(https://review.openstack.org/#/c/157835/1) so that we can get a full run
without hitting the timeout and do a proper analysis of the logs (logs are
not posted if the job times out).

Can you please re-enable our job, so that we can confirm whether disabling
the online snapshot test cases helps? If it does, that will help us narrow
down the issue.

We also plan to monitor and debug over the weekend, so having the job
enabled would help us a lot.

thanx,
deepak


On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley <fungi at yuggoth.org> wrote:

> On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
> [...]
> > For some reason we are seeing the centos7 glusterfs CI job getting
> > aborted/ killed either by Java exception or the build getting
> > aborted due to timeout.
> [...]
> > Hoping to root cause this soon and get the cinder-glusterfs CI job
> > back online soon.
>
> I manually reran the same commands this job runs on an identical
> virtual machine and was able to reproduce some substantial
> weirdness.
>
> I temporarily lost remote access to the VM around 108 minutes into
> running the job (~17:50 in the logs) and the out of band console
> also became unresponsive to carriage returns. The machine's IP
> address still responded to ICMP ping, but attempts to open new TCP
> sockets to the SSH service never got a protocol version banner back.
> After about 10 minutes of that I went out to lunch but left
> everything untouched. To my excitement it was up and responding
> again when I returned.
>
> It appears from the logs that it runs well past the 120-minute mark
> where devstack-gate tries to kill the gate hook for its configured
> timeout. Somewhere around 165 minutes in (18:47) you can see the
> kernel out-of-memory killer starts to kick in and kill httpd and
> mysqld processes according to the syslog. Hopefully this is enough
> additional detail to get you a start at finding the root cause so
> that we can reenable your job. Let me know if there's anything else
> you need for this.
>
> [1] http://fungi.yuggoth.org/tmp/logs.tar
> --
> Jeremy Stanley