[openstack-dev] [Magnum] gate issues
Guz Egor
guz_egor at yahoo.com
Fri Feb 5 07:44:05 UTC 2016
Corey, I think we should do more investigation before applying any "hot" patches. E.g. I looked at several failures today and honestly there is no way to find out the reasons. I believe we are not copying logs (https://github.com/openstack/magnum/blob/master/magnum/tests/functional/python_client_base.py#L163) on test failure: we register the handler in setUp (https://github.com/openstack/magnum/blob/master/magnum/tests/functional/python_client_base.py#L244), but the Swarm tests create the bay in setUpClass (https://github.com/openstack/magnum/blob/master/magnum/tests/functional/swarm/test_swarm_python_client.py#L48), which is called before setUp. So there is no way to see any logs from the VM.
Sorry, I cannot submit a patch or debug this myself because I will get my laptop back only on Tue ):
--- Egor
From: Corey O'Brien <coreypobrien at gmail.com>
To: OpenStack Development Mailing List (not for usage questions) <openstack-dev at lists.openstack.org>
Sent: Thursday, February 4, 2016 9:03 PM
Subject: [openstack-dev] [Magnum] gate issues
So as we're all aware, the gate is a mess right now. I wanted to sum up some of the issues so we can figure out solutions.
1. The functional-api job sometimes fails because bays time out building after 1 hour. The logs look something like this:
magnum.tests.functional.api.v1.test_bay.BayTest.test_create_list_and_delete_bays [3733.626171s] ... FAILED
I can reproduce this hang on my devstack with etcdctl 2.0.10 as described in this bug (https://bugs.launchpad.net/magnum/+bug/1541105), but apparently either my fix of using 2.2.5 (https://review.openstack.org/#/c/275994/) is incomplete or there is another intermittent problem, because it happened again even with that fix: (http://logs.openstack.org/94/275994/1/check/gate-functional-dsvm-magnum-api/32aacb1/console.html)
2. The k8s job has some sort of intermittent hang as well that causes a similar symptom as with swarm. https://bugs.launchpad.net/magnum/+bug/1541964
3. When the functional-api job runs, it frequently destroys the VM, causing the jenkins slave agent to die. Example: http://logs.openstack.org/03/275003/6/check/gate-functional-dsvm-magnum-api/a9a0eb9//console.html
When this happens, zuul re-queues a new build from the start on a new VM. This can happen many times in a row before the job completes.
I chatted with openstack-infra about this and, after taking a look at one of the VMs, it looks like memory overconsumption leading to thrashing was a possible culprit. The sshd daemon was also dead, but the console showed things like "INFO: task kswapd0:77 blocked for more than 120 seconds". From a cursory glance and from following some of the jobs, this doesn't seem to happen on RAX VMs, which, unlike the OVH VMs, have swap devices.
4. In general, even when things work, the gate is really slow. The sequential master-then-node build process in combination with underpowered VMs makes bay builds take 25-30 minutes when they do succeed. Since we're already close to tipping over a VM, we run functional tests with concurrency=1, so 2 bay builds take up almost the entire allotted devstack testing time (it seems that generally about 75 minutes of actual test time is available).
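As a rough check on the numbers in item 4, the arithmetic can be sketched as below. The function name is made up for the illustration; the 75-minute budget and the 25-30 minute build times are the approximate figures quoted above, not measured values.

```python
def remaining_minutes(budget_min, builds, minutes_per_build):
    """Minutes left in the test budget after running the builds one at a time."""
    return budget_min - builds * minutes_per_build


# Approximate figures from the message above.
BUDGET_MIN = 75
BUILDS = 2

best_case = remaining_minutes(BUDGET_MIN, BUILDS, 25)
worst_case = remaining_minutes(BUDGET_MIN, BUILDS, 30)

if __name__ == "__main__":
    print("minutes left, best case:", best_case)
    print("minutes left, worst case:", worst_case)
```

With concurrency=1, two sequential builds leave somewhere between 15 and 25 minutes for everything else in the job.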
Corey