Corey,

I think we should do more investigation before applying any "hot" patches. For example, I looked at several failures today and honestly there is no way to find out the reasons.

I believe we are not copying logs (https://github.com/openstack/magnum/blob/master/magnum/tests/functional/python_client_base.py#L163) on test failure: we register the handler in setUp (https://github.com/openstack/magnum/blob/master/magnum/tests/functional/python_client_base.py#L244), but the Swarm tests create the bay in setUpClass (https://github.com/openstack/magnum/blob/master/magnum/tests/functional/swarm/test_swarm_python_client.py#L48), which is called before setUp. So there is no way to see any logs from the VM.

Sorry, I cannot submit a patch or debug this myself, because I will only get my laptop back on Tuesday ):
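To illustrate the ordering problem, here is a rough sketch of the kind of change I mean -- untested since I am without my laptop, and _create_bay() / copy_logs() are placeholders rather than the real helpers in python_client_base.py:

import testtools


class SwarmFunctionalTest(testtools.TestCase):
    """Sketch only: mirrors the setUpClass/setUp ordering, not the real tests."""

    @classmethod
    def setUpClass(cls):
        super(SwarmFunctionalTest, cls).setUpClass()
        try:
            cls._create_bay()   # the Swarm tests build the bay here...
        except Exception:
            cls.copy_logs()     # ...so the log copy has to happen here too;
            raise               # the handler added in setUp() never runs for this

    def setUp(self):
        super(SwarmFunctionalTest, self).setUp()
        # Roughly what we do today: the failure handler is only registered per
        # test, after setUpClass() has already passed or blown up.
        self.addOnException(lambda exc_info: self.copy_logs())

    @classmethod
    def _create_bay(cls):
        """Placeholder for the real bay-creation call."""

    @classmethod
    def copy_logs(cls):
        """Placeholder for the real copy-logs helper."""

    def test_placeholder(self):
        self.assertTrue(True)

Until something like that is in place, any failure during bay creation in the Swarm tests will keep leaving us with no VM logs at all.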
---
Egor

----------------------------------------
From: Corey O'Brien <coreypobrien@gmail.com>
To: OpenStack Development Mailing List (not for usage questions) <openstack-dev@lists.openstack.org>
Sent: Thursday, February 4, 2016 9:03 PM
Subject: [openstack-dev] [Magnum] gate issues

So as we're all aware, the gate is a mess right now. I wanted to sum up some of the issues so we can figure out solutions.

1. The functional-api job sometimes fails because bays time out building after 1 hour. The logs look something like this:

magnum.tests.functional.api.v1.test_bay.BayTest.test_create_list_and_delete_bays [3733.626171s] ... FAILED

I can reproduce this hang on my devstack with etcdctl 2.0.10, as described in this bug (https://bugs.launchpad.net/magnum/+bug/1541105), but apparently either my fix to use 2.2.5 (https://review.openstack.org/#/c/275994/) is incomplete or there is another intermittent problem, because it happened again even with that fix: http://logs.openstack.org/94/275994/1/check/gate-functional-dsvm-magnum-api/32aacb1/console.html

2. The k8s job has some sort of intermittent hang as well that causes a symptom similar to the Swarm one: https://bugs.launchpad.net/magnum/+bug/1541964

3. When the functional-api job runs, it frequently destroys the VM, causing the Jenkins slave agent to die. Example: http://logs.openstack.org/03/275003/6/check/gate-functional-dsvm-magnum-api/a9a0eb9//console.html
When this happens, zuul re-queues a new build from the start on a new VM. This can happen many times in a row before the job completes.
I chatted with openstack-infra about this, and after taking a look at one of the VMs, it looks like memory over-consumption leading to thrashing was a possible culprit. The sshd daemon was also dead, but the console showed things like "INFO: task kswapd0:77 blocked for more than 120 seconds". A cursory glance at some of the jobs also seems to indicate that this doesn't happen on the RAX VMs, which have swap devices, unlike the OVH VMs.
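To make comparing slaves from the two providers easier, something along these lines (just a sketch, not wired into any job) could be run on a wedged VM to show how much memory is left and whether it has any swap at all:

# Diagnostic sketch only: dump memory and swap figures from /proc/meminfo so
# RAX and OVH slaves can be compared. The fields used here are reported in kB.


def read_meminfo():
    """Return /proc/meminfo as a dict mapping field name to its kB value."""
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key.strip()] = int(value.strip().split()[0])
    return info


if __name__ == '__main__':
    info = read_meminfo()
    for key in ('MemTotal', 'MemFree', 'SwapTotal', 'SwapFree'):
        print('%s: %d MB' % (key, info.get(key, 0) // 1024))
    if not info.get('SwapTotal'):
        print('No swap configured on this VM')

Running that on a RAX slave next to an OVH one would confirm whether the missing swap really is the difference.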
4. In general, even when things work, the gate is really slow. The sequential master-then-node build process, combined with underpowered VMs, makes bay builds take 25-30 minutes when they do succeed. Since we're already close to tipping over a VM, we run the functional tests with concurrency=1, so 2 bay builds eat up almost the entire allotted devstack testing time (generally about 75 minutes of actual test time available, it seems).

Corey

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev