[openstack-dev] Top Gate Resetting issues that need attention - Tuesday morning update

Sean Dague sean at dague.net
Tue Jan 21 12:14:25 UTC 2014


A brief update on where we stand with the gate (still not great):
 - the gate queue is currently 126 deep
 - the patch at the top of the queue entered 51 hrs ago

Bug 1270680 - v3 extensions api inherently racey wrt instances - a
patch has landed. It seems to have helped, though the exception is
still showing up quite a bit, so we don't know yet whether this is
100% fixed.

 - Thanks to Russell, Dan Smith, and Chris Yeoh for diving in here.

Bug 1270608 - a revert of a cinder patch has been proposed to address
this; the right approach is still under discussion.

 - Thanks to Matt Riedeman and John Griffith for diving in on this one.


Last night we merged https://review.openstack.org/#/c/67991/ - which
turns off the scenario test that appeared to be triggering volume
failures at a very high rate (> 25% per run). Across the 4 - 6 devstack
jobs each patch runs, that put the chance of any given patch hitting it
near 100%. Turning the test off seems to have helped, and we have
merged a few real changes since then (19 merges in the last 12 hrs on
openstack/openstack).
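
For anyone curious where the "nearing 100%" figure comes from, here is
a back-of-the-envelope sketch, assuming the jobs fail independently and
a flat 25% per-job hit rate (both simplifying assumptions):

  # Probability that at least one of num_jobs devstack jobs trips the
  # racy volume test, given a per-job failure rate.
  def chance_of_reset(per_job_fail_rate, num_jobs):
      return 1.0 - (1.0 - per_job_fail_rate) ** num_jobs

  for jobs in (4, 5, 6):
      print("%d jobs: %.0f%% chance of hitting the failure" %
            (jobs, 100 * chance_of_reset(0.25, jobs)))

That works out to roughly 68% / 76% / 82% for 4 / 5 / 6 jobs, and
stacking the other background races on top pushes a patch's chance of
getting through clean toward zero.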

Resets still seem to be largely focused around volumes-related
failures, so our ability to allocate volumes reliably is now highly
suspect. These failures may be showing up in a few different ways, and
continued digging there would be great. The other option might be to
keep turning off volumes tests until we start passing again.
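
For whoever picks up the volumes test work, here is a minimal sketch of
what temporarily turning off a test can look like, using the stock
unittest skip decorator; the class and test names are hypothetical
placeholders, not the actual test disabled in 67991:

  import unittest

  class TestRacyVolumeScenario(unittest.TestCase):
      # Hypothetical stand-in for a racy volumes scenario test; the real
      # change (https://review.openstack.org/#/c/67991/) disabled a
      # specific tempest scenario test, this only illustrates the
      # mechanism.

      @unittest.skip("Skipped until bug 1270608 is resolved: volume "
                     "allocation failures are resetting the gate.")
      def test_boot_instance_from_volume(self):
          # The racy test body stays in place so it is trivial to
          # re-enable once volume allocation is reliable again.
          pass

Keeping the bug number in the skip message makes it easy to grep for
and re-enable these tests once the underlying bug is closed.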


Additional efforts to help:

Nodepool latencies
 - Jeremy, Robert, and Jim have been looking at how to get nodes back
into nodepool faster, which may help the gate recover more quickly
after resets.


Thanks to everyone diving in right now. This is also a call for more
folks to help work through these issues and get us back to normal.

On 01/20/2014 09:33 AM, Sean Dague wrote:
> Anyone that's looked at the gate this morning... knows things aren't
> good. It turns out that a few new races got into OpenStack last week,
> which are causing a ton of pain, and have put us dramatically over the edge.
> 
> We've not tracked down all of them, but 2 that are quite important to
> address are:
> 
>  - Bug 1270680 - v3 extensions api inherently racey wrt instances
> 
>  - Bug 1270608 - n-cpu 'iSCSI device not found' log causes
> gate-tempest-dsvm-*-full to fail
> 
> Both can be seen as very new issues here -
> http://status.openstack.org/elastic-recheck/
> 
> We've got a short-term workaround for 1270680 which we're going to take
> into the gate now (and fix it properly later).
> 
> 1270608 is still in desperate need of fixing.
> 
> 
> Neutron is in a whole other level of pain. Over the weekend I found the
> isolated jobs are in a 70% fail state, which means the overall chance
> of success for Neutron / Neutron client patches is < 5%. As such I'd
> suggest a moratorium on them going into the gate at this point, as they
> are basically guaranteed to fail.
> 
> 	-Sean
> 


-- 
Sean Dague
Samsung Research America
sean at dague.net / sean.dague at samsung.com
http://dague.net
