[openstack-dev] High failure rate for Nova Tempest suite

Sean Dague sdague at linux.vnet.ibm.com
Thu Jan 10 15:55:28 UTC 2013


On 01/10/2013 08:10 AM, Daniel P. Berrange wrote:
> In the past 2 days the Jenkins failure rate for the Tempest testsuite
> seems to have become incredibly high - seemingly over 50% failure rate.
> While we can trigger a rerun using 'recheck', this causes delays when
> you have run recheck multiple times for many patches.
>
> Does anyone know what's going on with Tempest right now & more importantly
> if there's a way to fix it ?

The crux of this is that Tempest drives the OpenStack CI guests to just
below their tipping point even on a good day, i.e. it creates guests
until memory use is within a few hundred MB of what the environment has
(test nodes run with 4G memory).

What that means is we have to be really careful about cleaning up
resources immediately. Resource deletes (servers, images, volumes,
everything) are async, and we are running so close to the high water
mark that if a delete isn't waited for, we can overrun. When that
happens, nova/cinder throws a scheduler warning that it couldn't start
something, and tests start failing because their prereqs didn't
actually happen.
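
To make "waiting for a delete" concrete, here is a minimal sketch of the
pattern in Python. It is not Tempest's actual code: FakeServersClient,
NotFound, wait_for_deletion and delete_and_wait are hypothetical names
made up for illustration, and the fake client only mimics an API where
a delete returns right away while the resource lingers for a few polls.

```python
import time


class NotFound(Exception):
    """Raised by the (hypothetical) client once a resource is really gone."""


class FakeServersClient:
    """In-memory stand-in for a compute client; not a real Tempest class.

    delete_server() only *requests* deletion. The server stays visible
    for a few more polls, mimicking an asynchronous delete.
    """

    def __init__(self, polls_until_gone=3):
        self._servers = {}
        self._polls_until_gone = polls_until_gone

    def create_server(self, server_id):
        self._servers[server_id] = None  # None means no delete is pending

    def delete_server(self, server_id):
        # Returns immediately; the resource is not freed yet.
        self._servers[server_id] = self._polls_until_gone

    def get_server(self, server_id):
        if server_id not in self._servers:
            raise NotFound(server_id)
        remaining = self._servers[server_id]
        if remaining is None:
            return {"id": server_id, "status": "ACTIVE"}
        if remaining <= 0:
            del self._servers[server_id]
            raise NotFound(server_id)
        self._servers[server_id] = remaining - 1
        return {"id": server_id, "status": "DELETING"}


def wait_for_deletion(client, server_id, timeout=60.0, interval=1.0):
    """Poll until get_server() raises NotFound, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            client.get_server(server_id)
        except NotFound:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "server %s not deleted within %ss" % (server_id, timeout))
        time.sleep(interval)


def delete_and_wait(client, server_id, **kwargs):
    """Request the delete, then block until the resource is really gone."""
    client.delete_server(server_id)
    wait_for_deletion(client, server_id, **kwargs)
```

A test's cleanup would call delete_and_wait() instead of a bare delete,
so the next test only starts creating guests once the memory is back.
The timeout keeps a stuck delete from hanging the whole run.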

On a busy zuul day, especially when large patch series go in (as they
have over the last two days), CI runs slower, and new fail points emerge
in places where we were doing the deletes async.

Additional eyes and hands on Tempest would help clean this up more. As
always, patches welcome.

	-Sean

-- 
Sean Dague
IBM Linux Technology Center
email: sdague at linux.vnet.ibm.com
alt-email: sldague at us.ibm.com



