[openstack-dev] High failure rate for Nova Tempest suite
Sean Dague
sdague at linux.vnet.ibm.com
Thu Jan 10 15:55:28 UTC 2013
On 01/10/2013 08:10 AM, Daniel P. Berrange wrote:
> In the past 2 days the Jenkins failure rate for the Tempest testsuite
> seems to have become incredibly high - seemingly over 50% failure rate.
> While we can trigger a rerun using 'recheck', this causes delays when
> you have run recheck multiple times for many patches.
>
> Does anyone know what's going on with Tempest right now & more importantly
> if there's a way to fix it ?
The crux of this is that Tempest drives the OpenStack CI guests to just
below their tipping point even on a good day, i.e. it creates guests to
within a few hundred MB of the memory available in the environment (test
nodes run with 4G memory).
What that means is we have to be really careful about cleaning up
resources immediately. Resource delete (servers, images, volumes,
everything) is async, and we are running so close to the high water mark
that if a delete isn't waited for, we can overrun. When that happens,
nova/cinder throws a scheduler warning that it couldn't start something,
and tests start failing because their prereqs didn't actually happen.
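To make "waiting for a delete" concrete, here is a minimal sketch of the
polling pattern. It is not Tempest's actual code: wait_for_deletion, the
fetch callable, and the NotFound exception are hypothetical names, and it
assumes the client raises a not-found error once the resource is really
gone.

```python
import time


class NotFound(Exception):
    """Hypothetical stand-in for the error a client raises on a 404."""


def wait_for_deletion(fetch, timeout=60.0, interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch() until it raises NotFound.

    fetch is any zero-argument callable that looks the resource up
    (assumption: it raises NotFound once the delete has completed).
    Raises TimeoutError if the resource is still there at the deadline.
    """
    deadline = clock() + timeout
    while True:
        try:
            fetch()
        except NotFound:
            return  # resource is gone, so its memory/disk is free again
        if clock() >= deadline:
            raise TimeoutError(
                "resource still present after %s seconds" % timeout)
        sleep(interval)
```

A test teardown would issue the delete and then call
wait_for_deletion(lambda: client.get(resource_id)), where client.get is
again a placeholder, before the next test is allowed to allocate
anything.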
On a busy Zuul day, especially when large patch series go in (as they
have over the last two days), CI runs slower, and new failure points
emerge in places where we were doing the deletes async.
Additional eyes and hands on Tempest would help clean this up more; as
always, patches welcome.
-Sean
--
Sean Dague
IBM Linux Technology Center
email: sdague at linux.vnet.ibm.com
alt-email: sldague at us.ibm.com