[openstack-dev] [qa] Smarter timeouts in Tempest?

Matt Riedemann mriedem at linux.vnet.ibm.com
Mon May 19 15:53:38 UTC 2014


I was looking through this timeout bug [1] this morning and was able to 
correlate that, around the time of the image snapshot timeout, ceilometer 
was really hammering the CPU on the host.  There are already threads on 
ceilometer performance and how it needs to be improved for Tempest 
runs, so I don't want to get into that here.

What I'm thinking about is whether there is a way to be smarter about how 
we do timeouts in the tests, rather than just relying on globally 
configured, hard-coded timeouts which are bound to fail intermittently in 
dynamic environments like this.

I'm thinking of something along the lines of keeping track of CPU stats 
at intervals in our waiter loops.  When we reach the configured timeout, 
we calculate the average CPU load/idle; if the idle time falls below some 
threshold, we cut the timeout in half and run the wait loop again, and we 
continue that until the timeout drops to some level that no longer makes 
sense - say, once it falls under a minute.
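To make the idea concrete, here is a rough sketch of such a waiter.  This is not Tempest code - the function name, parameters, and the idle-sampling callable (`sample_idle`, which could be backed by something like psutil or /proc/stat) are all hypothetical, and the thresholds are placeholders:

```python
import time


def adaptive_wait(check_done, timeout, sample_idle, interval=1.0,
                  idle_threshold=20.0, min_timeout=60.0):
    """Wait for check_done() to return True, retrying on busy hosts.

    Poll check_done() every `interval` seconds, sampling host CPU idle
    (percent) via the caller-supplied sample_idle() on each poll.  When
    a timeout window expires without success, look at the average idle:
    if the host was mostly busy (idle below `idle_threshold`), assume
    the timeout was spent starved for CPU, halve the window, and try
    again - until the window drops below `min_timeout`.
    """
    while timeout >= min_timeout:
        idle_samples = []
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if check_done():
                return True
            time.sleep(interval)
            idle_samples.append(sample_idle())
        avg_idle = (sum(idle_samples) / len(idle_samples)
                    if idle_samples else 100.0)
        if avg_idle >= idle_threshold:
            # Host was mostly idle; the resource is genuinely stuck,
            # so a retry is unlikely to help.
            return False
        # Host was hammered; give it another, smaller window.
        timeout /= 2.0
    return False
```

In the failure path the caller would still raise the usual TimeoutException, but only after the loop has decided the host was not simply starved for CPU.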

Are there other ideas here?  My main concern is the number of random 
timeout failures we see in the tests; people then try to fingerprint 
them with elastic-recheck, but the queries are so generic that they are 
not really useful.  We now put the test class and test case in the 
compute test timeout messages, but it's also not very useful to 
fingerprint every individual permutation of test class/case that can 
hit a timeout.

[1] https://bugs.launchpad.net/nova/+bug/1320617

-- 

Thanks,

Matt Riedemann



