[openstack-dev] [qa] Smarter timeouts in Tempest?
Joe Gordon
joe.gordon0 at gmail.com
Tue May 20 03:24:46 UTC 2014
I see this as the result of two unrelated issues.
On Mon, May 19, 2014 at 8:53 AM, Matt Riedemann
<mriedem at linux.vnet.ibm.com> wrote:
> I was looking through this timeout bug [1] this morning and am able to
> correlate that around the time of the image snapshot timeout, ceilometer
> was really hammering CPU on the host. There are already threads on
> ceilometer performance and how that needs to be improved for Tempest runs
> so I don't want to get into that here.
>
> What I'm thinking about is if there is a way to be smarter about how we do
> timeouts in the tests, rather than just rely on globally configured
> hard-coded timeouts which are bound to fail intermittently in dynamic
> environments like this.
>
> I'm thinking something along the lines of keeping track of CPU stats on
> intervals in our waiter loops, then when we reach our configured timeout,
> calculate the average CPU load/idle and if it falls below some threshold,
> we cut the timeout in half and redo the timeout loop - and we continue that
> until our timeout reaches some level that no longer makes sense, such as
> once it drops below a minute.
>
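For concreteness, here is a minimal sketch of the waiter loop described above. It is not Tempest code: the function name, the parameters, the default values, and the read_cpu_idle callable are all hypothetical, and it assumes that "falls below some threshold" refers to average idle CPU, so a starved host earns another, shorter wait.

```python
import time


def wait_with_cpu_backoff(is_done, read_cpu_idle, timeout=300.0, interval=1.0,
                          idle_threshold=10.0, min_timeout=60.0,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll is_done(); on timeout, retry with half the timeout if CPU was starved.

    All names and defaults here are illustrative assumptions.
    read_cpu_idle() is expected to return the current idle CPU percentage.
    Returns True if is_done() became true, False on a final timeout.
    """
    while True:
        samples = []
        deadline = clock() + timeout
        while clock() < deadline:
            if is_done():
                return True
            # Record CPU idle on every polling interval.
            samples.append(read_cpu_idle())
            sleep(interval)
        avg_idle = sum(samples) / len(samples) if samples else 100.0
        if avg_idle >= idle_threshold:
            # The host had spare CPU, so treat this as a genuine timeout.
            return False
        # The host was starved: wait again, but only half as long.
        timeout /= 2
        if timeout < min_timeout:
            # A shorter wait no longer makes sense; give up.
            return False
```

With the defaults shown, a starved host would get waits of 300, 150, and 75 seconds before the loop gives up. The clock and sleep parameters exist only so the loop can be exercised with a fake clock.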
1. Our test environment is being pushed to its limits. In the past we have
seen things fail in strange ways when CPU idle % drops below 10%. To
address this we can do a few things:
* Better track when our test environment has low idle CPU (post
processing on gate jobs?)
* Make gate jobs use less CPU (ceilometer issues, etc.).
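As a rough illustration of the post-processing idea, the sketch below flags periods where idle CPU stayed under a threshold. The 10% default comes from the observation above; the function name, the (timestamp, idle_percent) sample format, and the min_samples parameter are assumptions for illustration and do not correspond to any existing gate tooling.

```python
def low_idle_periods(samples, threshold=10.0, min_samples=3):
    """Return (start, end) timestamp pairs where idle CPU stayed below threshold.

    samples: iterable of (timestamp, idle_percent) tuples in time order
             (a hypothetical format, not a real log schema).
    min_samples: shortest run of consecutive low samples worth reporting.
    """
    periods = []
    run = []
    for ts, idle in samples:
        if idle < threshold:
            run.append(ts)
            continue
        # The low-idle run (if any) ended at the previous sample.
        if len(run) >= min_samples:
            periods.append((run[0], run[-1]))
        run = []
    # Handle a run that lasts until the end of the data.
    if len(run) >= min_samples:
        periods.append((run[0], run[-1]))
    return periods
```

The reported periods could then be compared against the timestamps of test timeouts to see whether a failure coincided with a starved host.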
>
> Are there other ideas here? My main concern is the number of random
> timeout failures we see in the tests and then people are trying to
> fingerprint them with elastic-recheck but the queries are so generic they
> are not really useful. We now put the test class and test case in the
> compute test timeout messages, but it's also not very useful to fingerprint
> every individual permutation of test class/case that we can hit a timeout
> in.
>
2. OpenStack is hard to debug. If we, the developers, cannot figure out
what is failing, then imagine how hard debugging is for non-OpenStack
developers. When we see these types of issues, we should work on making
the logs more useful so we can create better elastic-recheck fingerprints.
>
> [1] https://bugs.launchpad.net/nova/+bug/1320617
>
> --
>
> Thanks,
>
> Matt Riedemann
>