[openstack-dev] [nova][neutron][qa] top gate bugs: a plea for help

David Kranz dkranz at redhat.com
Sun Jan 12 21:01:59 UTC 2014

On 01/11/2014 05:06 PM, Russell Bryant wrote:
> On 01/11/2014 11:38 AM, Sean Dague wrote:
>>> 3) (still testing) https://review.openstack.org/#/c/65805/
>>> Right now when tempest runs in the devstack-gate jobs, it runs with
>>> concurrency=4 (run 4 tests at once).  Unfortunately, it appears that
>>> this maxes out the deployment and results in timeouts (usually network
>>> related).
>>> This patch changes tempest concurrency to 2 instead of 4.  The initial
>>> results are quite promising.  The tests have been passing reliably so
>>> far, but we're going to continue to recheck this for a while longer for
>>> more data.
>>> One very interesting observation on this came from Jim where he said "A
>>> quick glance suggests 1.2x -- 1.4x change in runtime."  If the
>>> deployment were *not* being maxed out, we would expect this change to
>>> result in much closer to a 2x runtime increase.
>> We could also address this by locally turning up timeouts on operations
>> that are timing out. Which would let those things take the time they need.
>> Before dropping the concurrency I'd really like to make sure we can
>> point to specific fails that we think will go away. There was a lot of
>> speculation around nova-network, however the nova-network timeout errors
>> only pop up on elastic search on large-ops jobs, not normal tempest
>> jobs. Definitely making OpenStack more idle will make more tests pass.
>> The Neutron team has experienced that.
>> It would be a ton better if we could actually feed back a 503 with a
>> retry time (which I realize is a ton of work).
>> Because if we decide we're now always pinned to only 2way, we have to
>> start doing some major rethinking on our test strategy, as we'll be way
>> outside the soft 45min time budget we've been trying to operate on. We'd
>> actually been planning on going up to 8way, but were waiting for some
>> issues to get fixed before we did that. It would sort of immediately put
>> a moratorium on new tests. If that's what we need to do, that's what we
>> need to do, but we should talk it through.
> I can try to write up some detailed analysis on a few failures next week
> to help justify it, but FWIW, when I was looking at this last week, I felt
> like making this change was going to fix a lot more than the
> nova-network timeout errors.
> If we can already tell this is going to improve reliability, both when
> using nova-network and neutron, then I think that should be enough to
> justify it.  Taking longer seems acceptable if that comes with a more
> acceptable pass rate.
> Right now I'd like to see us set concurrency=2 while we work on the more
> difficult performance improvements to both neutron and nova-network, and
> we can turn it back up later on once we're able to demonstrate that it
> passes reliably without failures with a root cause of test load being
> too high.
I have to agree with Russell here. The way we run Tempest has morphed it
from a simple functional test suite into one that also has stress and
performance test characteristics. This is great because it has found a
lot of bugs, but there is obviously a huge downside to having such
characteristics in the gate at the current failure rate. Still, we do
not have to choose between acceptable performance/stress levels and
acceptable run time. If we cut the concurrency to 2 and split each full
tempest job into two jobs, each running "half" the tests (split by
expected execution time), then we can have both until we are able to
crank the concurrency up to 8 or beyond.
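
As a rough sketch of what splitting by expected execution time could
look like, assuming we have per-test timings from a previous run: the
function name, test names and numbers below are all made up for
illustration, and this is a plain greedy longest-first partition, not
anything the gate or testr does today.

```python
import heapq


def split_by_expected_time(durations, n_jobs=2):
    """Greedily partition tests into n_jobs buckets of similar total time.

    durations: dict mapping test name -> expected runtime in seconds
    (hypothetical data, e.g. collected from an earlier run).
    Returns (buckets, totals), where buckets[i] lists the test names for
    job i and totals[i] is that job's expected runtime.
    """
    # Min-heap of (accumulated_seconds, bucket_index): the lightest
    # bucket is always popped first.
    heap = [(0.0, i) for i in range(n_jobs)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(n_jobs)]

    # Place the longest tests first; ties are broken by name so the
    # result is deterministic.
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, seconds in ordered:
        total, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (total + seconds, idx))

    totals = [0.0] * n_jobs
    for total, idx in heap:
        totals[idx] = total
    return buckets, totals


if __name__ == "__main__":
    # Made-up timings, in seconds.
    example = {"test_a": 300, "test_b": 200, "test_c": 200, "test_d": 100}
    jobs, expected = split_by_expected_time(example)
    for job, seconds in zip(jobs, expected):
        print(seconds, job)
```

Each resulting job would then run at concurrency=2, so the wall-clock
time per job stays near what a full run at 4 costs today, as long as the
two halves come out reasonably balanced.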

