[openstack-dev] [nova][neutron] top gate bugs: a plea for help

Russell Bryant rbryant at redhat.com
Sat Jan 11 22:06:11 UTC 2014


On 01/11/2014 11:38 AM, Sean Dague wrote:
>> 3) (still testing) https://review.openstack.org/#/c/65805/
>>
>> Right now when tempest runs in the devstack-gate jobs, it runs with
>> concurrency=4 (run 4 tests at once).  Unfortunately, it appears that
>> this maxes out the deployment and results in timeouts (usually network
>> related).
>>
>> This patch changes tempest concurrency to 2 instead of 4.  The initial
>> results are quite promising.  The tests have been passing reliably so
>> far, but we're going to continue to recheck this for a while longer for
>> more data.
>>
>> One very interesting observation on this came from Jim where he said "A
>> quick glance suggests 1.2x -- 1.4x change in runtime."  If the
>> deployment were *not* being maxed out, we would expect this change to
>> result in much closer to a 2x runtime increase.
> 
> We could also address this by locally turning up timeouts on the
> operations that are timing out, which would let them take the time they
> need.
> 
> Before dropping the concurrency I'd really like to make sure we can
> point to specific fails that we think will go away. There was a lot of
> speculation around nova-network, however the nova-network timeout errors
> only pop up in Elasticsearch on large-ops jobs, not normal tempest
> jobs. Definitely making OpenStack more idle will make more tests pass.
> The Neutron team has experienced that.
> 
> It would be a ton better if we could actually feed back a 503 with a
> retry time (which I realize is a ton of work).
> 
> Because if we decide we're now always pinned to only 2-way, we have to
> start doing some major rethinking on our test strategy, as we'll be way
> outside the soft 45-minute time budget we've been trying to operate on. We'd
> actually been planning on going up to 8-way, but were waiting for some
> issues to get fixed before we did that. It would sort of immediately put
> a moratorium on new tests. If that's what we need to do, that's what we
> need to do, but we should talk it through.
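
For reference, the "503 with a retry time" idea in the quoted text above
might look roughly like this from the client side. This is only a sketch:
send_request is a hypothetical callable returning (status, headers, body),
not an existing Nova, Neutron, or tempest API, and it assumes the server
sends the standard Retry-After header in its numeric (seconds) form.

```python
import time


def retry_after_seconds(headers, default=1.0):
    """Read a delay in seconds from a Retry-After header (numeric form only).

    Assumes headers is a plain dict keyed by the exact name "Retry-After".
    """
    value = headers.get("Retry-After")
    try:
        return max(0.0, float(value))
    except (TypeError, ValueError):
        # Header missing or not a number (e.g. an HTTP date): use the default.
        return default


def call_with_backoff(send_request, max_attempts=5, sleep=time.sleep):
    """Call send_request() until it stops returning 503 or attempts run out.

    send_request is a hypothetical callable returning (status, headers, body).
    """
    for attempt in range(max_attempts):
        status, headers, body = send_request()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            # Wait as long as the server asked before trying again.
            sleep(retry_after_seconds(headers))
    return status, body
```

The point of the design is that the server, which knows how loaded it is,
picks the wait time instead of the client guessing at a fixed timeout.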

I can try to write up some detailed analysis on a few failures next week
to help justify it, but FWIW, when I was looking at this last week, I felt
like making this change was going to fix a lot more than the
nova-network timeout errors.

If we can already tell this is going to improve reliability, both when
using nova-network and neutron, then I think that should be enough to
justify it.  Taking longer seems acceptable if that comes with a more
acceptable pass rate.

Right now I'd like to see us set concurrency=2 while we work on the more
difficult performance improvements to both neutron and nova-network.  We
can turn it back up later, once we're able to demonstrate that the tests
pass reliably, without failures whose root cause is excessive test load.
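
As a rough sanity check on the runtime observation quoted earlier (a
1.2x -- 1.4x change instead of 2x), the arithmetic can be sketched as
below. The function name and the "efficiency" framing are illustrative
assumptions, not something measured by the gate; the only inputs taken
from this thread are the 4 -> 2 concurrency change and the quoted runtime
ratios.

```python
def scaling_efficiency(runtime_ratio, high_concurrency=4, low_concurrency=2):
    """Fraction of the ideal speedup that the higher concurrency delivers.

    runtime_ratio is (runtime at low concurrency) / (runtime at high
    concurrency).  If the deployment were not saturated, halving the
    concurrency would roughly double the runtime, giving a value near 1.0.
    """
    ideal_ratio = high_concurrency / low_concurrency
    return runtime_ratio / ideal_ratio


for observed in (1.2, 1.4):
    print(f"{observed}x slower at 2 workers -> "
          f"{scaling_efficiency(observed):.0%} of ideal scaling at 4 workers")
```

A value well below 100% is consistent with the deployment already being
maxed out at concurrency=4.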

>> 5) https://review.openstack.org/#/c/65989/
>>
>> This patch isn't a candidate for merging, but was written to test the
>> theory that by updating nova-network to use conductor instead of direct
>> database access, nova-network will be able to do work in parallel better
>> than it does today, just as we have observed with nova-compute.
>>
>> Dan's initial test results from this are **very** promising.  Initial
>> testing showed a 20% speedup in runtime and a 33% decrease in CPU
>> consumption by nova-network.
>>
>> Doing this properly will not be quick, but I'm hopeful that we can
>> complete it by the Icehouse release.  We will need to convert
>> nova-network to use Nova's object model.  Much of this work amounts to
>> catching nova-network up on changes we've made in the rest of the tree
>> but skipped for nova-network because it has been in a freeze.
> 
> I'm a huge +1 on fixing this in nova-network.

Of course.  This is just a bit of a longer term effort.

-- 
Russell Bryant


