[openstack-dev] [nova][neutron] top gate bugs: a plea for help

Sean Dague sean at dague.net
Sat Jan 11 16:38:28 UTC 2014

First, thanks a ton for diving in on all this, Russell. The big push by 
the Nova team recently is really helpful.

On 01/11/2014 09:57 AM, Russell Bryant wrote:
> On 01/09/2014 04:16 PM, Russell Bryant wrote:
>> On 01/08/2014 05:53 PM, Joe Gordon wrote:
>>> Hi All,
>>> As you know the gate has been in particularly bad shape (gate queue over
>>> 100!) this week due to a number of factors. One factor is how many major
>>> outstanding bugs we have in the gate.  Below is a list of the top 4 open
>>> gate bugs.
>>> Here are some fun facts about this list:
>>> * All bugs have been open for over a month
>>> * All are nova bugs
>>> * These 4 bugs alone were hit 588 times which averages to 42 hits per
>>> day (data is over two weeks)!
>>> If we want the gate queue to drop and not have to continuously run
>>> 'recheck bug x' we need to fix these bugs.  So I'm looking for
>>> volunteers to help debug and fix these bugs.
>> I created the following etherpad to help track the most important Nova
>> gate bugs, who is actively working on them, and any patches that we have
>> in flight to help address them:
>>    https://etherpad.openstack.org/p/nova-gate-issue-tracking
>> Please jump in if you can.  We shouldn't wait for the gate bug day to
>> move on these.  Even if others are already looking at a bug, feel free
>> to do the same.  We need multiple sets of eyes on each of these issues.
> Some good progress from the last few days:
> After looking at a lot of failures, we determined that the vast majority
> of failures are performance related.  The load being put on the
> OpenStack deployment is just too high.  We're working to address this to
> make the gate more reliable in a number of ways.
> 1) (merged) https://review.openstack.org/#/c/65760/
> The large-ops test was cut back from spawning 100 instances to 50.  From
> the commit message:
>    It turns out the variance in cloud instances is very high, especially
>    when comparing different cloud providers and regions. This test was
>    originally added as a regression test for the nova-network issues with
>    rootwrap. At which time this test wouldn't pass for 30 instances.  So
>    50 is still a valid regression test.
> 2) (merged) https://review.openstack.org/#/c/45766/
> nova-compute is able to do work in parallel very well.  nova-conductor
> cannot by default due to the details of our use of eventlet + how we
> talk to MySQL.  The way you allow nova-conductor to do its work in
> parallel is by running multiple conductor workers.  We had not enabled
> this by default in devstack, so our 4 vCPU test nodes were only using a
> single conductor worker.  They now use 4 conductor workers.
> 3) (still testing) https://review.openstack.org/#/c/65805/
> Right now when tempest runs in the devstack-gate jobs, it runs with
> concurrency=4 (run 4 tests at once).  Unfortunately, it appears that
> this maxes out the deployment and results in timeouts (usually network
> related).
> This patch changes tempest concurrency to 2 instead of 4.  The initial
> results are quite promising.  The tests have been passing reliably so
> far, but we're going to continue to recheck this for a while longer for
> more data.
> One very interesting observation on this came from Jim where he said "A
> quick glance suggests 1.2x -- 1.4x change in runtime."  If the
> deployment were *not* being maxed out, we would expect this change to
> result in much closer to a 2x runtime increase.
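
To make the arithmetic behind that observation concrete, here is a minimal
sketch. It assumes the simplest possible model (tests are independent and,
absent contention, wall-clock time scales inversely with concurrency); the
function name is made up for illustration and is not part of any gate tooling.

```python
def per_slot_speedup(old_concurrency, new_concurrency, runtime_ratio):
    """Estimate how much faster each test slot ran after the change.

    Assumption: with no resource contention, total runtime would grow by
    old_concurrency / new_concurrency. Anything less than that means the
    individual tests themselves got faster.
    """
    ideal_ratio = old_concurrency / new_concurrency
    return ideal_ratio / runtime_ratio


# Going from 4 to 2 workers, for the quoted 1.2x and 1.4x runtime changes,
# plus the 2.0x we would expect from a deployment that was not maxed out.
for ratio in (1.2, 1.4, 2.0):
    speedup = per_slot_speedup(4, 2, ratio)
    print(f"runtime x{ratio}: each slot ran about {speedup:.2f}x faster")
```

Under this model a 1.2x to 1.4x runtime change means each test slot ran
roughly 1.4x to 1.7x faster at concurrency 2, which is consistent with the
node being saturated at concurrency 4.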

We could also address this by locally turning up timeouts on the 
operations that are timing out, which would let those operations take 
the time they need.

Before dropping the concurrency I'd really like to make sure we can 
point to specific failures that we think will go away. There was a lot of 
speculation around nova-network; however, the nova-network timeout errors 
only show up in Elasticsearch on large-ops jobs, not normal tempest 
jobs. Making OpenStack more idle will definitely make more tests pass. 
The Neutron team has experienced that.

It would be a ton better if we could actually feed back a 503 with a 
retry time (which I realize is a ton of work).
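
As a rough illustration of what the client side of that could look like,
here is a sketch only. Nothing like this exists in the services today as far
as this thread says; `send` is a hypothetical callable standing in for
whatever issues the API request, and the defaults are arbitrary.

```python
import time


def call_with_retry(send, max_attempts=5, default_wait=1.0, sleep=time.sleep):
    """Call send() until it stops returning 503 or attempts run out.

    send is a hypothetical callable returning (status, headers, body).
    Assumes max_attempts is at least 1.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 503 or attempt == max_attempts:
            return status, body
        # Honor the server's Retry-After hint when it is a plain number of
        # seconds; fall back to default_wait for a missing or date-style value.
        try:
            wait = float(headers.get("Retry-After", default_wait))
        except ValueError:
            wait = default_wait
        sleep(wait)
```

The point of the design is that an overloaded service tells the caller how
long to back off, instead of the caller guessing a timeout up front.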

Because if we decide we're now always pinned to only 2-way, we have to 
start doing some major rethinking of our test strategy, as we'll be way 
outside the soft 45-minute time budget we've been trying to operate on. 
We'd actually been planning on going up to 8-way, but were waiting for 
some issues to get fixed before we did that. Staying at 2-way would more 
or less immediately put a moratorium on new tests. If that's what we 
need to do, that's what we need to do, but we should talk it through.

> 4) (approved, not yet merged) https://review.openstack.org/#/c/65784/
> nova-network seems to be the largest bottleneck in terms of performance
> problems when nova is maxed out on these test nodes.  This patch is one
> quick speedup we can make by not using rootwrap in a few cases where it
> wasn't necessary.  These really add up.
> 5) https://review.openstack.org/#/c/65989/
> This patch isn't a candidate for merging, but was written to test the
> theory that by updating nova-network to use conductor instead of direct
> database access, nova-network will be able to do work in parallel better
> than it does today, just as we have observed with nova-compute.
> Dan's initial test results from this are **very** promising.  Initial
> testing showed a 20% speedup in runtime and a 33% decrease in CPU
> consumption by nova-network.
> Doing this properly will not be quick, but I'm hopeful that we can
> complete it by the Icehouse release.  We will need to convert
> nova-network to use Nova's object model.  Much of this work is starting
> to catch nova-network up on work that we've been doing in the rest of
> the tree but have passed on doing for nova-network due to nova-network
> being in a freeze.

I'm a huge +1 on fixing this in nova-network.

> 6) (no patch yet)
> We haven't had time to dive too deep into this yet, but we would also
> like to revisit our locking usage and how it is affecting nova-network
> performance.  There may be some more significant improvements we can
> make there.
> Final notes:
> I am hopeful that by addressing these performance issues both in Nova's
> code, as well as by turning down the test load, that we will see a
> significant increase in gate reliability in the near future.  I
> apologize on behalf of the Nova team for Nova's contribution to gate
> instability.
> *Thank you* to everyone who has been helping out!

Yes, thanks much to everyone here.


Sean Dague
Samsung Research America
sean at dague.net / sean.dague at samsung.com
