[openstack-dev] Gate Update - Wed Morning Edition
sean at dague.net
Wed Jan 22 14:57:29 UTC 2014
On 01/22/2014 09:38 AM, Sean Dague wrote:
> Things aren't great, but they are actually better than yesterday.
> Vital Stats:
> Gate queue length: 107
> Check queue length: 107
> Head of gate entered: 45hrs ago
> Changes merged in last 24hrs: 58
> The 58 changes merged is actually a good number, not a great number, but
> best we've seen in a number of days. I saw at least a 6 streak merge
> yesterday, so zuul is starting to behave like we expect it should.
> = Previous Top Bugs =
> Our previous top 2 issues - 1270680 and 1270608 (not confusing at all)
> are under control.
> Bug 1270680 - v3 extensions api inherently racey wrt instances
> Russell managed the second part of the fix for this, we've not seen it
> come back since that was ninja merged.
> Bug 1270608 - n-cpu 'iSCSI device not found' log causes
> gate-tempest-dsvm-*-full to fail
> Turning off the test that was triggering this made it completely go
> away. We'll have to revisit if that's because there is a cinder bug or a
> tempest bug, but we'll do that once the dust has settled.
> = New Top Bugs =
> Note: all fail numbers are across all queues
> Bug 1253896 - Attempts to verify guests are running via SSH fails. SSH
> connection to guest does not work.
> 83 fails in 24hrs
> Bug 1224001 - test_network_basic_ops fails waiting for network to become
> 51 fails in 24hrs
> Bug 1254890 - "Timed out waiting for thing" causes tempest-dsvm-* failures
> 30 fails in 24hrs
> We are now sorting - http://status.openstack.org/elastic-recheck/ by
> failures in the last 24hrs, so we can use it more as a hit list. The top
> 3 issues are fingerprinted against infra, but are mostly related to
> normal restart operations at this point.
> = Starvation Update =
> With 214 jobs across queues, and averaging 7 devstack nodes per job, our
> working set is 1498 nodes (i.e. if we had that number we'd be able to be
> running all the jobs right now in parallel).
> Our current quota gives us ~480 nodes, which is < 1/3 of our working
> set and part of the reason for the delays. Rackspace has generously
> increased our quota in 2 of their availability zones, and Monty is going
> to prioritize getting those online.
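
For anyone who wants to re-run the starvation arithmetic as the queues change, here is a minimal sketch. It only uses the figures quoted above (107 + 107 jobs, 7 nodes per job on average, a quota of roughly 480); the variable names are made up for illustration.

```python
# Starvation arithmetic from the figures in this mail.
# Variable names are illustrative only, not taken from any infra tooling.
gate_queue = 107       # gate queue length
check_queue = 107      # check queue length
nodes_per_job = 7      # average devstack nodes per job
quota = 480            # approximate current node quota

jobs = gate_queue + check_queue
working_set = jobs * nodes_per_job   # nodes needed to run everything in parallel
coverage = quota / working_set       # fraction of the working set we can serve

print(f"jobs={jobs} working_set={working_set} coverage={coverage:.0%}")
```
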
> Because of Jenkins scaling issues (it starts generating failures when
> talking to too many build slaves), that means spinning up more Jenkins
> masters. We've found a 1 / 100 ratio makes Jenkins basically stable,
> pushing beyond that means new fails. Jenkins is not inherently elastic,
> so this is a somewhat manual process. Monty is diving on that.
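
To make the 1 / 100 ratio concrete, a rough sketch of how many Jenkins masters a given number of build slaves would imply. The function name and constant are hypothetical, and it assumes the ratio is a hard cap of 100 slaves per master, which is a simplification of "basically stable".

```python
import math

# Assumption: treat the 1 / 100 ratio as a hard cap of 100 slaves per master.
SLAVES_PER_MASTER = 100


def masters_needed(build_slaves: int) -> int:
    """Smallest number of masters that keeps every master at or under the cap."""
    return math.ceil(build_slaves / SLAVES_PER_MASTER)


print(masters_needed(480))    # current quota
print(masters_needed(1498))   # full working set
```
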
> There is also a TCP slow start algorithm for zuul that Clark was working
> on yesterday, which we'll put into production as soon as it is good.
> This will prevent us from speculating all the way down the gate queue,
> just to throw it all away on a reset. It acts just like TCP: on every
> success we grow our speculation length, on every fail we reduce it, with
> a sane minimum so we don't over-throttle ourselves.
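
A minimal sketch of the idea, not Clark's actual zuul change. The class name, the starting size, the floor, and the specific grow-by-one / halve-on-failure policy are all assumptions chosen for illustration; the only parts taken from the description above are "grow on success, shrink on failure, never go below a minimum".

```python
class SpeculationWindow:
    """Hypothetical window limiting how far down the gate queue we speculate."""

    def __init__(self, start: int = 20, floor: int = 3, growth: int = 1):
        self.size = start      # number of changes we currently test in parallel
        self.floor = floor     # sane minimum so we don't over-throttle
        self.growth = growth   # assumed additive increase per successful merge

    def on_success(self) -> int:
        # A change merged: speculate a little deeper next time.
        self.size += self.growth
        return self.size

    def on_failure(self) -> int:
        # A reset happened: halve the window (assumed policy), respecting the floor.
        self.size = max(self.floor, self.size // 2)
        return self.size


window = SpeculationWindow(start=20, floor=3)
for event in ("fail", "fail", "fail", "ok", "ok"):
    size = window.on_failure() if event == "fail" else window.on_success()
    print(event, size)
```

The point of the floor is that a run of resets can't shrink the window to nothing, so the gate keeps making some progress while the window recovers.
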
> Thanks to everyone that's been pitching in digging on reset bugs. More
> help is needed. Many core reviewers are at this point completely
> ignoring normal reviews until the gate is back, so if you are waiting
> for a review on some code, the best way to get it is to help us fix the
> bugs resetting the gate.
One last thing, Anita has also gotten on top of pruning out all the
neutron changes from the gate. Something is very wrong in the neutron
isolated jobs right now, so their chance of passing is close enough to
0 that we need to keep them out of the gate. This is a new regression
in the last couple of days.
This is a contributing factor in the gates moving again.
She and Mark are rallying the Neutron folks to sort this one out.
Samsung Research America
sean at dague.net / sean.dague at samsung.com