[openstack-dev] Gate Update - Wed Morning Edition

Sean Dague sean at dague.net
Wed Jan 22 14:38:59 UTC 2014


Things aren't great, but they are actually better than yesterday.

Vital Stats:
  Gate queue length: 107
  Check queue length: 107
  Head of gate entered: 45hrs ago
  Changes merged in last 24hrs: 58

The 58 changes merged is actually a good number: not a great number, but
the best we've seen in a number of days. I saw at least one streak of 6
consecutive merges yesterday, so zuul is starting to behave the way we
expect it should.

= Previous Top Bugs =

Our previous top 2 issues, 1270680 and 1270608 (not confusing at all),
are under control.

Bug 1270680 - v3 extensions api inherently racey wrt instances

Russell handled the second part of the fix for this; we've not seen it
come back since that was ninja-merged.

Bug 1270608 - n-cpu 'iSCSI device not found' log causes
gate-tempest-dsvm-*-full to fail

Turning off the test that was triggering this made it go away
completely. We'll have to revisit whether that's because there is a
cinder bug or a tempest bug, but we'll do that once the dust has settled.

= New Top Bugs =

Note: all fail numbers are across all queues

Bug 1253896 - Attempts to verify guests are running via SSH fails. SSH
connection to guest does not work.

83 fails in 24hrs


Bug 1224001 - test_network_basic_ops fails waiting for network to become
available

51 fails in 24hrs


Bug 1254890 - "Timed out waiting for thing" causes tempest-dsvm-* failures

30 fails in 24hrs


We are now sorting http://status.openstack.org/elastic-recheck/ by
failures in the last 24hrs, so we can use it more as a hit list. The top
3 issues are fingerprinted against infra, but are mostly related to
normal restart operations at this point.

= Starvation Update =

With 214 jobs across queues, and an average of 7 devstack nodes per job,
our working set is 1498 nodes (i.e. if we had that many nodes we'd be
able to run all the jobs in parallel right now).

Our current quota of nodes gives us ~480, which is less than 1/3 of our
working set, and is part of the reason for the delays. Rackspace has
generously increased our quota in 2 of their availability zones, and
Monty is going to prioritize getting those online.
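
For the curious, the back-of-the-envelope math looks like this (the
214 / 7 / 480 figures are just the snapshot numbers from this mail):

  # Back-of-the-envelope capacity math from the numbers above.
  jobs_in_flight = 214       # jobs across the check + gate queues
  nodes_per_job = 7          # average devstack nodes per job
  working_set = jobs_in_flight * nodes_per_job   # 1498 nodes

  current_quota = 480        # nodes we can actually get today
  coverage = current_quota / float(working_set)
  print("working set: %d nodes" % working_set)
  print("quota covers %.0f%% of it" % (coverage * 100))  # ~32%, i.e. < 1/3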

Because of Jenkins scaling issues (it starts generating failures when
talking to too many build slaves), adding capacity also means spinning
up more Jenkins masters. We've found that a 1:100 master-to-slave ratio
keeps Jenkins basically stable; pushing beyond that means new failures.
Jenkins is not inherently elastic, so this is a somewhat manual process.
Monty is diving in on that.
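
To give a rough sense of scale (this is my own arithmetic, not an
official plan, and masters_needed is just a throwaway name here), the
1:100 ratio implies something like:

  import math

  # Rough sketch: how many Jenkins masters the 1:100 master-to-slave
  # ratio implies for a given node count.
  def masters_needed(nodes, slaves_per_master=100):
      return int(math.ceil(float(nodes) / slaves_per_master))

  print(masters_needed(480))    # current quota     -> 5 masters
  print(masters_needed(1498))   # full working set  -> 15 masters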

There is also a TCP-slow-start-style algorithm for zuul that Clark was
working on yesterday, which we'll put into production as soon as it is
ready. This will prevent us from speculating all the way down the gate
queue, just to throw it all away on a reset. It acts just like TCP: on
every success we grow our speculation length, on every failure we reduce
it, with a sane minimum so we don't over-throttle ourselves.
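
To illustrate the idea (this is just a toy sketch of the behavior
described above, not Clark's actual patch, and the growth/minimum values
are made up):

  # Toy speculation-window controller in the spirit of TCP slow start.
  class SpeculationWindow(object):
      def __init__(self, minimum=2):
          self.minimum = minimum
          self.size = minimum      # how many changes we test speculatively

      def on_success(self):
          # A change merged cleanly: grow the window.
          self.size += 1

      def on_failure(self):
          # A gate reset: cut the window back, but keep a sane floor
          # so we don't over-throttle ourselves.
          self.size = max(self.minimum, self.size // 2)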


Thanks to everyone who's been pitching in digging on reset bugs. More
help is needed. Many core reviewers are at this point completely
ignoring normal reviews until the gate is back, so if you are waiting
for a review on some code, the best way to get it is to help us fix the
bugs resetting the gate.

	-Sean

-- 
Sean Dague
Samsung Research America
sean at dague.net / sean.dague at samsung.com
http://dague.net
