[openstack-dev] Gate Update - Wed Morning Edition
Salvatore Orlando
sorlando at nicira.com
Wed Jan 22 15:32:36 UTC 2014
It's worth noting that elastic recheck is signalling bug 1253896 and bug
1224001, but they actually have the same signature.
I also found it interesting that neutron is triggering bug 1254890 a lot;
it appears to be a hang on /dev/nbdX during key injection, and so far I
have no explanation for that.
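For context, that injection path mounts the guest image over an nbd device
and writes the key into the guest filesystem, roughly like this (a
simplified sketch of the sequence, not nova's actual code; the device name,
image path and mount point are illustrative):

    import subprocess

    dev = '/dev/nbd0'            # illustrative device
    mnt = '/tmp/inject'          # illustrative mount point
    # export the instance disk over nbd
    subprocess.check_call(['qemu-nbd', '-c', dev, '/path/to/instance/disk'])
    try:
        # the reported hang seems to be somewhere in this wait/mount step
        subprocess.check_call(['mount', dev, mnt])
        # ... write the ssh key under <mnt>/root/.ssh/authorized_keys ...
        subprocess.check_call(['umount', mnt])
    finally:
        subprocess.check_call(['qemu-nbd', '-d', dev])
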
As suggested on IRC, the neutron isolated job had a failure rate of about
5-7% last week (until Thursday, I think). It might therefore also be worth
looking at tempest/devstack patches which might be triggering failures or
uncovering issues in neutron.
I shared a few findings on the mailing list yesterday ([1]). I hope people
actively looking at failures will find them helpful.
Salvatore
[1]
http://lists.openstack.org/pipermail/openstack-dev/2014-January/025013.html
On 22 January 2014 14:57, Sean Dague <sean at dague.net> wrote:
> On 01/22/2014 09:38 AM, Sean Dague wrote:
> > Things aren't great, but they are actually better than yesterday.
> >
> > Vital Stats:
> > Gate queue length: 107
> > Check queue length: 107
> > Head of gate entered: 45hrs ago
> > Changes merged in last 24hrs: 58
> >
> > The 58 changes merged is actually a good number, not a great one, but
> > the best we've seen in a number of days. I saw at least one streak of 6
> > merges yesterday, so zuul is starting to behave like we expect it should.
> >
> > = Previous Top Bugs =
> >
> > Our previous top 2 issues - 1270680 and 1270608 (not confusing at all) -
> > are under control.
> >
> > Bug 1270680 - v3 extensions api inherently racey wrt instances
> >
> > Russell managed the second part of the fix for this; we've not seen it
> > come back since that was ninja-merged.
> >
> > Bug 1270608 - n-cpu 'iSCSI device not found' log causes
> > gate-tempest-dsvm-*-full to fail
> >
> > Turning off the test that was triggering this made it completely go
> > away. We'll have to revisit whether that's because there is a cinder bug
> > or a tempest bug, but we'll do that once the dust has settled.
> >
> > = New Top Bugs =
> >
> > Note: all fail numbers are across all queues
> >
> > Bug 1253896 - Attempts to verify guests are running via SSH fails. SSH
> > connection to guest does not work.
> >
> > 83 fails in 24hrs
> >
> >
> > Bug 1224001 - test_network_basic_ops fails waiting for network to become
> > available
> >
> > 51 fails in 24hrs
> >
> >
> > Bug 1254890 - "Timed out waiting for thing" causes tempest-dsvm-*
> > failures
> >
> > 30 fails in 24hrs
> >
> >
> > We are now sorting http://status.openstack.org/elastic-recheck/ by
> > failures in the last 24hrs, so we can use it more as a hit list. The top
> > 3 issues are fingerprinted against infra, but are mostly related to
> > normal restart operations at this point.
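> >
> > (Conceptually that sorting is just this, using the 24hr counts above -
> > an illustrative snippet, not the actual elastic-recheck code:)
> >
> >     fails_24h = {'1253896': 83, '1224001': 51, '1254890': 30}
> >     hit_list = sorted(fails_24h.items(), key=lambda kv: kv[1], reverse=True)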
> >
> > = Starvation Update =
> >
> > With 214 jobs across queues, and averaging 7 devstack nodes per job, our
> > working set is 1498 nodes (i.e. if we had that number we'd be able to run
> > all the jobs in parallel right now).
> >
> > Our current quota of nodes gives us ~480, which is less than 1/3 of our
> > working set, and part of the reason for the delays. Rackspace has
> > generously increased our quota in 2 of their availability zones, and
> > Monty is going to prioritize getting those online.
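> >
> > Back-of-the-envelope version of that math (same numbers as above):
> >
> >     jobs_in_queues = 214                          # gate + check
> >     nodes_per_job = 7                             # average devstack nodes
> >     working_set = jobs_in_queues * nodes_per_job  # 1498 nodes
> >     current_quota = 480
> >     print(current_quota / float(working_set))     # ~0.32, i.e. < 1/3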
> >
> > Because of Jenkins scaling issues (it starts generating failures when
> > talking to too many build slaves), that means spinning up more Jenkins
> > masters. We've found a 1/100 master-to-slave ratio keeps Jenkins
> > basically stable; pushing beyond that means new failures. Jenkins is not
> > inherently elastic, so this is a somewhat manual process. Monty is
> > diving in on that.
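> >
> > In round numbers (illustrative only, assuming we hold the 1/100 ratio):
> >
> >     import math
> >     slaves_per_master = 100
> >     print(math.ceil(480 / float(slaves_per_master)))   # ~5 masters today
> >     print(math.ceil(1498 / float(slaves_per_master)))  # ~15 for full working set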
> >
> > There is also a TCP-slow-start-style algorithm for zuul that Clark was
> > working on yesterday, which we'll put into production as soon as it is
> > ready. This will prevent us from speculating all the way down the gate
> > queue, just to throw it all away on a reset. It acts just like TCP: on
> > every success we grow our speculation length, on every failure we reduce
> > it, with a sane minimum so we don't over-throttle ourselves.
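> >
> > The window logic is roughly this (a sketch of the idea only, not Clark's
> > actual patch; the floor value and the halving are made-up placeholders):
> >
> >     WINDOW_FLOOR = 3   # sane minimum so we never throttle to a standstill
> >
> >     def on_change_merged(window):
> >         # every success grows the speculation length
> >         return window + 1
> >
> >     def on_gate_reset(window):
> >         # every failure shrinks it, but never below the floor
> >         return max(WINDOW_FLOOR, window // 2)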
> >
> >
> > Thanks to everyone who's been pitching in digging on reset bugs. More
> > help is needed. Many core reviewers are at this point completely
> > ignoring normal reviews until the gate is back, so if you are waiting
> > for a review on some code, the best way to get it is to help us fix the
> > bugs resetting the gate.
>
> One last thing: Anita has also gotten on top of pruning all the neutron
> changes out of the gate. Something is very wrong in the neutron isolated
> jobs right now, so their chance of passing is close enough to 0 that we
> need to keep them out of the gate. This is a new regression in the last
> couple of days.
>
> This is a contributing factor in the gates moving again.
>
> She and Mark are rallying the Neutron folks to sort this one out.
>
> -Sean
>
> --
> Sean Dague
> Samsung Research America
> sean at dague.net / sean.dague at samsung.com
> http://dague.net
>
>