[OpenStack-Infra] suggestions for gate optimizations

Sean Dague sean at dague.net
Sun Jan 19 13:38:26 UTC 2014


So, we're currently 70 deep in the gate, top of queue went in > 40 hrs
ago (probably closer to 50 or 60, but we only have enqueue time going
back to the zuul restart).

I have a couple of ideas about things we should do based on what I've
seen in the gate during this wedge.

= Remove reverify entirely =

Core reviewers can trigger a requeue with +A state changes. Reverify
right now is exceptionally dangerous in that it lets *any* user put
something back in the gate, even if it can't pass. There are a ton of
users who believe they are being helpful in doing so, and are making
things a ton worse. stable/havana changes are a prime example.

If we were being prolog tricky, I'd actually like to make Jenkins -2
changes need a positive run before they could be re-enqueued. For
instance, I saw a swift core developer run "reverify bug 123456789"
again on a change that couldn't pass. While -2s are mostly races at this
point, the people who are choosing to ignore them are not keeping up
with what's going on in the queue enough to really know whether or not
trying again is ok.
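
To make the rule concrete, here is a minimal sketch of the requeue policy
described above. It is plain Python, not a Gerrit prolog rule, and the
Change fields and the may_reenqueue name are hypothetical, made up for
illustration only.

```python
from dataclasses import dataclass


@dataclass
class Change:
    # Hypothetical fields, not taken from Gerrit or zuul.
    jenkins_vote: int            # latest Jenkins vote, e.g. -2, -1, 0, 1
    passed_since_failure: bool   # True if a run passed after the last -2


def may_reenqueue(change: Change, requester_is_core: bool) -> bool:
    """Return True if the change may go back into the gate."""
    # Only core reviewers (via a +A state change) can requeue at all.
    if not requester_is_core:
        return False
    # A change that Jenkins marked -2 needs a passing run first.
    if change.jenkins_vote == -2 and not change.passed_since_failure:
        return False
    return True
```

Under this sketch a non-core user can never requeue, and a core reviewer
can only requeue a -2 change once it has passed a run again.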

= Early Fail Detection =

With the tempest run now coming in north of an hour, I think we need to
bump up the priority of signaling up to Jenkins that we're a failure the
first time we see one in the subunit stream. If we fail at 30 minutes,
waiting until 60 for a reset just adds far more delay.

I'm not really sure how we get started on this one, but I think we should.
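
As a rough starting point, the core of the idea is just "stop reading at
the first failure". The sketch below assumes a simplified stream of
"<status> <test_id>" lines, which is a stand-in and not the real subunit
wire format; first_failure is a hypothetical name, and how the result
would be reported to Jenkins is left out entirely.

```python
from typing import Iterable, Optional


def first_failure(lines: Iterable[str]) -> Optional[str]:
    """Return the id of the first failing test, or None if none fails.

    Assumes simplified "<status> <test_id>" lines, not real subunit.
    """
    for line in lines:
        status, _, test_id = line.strip().partition(" ")
        if status in ("fail", "error"):
            # Stop reading here: the caller can report the failure now
            # instead of waiting for the rest of the run.
            return test_id
    return None
```

Because it returns as soon as a failing line shows up, the rest of the
stream is never consumed, which is the point: the failure is known at
minute 30 rather than minute 60.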

= Pep8 kick out of check =

I think on the check queue we should run pep8 first, and not run other
tests until that passes (this reverses a previous opinion I had). We're
now starving nodepool. Not taking 5 nodepool nodes for patches that
don't pass pep8 would be handy. When Dan pushes a 15 patch series that
fixes nova-network, and patch 4 has a pep8 error, we thrash a bunch.
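
The ordering above can be sketched like this. It is not zuul
configuration; run_check_jobs is a hypothetical helper, and each job is
modeled as a plain callable that returns True on success.

```python
from typing import Callable, Dict, List


def run_check_jobs(jobs: Dict[str, Callable[[], bool]],
                   first: str = "pep8") -> List[str]:
    """Run `first`, then the remaining jobs only if it passed.

    Returns the names of the jobs that were actually started.
    """
    started = [first]
    if not jobs[first]():
        # Style check failed: do not take nodes for the other jobs.
        return started
    for name, job in jobs.items():
        if name != first:
            started.append(name)
            job()
    return started
```

With a failing pep8 job only one job is ever started, so the other nodes
stay free for patches that have a chance of passing.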

= More aggressive kick out by zuul =

We have issues where projects have racing unit tests, which they've not
prioritized fixing. So those create wrecking balls in the gate.
Previously we've been opposed to kicking those out based on the theory
the patch ahead could be the problem (which I've actually never seen).

However.... this is actually fixable. We could see if there is anything
ahead of it in zuul that runs the same tests. If not, then it's not
possible that something ahead of it could fix it. This is based on the
same logic zuul uses to build the queue in the first place.

This would shed the wrecking balls earlier.
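
A minimal sketch of that check, assuming each change in the queue can be
reduced to the set of job names it runs. The can_kick_out name and the
list-of-sets layout are assumptions for illustration, not zuul's actual
data model.

```python
from typing import List, Set


def can_kick_out(queue: List[Set[str]], index: int, failed_job: str) -> bool:
    """Decide whether the failing change at `index` can be ejected early.

    `queue` lists the job names each change runs, head of the queue first.
    """
    ahead = queue[:index]
    # If no change ahead runs the failed job, none of them can be the
    # reason it failed, so waiting for them to settle cannot help.
    return not any(failed_job in jobs for jobs in ahead)
```

For example, a change that fails its own project's unit tests while
nothing ahead of it runs that job could be dropped right away, whereas a
failure in a job shared with changes ahead still has to wait.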

= Periodic recheck on old changes =

I think Michael Still said he was working on this one. Certain projects,
like Glance and Keystone, tend to approve things with really stale test
results (> 1 month old). These fail, and then tumble. They are a big
source of the wrecking balls.

Test results > 1 week old are clearly irrelevant. For something like
nova, > 3 days can be problematic.
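
The staleness rule could look something like the sketch below. The 1 week
and 3 day thresholds come from the numbers above; the dict layout and the
needs_recheck name are assumptions, and this is not whatever Michael is
actually building.

```python
from datetime import datetime, timedelta

# Thresholds from the text above; the layout here is an assumption.
DEFAULT_MAX_AGE = timedelta(days=7)
MAX_AGE_BY_PROJECT = {"nova": timedelta(days=3)}


def needs_recheck(project: str, last_tested: datetime, now: datetime) -> bool:
    """Return True if the last test results are too old to trust."""
    max_age = MAX_AGE_BY_PROJECT.get(project, DEFAULT_MAX_AGE)
    return now - last_tested > max_age
```

A periodic job could walk open changes and trigger a recheck on anything
for which this returns True, before a core reviewer approves it.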

I'm sure there are some other ideas, but I wanted to dump this out while
it was fresh in my brain.

	-Sean

-- 
Sean Dague
Samsung Research America
sean at dague.net / sean.dague at samsung.com
http://dague.net
