[OpenStack-Infra] suggestions for gate optimizations

Joe Gordon joe.gordon0 at gmail.com
Mon Jan 20 04:38:10 UTC 2014


On Sun, Jan 19, 2014 at 7:01 AM, Monty Taylor <mordred at inaugust.com> wrote:

> On 01/19/2014 05:38 AM, Sean Dague wrote:
>
>> So, we're currently 70 deep in the gate; the change at the top of the
>> queue was enqueued more than 40 hrs ago (probably closer to 50 or 60,
>> but we only have enqueue times going back to the zuul restart).
>>
>> I have a couple of ideas about things we should do based on what I've
>> seen in the gate during this wedge.
>>
>> = Remove reverify entirely =
>>
>
> Yes. Screw it. In a deep queue like the current one, it's generally more
> harmful than good.


I agree with this one, but we should also try to educate the devs: in the
case you brought up below, it was a core dev who didn't examine why his
patch failed, and if he couldn't run "reverify bug", he could have just
+A'd it again instead.


>
>
>  Core reviewers can trigger a requeue with +A state changes. Reverify
>> right now is exceptionally dangerous in that it lets *any* user put
>> something back in the gate, even if it can't pass. There are a ton of
>> users who believe they are being helpful in doing so, when they are
>> actually making things a ton worse; stable/havana changes are a prime
>> example.
>>
>> If we were being Prolog tricky, I'd actually like to make changes that
>> Jenkins has -2'd need a positive run before they could be re-enqueued.
>> For instance, I saw a swift core developer run "reverify bug 123456789"
>> again on a change that couldn't pass. While -2s are mostly races at this
>> point, the people who choose to ignore them are not staying current
>> enough on what's going on in the queue to really know whether or not
>> trying again is ok.
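
Making changes that Jenkins has -2'd prove themselves first seems doable
even without the Prolog route. A rough sketch of the guard (all helper and
attribute names here are hypothetical, not real Zuul or Gerrit API):

    # A change that Jenkins has -2'd may only re-enter the gate once it
    # has a passing check result newer than that failure.
    def may_reenqueue(change):
        last_fail = change.last_vote('Verified', -2, by='jenkins')
        if last_fail is None:
            return True  # never failed the gate; nothing to prove
        last_pass = change.last_vote('Verified', +1, by='jenkins')
        # Require a fresh passing check run after the gate failure.
        return last_pass is not None and last_pass.granted > last_fail.granted
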
>>
>> = Early Fail Detection =
>>
>> With the tempest run now coming in north of an hour, I think we need to
>> bump up the priority of signaling to Jenkins that the run is a failure
>> the first time we see a failure in the subunit stream. If we fail at 30
>> minutes, waiting until 60 for a reset just adds far more delay.
>>
>> I'm not really sure how we get started on this one, but I think we should.
>>
>
> This one I think will be helpful, but it is also the one that requires
> the deepest development work. Honestly, the chances of getting it done
> this week are almost none.
>
> That said - I agree we should accelerate working on it. I have access to a
> team of folks in India with both Python and Java backgrounds - if it would
> be helpful and if we can break the work out into, you know, assignable
> chunks, let me know.
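
For the subunit side, here is one possible starting point for an assignable
chunk: a watcher that flags the first failure as the stream goes by. This
assumes the python-subunit v2 API; notify_jenkins() is a stand-in for
whatever abort mechanism we end up with, not an existing hook.

    import sys

    import subunit
    import testtools


    class FirstFailureDetector(testtools.StreamResult):
        """Remember the first failing test seen in a subunit stream."""

        def __init__(self):
            super(FirstFailureDetector, self).__init__()
            self.first_failure = None

        def status(self, test_id=None, test_status=None, **kwargs):
            if test_status == 'fail' and self.first_failure is None:
                self.first_failure = test_id
                notify_jenkins(test_id)  # stand-in: tell Jenkins to stop now


    def notify_jenkins(test_id):
        # Placeholder: how to actually signal Jenkins is the open question.
        sys.stderr.write('EARLY FAILURE: %s\n' % test_id)


    if __name__ == '__main__':
        stream = getattr(sys.stdin, 'buffer', sys.stdin)  # bytes on py2/py3
        subunit.ByteStreamToStreamResult(stream).run(FirstFailureDetector())
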
>
>
>  = Pep8 kick out of check =
>>
>> I think on the check queue we should run pep8 first, and not run other
>> tests until that passes (this reverses a previous opinion I had). We're
>> now starving nodepool. Not consuming 5 nodepool nodes on patches that
>> don't pass pep8 would be handy. When Dan pushes a 15-patch series that
>> fixes nova-network and patch 4 has a pep8 error, we thrash a bunch.
>>
>
> Agree. I think this might be one of those things that goes back and forth
> on being a good or bad idea over time. I think now is a time when it's a
> good idea.



What about adding a pre-gate queue that makes sure pep8 and unit tests pass
before a change is added to the gate? (Of course, this would mean we would
have to re-run pep8 and unit tests in the gate.) Hopefully this would reduce
the amount of gate thrashing incurred by a gate patch that fails one of
these jobs.
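
Either way the shape is the same: run the cheap jobs first, and spend
expensive nodes only after they pass. A toy illustration (the job names and
the launch_job callable are made up; this is not actual Zuul scheduler
code):

    CHEAP_JOBS = ['gate-nova-pep8', 'gate-nova-python27']
    EXPENSIVE_JOBS = ['gate-tempest-dsvm-full', 'gate-tempest-dsvm-postgres']

    def run_check(change, launch_job):
        # Fail fast on the cheap jobs before consuming any devstack nodes.
        for job in CHEAP_JOBS:
            if not launch_job(change, job):
                return False
        # Cheap jobs passed; now it is worth burning real nodepool nodes.
        return all(launch_job(change, job) for job in EXPENSIVE_JOBS)
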


>
>
>  = More aggressive kick out by zuul =
>>
>> We have issues where projects have racing unit tests, which they've not
>> prioritized fixing. So those create wrecking balls in the gate.
>> Previously we've been opposed to kicking those out, based on the theory
>> that the patch ahead could be the problem (which I've actually never seen
>> happen).
>>
>> However, this is actually fixable. We could check whether anything
>> ahead of it in zuul runs the same tests. If not, then it's not
>> possible that something ahead of it could fix it. This is based on the
>> same logic zuul uses to build the queue in the first place.
>>
>> This would shed the wrecking balls earlier.
>>
>
> Interesting. How would zuul be able to investigate that? Do we need
> zuul-subunit-consumption for this one too?
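
I don't think it needs subunit consumption at all: zuul already knows which
jobs each queue item runs, so the test can be purely structural. Something
like this (hypothetical attribute names, not zuul's real data model):

    # If nothing ahead of a failing item runs any of the same jobs, then
    # nothing ahead can be the cause of its failure, and it is safe to
    # kick it out immediately instead of waiting for another gate reset.
    def could_be_fixed_by_queue(failing_item, items_ahead):
        failing_jobs = set(job.name for job in failing_item.jobs)
        for item in items_ahead:
            if failing_jobs & set(job.name for job in item.jobs):
                return True  # shared jobs: interference is possible
        return False  # unrelated to everything ahead: shed the wrecking ball
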
>
>
>  = Periodic recheck on old changes =
>>
>> I think Michael Still said he was working on this one. Certain projects,
>> like Glance and Keystone, tend to approve things with really stale test
>> results (> 1 month old). These fail, and then tumble. They are a big
>> source of the wrecking balls.
>>
>
> I believe he's got it working, actually. I think the real trick with this
> - which I whole-heartedly approve of - is not making node starvation worse.
>
>  Test results more than 1 week old are clearly irrelevant. For something
>> like nova, more than 3 days can be problematic.
>>
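
The staleness policy itself is easy to express; the thresholds below are
just the numbers from this thread (a tighter limit for nova than the
default), not a settled proposal:

    import datetime

    DEFAULT_MAX_AGE = datetime.timedelta(days=7)
    PER_PROJECT_MAX_AGE = {'openstack/nova': datetime.timedelta(days=3)}

    def results_are_stale(project, result_time, now=None):
        # Results older than the project's limit should trigger a recheck
        # before the change is allowed into the gate.
        now = now or datetime.datetime.utcnow()
        max_age = PER_PROJECT_MAX_AGE.get(project, DEFAULT_MAX_AGE)
        return now - result_time > max_age
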
>>
>> I'm sure there are some other ideas, but I wanted to dump this out while
>> it was fresh in my brain.
>>
>>         -Sean