<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Sun, Jan 19, 2014 at 7:01 AM, Monty Taylor <span dir="ltr"><<a href="mailto:mordred@inaugust.com" target="_blank">mordred@inaugust.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">On 01/19/2014 05:38 AM, Sean Dague wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
So, we're currently 70 deep in the gate, top of queue went in > 40 hrs<br>
ago (probably closer to 50 or 60, but we only have enqueue time going<br>
back to the zuul restart).<br>
<br>
I have a couple of ideas about things we should do based on what I've<br>
seen in the gate during this wedge.<br>
<br>
= Remove reverify entirely =<br>
</blockquote>
<br></div>
Yes. Screw it. In a deep queue like now, it's more generally harmful than good.</blockquote><div><br></div><div>I agree with this one, but we should also try to educate the devs: in the case you brought up below, it was a core dev who didn't examine why his patch failed, and even without "reverify bug" available he could just have re-approved with +A.</div>
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Core reviewers can trigger a requeue with +A state changes. Reverify<br>
right now is exceptionally dangerous in that it lets *any* user put<br>
something back in the gate, even if it can't pass. There are a ton of<br>
users who believe they are being helpful in doing so but are making things<br>
a ton worse; stable/havana changes are a prime instance.<br>
<br>
If we were being Prolog-tricky, I'd actually like to make Jenkins -2<br>
changes need a positive run before they could be re-enqueued. For<br>
instance, I saw a swift core developer run "reverify bug 123456789"<br>
again on a change that couldn't pass. While -2s are mostly races at this<br>
point, the team of people that are choosing to ignore them are not<br>
staying up on what's going on in the queue enough to really know whether<br>
or not trying again is ok.<br>
<br>
= Early Fail Detection =<br>
<br>
With the tempest run now coming in north of an hour, I think we need to<br>
bump up the priority of signaling a failure up to Jenkins the first<br>
time we see one in the subunit stream. If we fail at 30 minutes,<br>
waiting for 60 until a reset just adds far more delay.<br>
<br>
I'm not really sure how we get started on this one, but I think we should.<br>
</blockquote>
<br></div>
This one I think will be helpful, but it is also the one that requires the deepest development work. Honestly, the chances of getting it done this week are almost none.<br>
<br>
That said - I agree we should accelerate working on it. I have access to a team of folks in India with both python and java backgrounds - if it would be helpful and if we can break the work out into assignable chunks, let me know.<div class="im">
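For reference, the early-fail idea could be sketched roughly like the snippet below. The simple "status: test-id" line format and the early-return shape are invented for illustration; a real implementation would parse the subunit v2 stream as tests report in, rather than waiting for the whole run to finish.

```python
# Sketch: watch a stream of per-test results and signal failure as soon
# as the first failing test appears, instead of waiting for the full run.
# The "status: test-id" line format below is invented for illustration;
# a real implementation would consume the subunit stream incrementally.

def first_failure(result_lines):
    """Return the id of the first failed test, or None if all passed."""
    for line in result_lines:
        status, _, test_id = line.partition(": ")
        if status == "fail":
            return test_id  # bail out early: no need to read the rest
    return None

if __name__ == "__main__":
    stream = [
        "success: tempest.api.compute.test_servers",
        "fail: tempest.api.network.test_routers",
        "success: tempest.api.volume.test_volumes",
    ]
    print(first_failure(stream))
```

The point of the sketch is only the control flow: the consumer stops at the first failure and can tell Jenkins to reset the queue 30 minutes earlier.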
<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
= Pep8 kick out of check =<br>
<br>
I think on the check queue we should run pep8 first, and not run other<br>
tests until that passes (this reverses a previous opinion I had). We're<br>
now starving nodepool. Avoiding tying up 5 nodepool nodes on patches that<br>
don't even pass pep8 would be handy. When Dan pushes a 15-patch series that<br>
fixes nova-network, and patch 4 has a pep8 error, we thrash a bunch.<br>
</blockquote>
<br></div>
Agree. I think this might be one of those things that goes back and forth between being a good and a bad idea over time. I think now is a time when it's a good idea.</blockquote><div><br></div><div><br></div><div>What about adding a pre-gate queue that makes sure pep8 and unit tests pass before a change is added to the gate (of course this would mean re-running pep8 and unit tests in the gate)? Hopefully this would reduce the amount of gate thrashing incurred by a gate patch that fails one of these jobs.</div>
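The ordering change being proposed (cheap checks first, expensive node-hungry jobs only if they pass) could be sketched like this; the job names and the `run_job` callable are stand-ins for whatever actually launches Jenkins jobs, not real infra APIs:

```python
# Sketch: run the cheap style check first and only claim expensive test
# nodes when it passes. Job names and the run_job hook are illustrative.

def run_check_pipeline(run_job, cheap_jobs=("pep8",),
                       expensive_jobs=("py27", "tempest")):
    """Run cheap jobs first; skip the expensive ones on any cheap failure.

    run_job(name) -> bool is supplied by the caller (a stand-in for
    whatever actually launches a Jenkins job).
    Returns a dict mapping job name -> result for the jobs that ran.
    """
    results = {}
    for job in cheap_jobs:
        results[job] = run_job(job)
    if all(results.values()):
        for job in expensive_jobs:
            results[job] = run_job(job)
    return results

if __name__ == "__main__":
    # A patch with a pep8 error never consumes the expensive nodes.
    print(run_check_pipeline(lambda job: job != "pep8"))
```

With this shape, a 15-patch series where patch 4 has a pep8 error burns one node on the pep8 run instead of five on the full job set.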
<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im"><br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
= More aggressive kick out by zuul =<br>
<br>
We have issues where projects have racing unit tests, which they've not<br>
prioritized fixing. So those create wrecking balls in the gate.<br>
Previously we've been opposed to kicking those out, based on the theory<br>
that the patch ahead could be the problem (which I've actually never seen).<br>
<br>
However.... this is actually fixable. We could see if there is anything<br>
ahead of it in zuul that runs the same tests. If not, then it's not<br>
possible that something ahead of it could fix it. This is based on the<br>
same logic zuul uses to build the queue in the first place.<br>
<br>
This would shed the wrecking balls earlier.<br>
</blockquote>
<br></div>
Interesting. How would zuul be able to investigate that? Do we need zuul-subunit-consumption for this one too?<div class="im"><br>
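The "is anything ahead relevant?" test Sean describes might look roughly like this; the (project, jobs) tuples standing in for queue items are an assumption for the sketch, not zuul's actual data model:

```python
# Sketch: a failing change can be ejected immediately if no change ahead
# of it in the gate queue runs any of the same jobs, because then nothing
# ahead could possibly have caused (or fix) its failure.
# The (project, jobs) tuples are an invented stand-in for zuul's queue
# items; zuul already knows each change's job set when it builds the queue.

def safe_to_eject(failing_jobs, changes_ahead):
    """True if no change ahead shares a job with the failing change."""
    failing = set(failing_jobs)
    return not any(failing & set(jobs) for _, jobs in changes_ahead)

if __name__ == "__main__":
    ahead = [("nova", {"py27", "tempest"}), ("glance", {"py27"})]
    print(safe_to_eject({"swift-unit"}, ahead))  # nothing shared
    print(safe_to_eject({"tempest"}, ahead))     # nova runs tempest too
```

This mirrors the logic zuul already uses to decide which changes share a queue, so no subunit consumption would be needed for this particular check.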
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
= Periodic recheck on old changes =<br>
<br>
I think Michael Still said he was working on this one. Certain projects,<br>
like Glance and Keystone, tend to approve things with really stale test<br>
results (> 1 month old). These fail, and then tumble. They are a big<br>
source of the wrecking balls.<br>
</blockquote>
<br></div>
I believe he's got it working, actually. I think the real trick with this - which I whole-heartedly approve of - is not making node starvation worse.<br>
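The staleness check itself is simple; a minimal sketch, with the per-project thresholds invented for illustration (loosely following the numbers in the thread: a week generally, 3 days for nova):

```python
# Sketch: decide whether a change's last test results are too stale to
# trust at approval time. The thresholds are illustrative, loosely
# following the thread (1 week generally, 3 days for nova).
import datetime

STALE_AFTER = {"nova": datetime.timedelta(days=3)}
DEFAULT_STALE_AFTER = datetime.timedelta(days=7)

def needs_recheck(project, last_tested, now=None):
    """True if the last test results exceed the project's staleness limit."""
    now = now or datetime.datetime.utcnow()
    return now - last_tested > STALE_AFTER.get(project, DEFAULT_STALE_AFTER)

if __name__ == "__main__":
    now = datetime.datetime(2014, 1, 19)
    print(needs_recheck("nova", now - datetime.timedelta(days=4), now))
    print(needs_recheck("glance", now - datetime.timedelta(days=4), now))
```

The hard part, as noted above, is scheduling the resulting rechecks without making node starvation worse, e.g. rate-limiting them or running them at off-peak times.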
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Test results > 1 week old are clearly irrelevant. For something like nova,<br>
3 days can be problematic.<br>
</blockquote>
<br>
I'm sure there are some other ideas, but I wanted to dump this out while<br>
it was fresh in my brain.<br>
<br>
-Sean<br>
<br>
<br>
<br></div>
_______________________________________________<br>
OpenStack-Infra mailing list<br>
<a href="mailto:OpenStack-Infra@lists.openstack.org" target="_blank">OpenStack-Infra@lists.openstack.org</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra</a><br>
<br>
</blockquote>
</blockquote></div><br></div></div>