<div dir="ltr">*These are only observations, so please keep in mind that I am only trying to get to the bottom of how efficiently we use our limited resources.*<div><div>Please feel free to correct my understanding.</div><div><br></div><div>We have some core projects which many other projects depend on: Nova, Glance, Keystone, Neutron, Cinder, etc.<br></div></div><div>In the CI, every project has equal access.</div><div>If feature A in a non-core project depends on feature B in a core project, why is feature B not prioritized?</div><div><br></div><div>Can we solve this issue by breaking apart the current equal-access structure into something more granular?</div><div><br></div><div>I understand that improving job efficiency will likely result in more, smaller jobs, but will that actually solve the issue at the gate at this point in the cycle, every release? (I am sure it comes up every time.)</div><div>If the job time is cut in half but the number of jobs is doubled, the total node time consumed is unchanged, so we will probably still have the same issue.</div><div><br></div><div>We have limited resources, and without more providers coming online I fear this issue is only going to get worse over time if we do nothing.</div><div><br></div><div>~/DonnyD<br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Sep 13, 2019 at 3:47 PM Matt Riedemann &lt;<a href="mailto:mriedemos@gmail.com">mriedemos@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 9/13/2019 2:03 PM, Clark Boylan wrote:<br>

> We've been fielding a fair number of questions and suggestions around Zuul's long change (and job) queues over the last week or so. As a result I tried to put together a quick FAQ-type document [0] on how we schedule jobs, why we schedule that way, and how we can improve the long queues.<br>

> <br>

> Hoping that gives us all a better understanding of why we are in the current situation and ideas on how we can help to improve things.<br>

> <br>

> [0] <a href="https://docs.openstack.org/infra/manual/testing.html#why-are-jobs-for-changes-queued-for-a-long-time" rel="noreferrer" target="_blank">https://docs.openstack.org/infra/manual/testing.html#why-are-jobs-for-changes-queued-for-a-long-time</a><br>

<br>

Thanks for writing this up, Clark.<br>

<br>

As for the current status of the gate, several nova devs have been <br>

closely monitoring it because three fairly lengthy series of <br>

feature changes have been approved since yesterday. We're trying to shepherd <br>

those through, but we're seeing failures and trying to react to them.<br>

<br>

Two issues of note this week:<br>

<br>

1. <a href="http://status.openstack.org/elastic-recheck/index.html#1843615" rel="noreferrer" target="_blank">http://status.openstack.org/elastic-recheck/index.html#1843615</a><br>

<br>

I had pushed a fix for that one earlier in the week but there was a bug <br>

in my fix which Takashi has fixed:<br>

<br>

<a href="https://review.opendev.org/#/c/682025/" rel="noreferrer" target="_blank">https://review.opendev.org/#/c/682025/</a><br>

<br>

That was promoted to the gate earlier today but failed on...<br>

<br>

2. <a href="http://status.openstack.org/elastic-recheck/index.html#1813147" rel="noreferrer" target="_blank">http://status.openstack.org/elastic-recheck/index.html#1813147</a><br>

<br>

We have a couple of patches up for that now which might get promoted <br>

once we are reasonably sure those are going to pass check (promoting to <br>

the gate means skipping check, which is risky because if it fails in the gate <br>

we have to re-queue the gate, as the doc above explains).<br>

<br>

As far as overall failure classifications go, we're pretty good there in <br>

elastic-recheck:<br>

<br>

<a href="http://status.openstack.org/elastic-recheck/data/integrated_gate.html" rel="noreferrer" target="_blank">http://status.openstack.org/elastic-recheck/data/integrated_gate.html</a><br>

<br>

Meaning that for the most part we know what's failing; we just need to fix <br>

the bugs.<br>

<br>

One that continues to dog us (and by "us" I mean OpenStack, not just <br>

nova) is this one:<br>

<br>

<a href="http://status.openstack.org/elastic-recheck/gate.html#1686542" rel="noreferrer" target="_blank">http://status.openstack.org/elastic-recheck/gate.html#1686542</a><br>

<br>

The QA team's work to split apart the big tempest full jobs into <br>

service-oriented jobs like tempest-integrated-compute should have helped <br>

here, but we're still seeing lots of jobs time out, which <br>

likely means some really slow tests are running in too many jobs, <br>

and those require investigation. It could also be devstack setup that is <br>

taking a long time, as Clark identified with OSC usage a while back:<br>

<br>

<a href="http://lists.openstack.org/pipermail/openstack-discuss/2019-July/008071.html" rel="noreferrer" target="_blank">http://lists.openstack.org/pipermail/openstack-discuss/2019-July/008071.html</a><br>

<br>

If you have questions about how elastic-recheck works or how to help <br>

investigate some of these failures, for example by using <br>

<a href="http://logstash.openstack.org" rel="noreferrer" target="_blank">logstash.openstack.org</a>, please reach out to me (mriedem), clarkb and/or <br>

gmann in #openstack-qa.<br>

<br>

-- <br>

<br>

Thanks,<br>

<br>

Matt<br>

<br>

</blockquote></div>