[infra] A change to Zuul's queuing behavior
Ghanshyam Mann
gmann at ghanshyammann.com
Sun Dec 9 14:14:37 UTC 2018
---- On Sat, 08 Dec 2018 07:53:27 +0900 James E. Blair <corvus at inaugust.com> wrote ----
> Matt Riedemann <mriedemos at gmail.com> writes:
>
> > On 12/3/2018 3:30 PM, James E. Blair wrote:
> >> Since some larger projects consume the bulk of cloud resources in our
> >> system, this can be especially frustrating for smaller projects. To be
> >> sure, it impacts everyone, but while larger projects receive a
> >> continuous stream of results (even if delayed) smaller projects may wait
> >> hours before seeing results on a single change.
> >>
> >> In order to help all projects maintain a minimal velocity, we've begun
> >> dynamically prioritizing node requests based on the number of changes a
> >> project has in a given pipeline.
> >
> > FWIW, and maybe this is happening across the board right now, but it's
> > taking probably ~16 hours to get results on nova changes right now,
> > which becomes increasingly frustrating when they finally get a node,
> > tests run and then the job times out or something because the node is
> > slow (or some other known race test failure).
> >
> > Is there any way to determine or somehow track how long a change has
> > been queued up before and take that into consideration when it's
> > re-enqueued? Like take this change:
> >
> > https://review.openstack.org/#/c/620154/
> >
> > That took about 3 days to merge with constant rechecks from the time
> > it was approved. It would be cool if there was a way to say, from
> > within 50 queued nova changes (using the example in the original
> > email), let's say zuul knew that 10 of those 50 have already gone
> > through one or more times and weigh those differently so when they do
> > get queued up, they are higher in the queue than maybe something that
> > is just going through its first time.
>
> This suggestion would be difficult to implement, but also, I think it
> runs counter to some of the ideas that have been put into place
> in the past. In particular, the idea of clean-check was to make it
> harder to merge changes with gate failures (under the assumption that
> they are more likely to introduce racy tests). This might make it
> easier to recheck-bash bad changes in (along with good).
>
> Anyway, we chatted in IRC a bit and came up with another tweak, which is
> to group projects together in the check pipeline when setting this
> priority. We already do in gate, but currently, every project in the
> system gets equal footing in check for their first change. The change
> under discussion would group all tripleo projects together, and all the
> integrated projects together, so that the first change for a tripleo
> project had the same priority as the first change for an integrated
> project, and a puppet project, etc.
>
> The intent is to further reduce the priority "boost" that projects with
> lots of repos have.
>
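
As I read the proposal, a change's priority in check would be the number of
changes already ahead of it from the same project group. A rough sketch of
that idea follows; the PROJECT_GROUP mapping, the group names and the
relative_priorities() helper are made up for illustration and are not Zuul's
actual code.

```python
from collections import defaultdict

# Hypothetical mapping of repos to priority groups (illustration only).
PROJECT_GROUP = {
    "openstack/nova": "integrated",
    "openstack/neutron": "integrated",
    "openstack/tripleo-common": "tripleo",
    "openstack/tripleo-heat-templates": "tripleo",
}


def relative_priorities(queue):
    """Return (change_id, priority) pairs; a lower number is served first.

    `queue` is a list of (change_id, project) tuples in arrival order.
    A change's priority is the count of earlier changes in the same group,
    so the first change of every group gets priority 0.
    """
    seen = defaultdict(int)
    result = []
    for change_id, project in queue:
        # A project without a group is treated as its own one-project group.
        group = PROJECT_GROUP.get(project, project)
        result.append((change_id, seen[group]))
        seen[group] += 1
    return result


queue = [
    (1, "openstack/nova"),
    (2, "openstack/neutron"),
    (3, "openstack/tripleo-common"),
    (4, "openstack/tripleo-heat-templates"),
    (5, "openstack/puppet-nova"),
]
print(relative_priorities(queue))
```

With grouping, the second tripleo repo no longer gets a fresh priority 0 just
because it is a different repo.
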
> The idea is still to try to find a simple and automated way of more
> fairly distributing our resources. If this doesn't work, we can always
> return to the previous strict FIFO method. However, given the extreme
> delays we're seeing across the board, I'm trying to avoid the necessity
> of actually allocating quota to projects. If we can't make this work,
> and we aren't able to reduce utilization by improving the reliability of
> tests (which, by *far* would be the most effective thing to do -- please
> work with Clark on that), we may have to start talking about that.
>
> -Jim
We could use nodes more efficiently by dequeuing a change's jobs on the first failure, instead of
letting the full run finish before releasing the nodes. The trade-off is that you no longer get all the failures at once and
fix them together, but I am not sure that is how it works out every time. For example, if a change has a pep8 error, there is no need to run
the integration test jobs on it. That would at least save nodes to some extent.
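
To give a rough feel for the saving, here is a small sketch. It assumes all
jobs of a change start at the same time, each on its own node, and that the
first failure cancels everything still running. The job names and durations
are invented, and node_minutes() is a hypothetical helper, not anything in
Zuul.

```python
def node_minutes(jobs, fail_fast):
    """Total node-minutes consumed by one change.

    `jobs` is a list of (name, minutes, passed) tuples. Every job is
    assumed to start at time 0 on its own node (an assumption of this
    sketch, not a statement about how Zuul schedules).
    """
    if fail_fast:
        failure_times = [minutes for _name, minutes, passed in jobs if not passed]
        if failure_times:
            # Everything still running is cancelled at the first failure.
            cutoff = min(failure_times)
            return sum(min(minutes, cutoff) for _name, minutes, _passed in jobs)
    # Without fail-fast (or with no failures) every job runs to completion.
    return sum(minutes for _name, minutes, _passed in jobs)


# Invented durations: pep8 fails after 5 minutes, the rest would pass.
jobs = [("pep8", 5, False), ("unit", 20, True), ("tempest-full", 90, True)]
print(node_minutes(jobs, fail_fast=False))  # 115
print(node_minutes(jobs, fail_fast=True))   # 15
```
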
-gmann