[infra] A change to Zuul's queuing behavior

James E. Blair corvus at inaugust.com
Fri Dec 7 22:53:27 UTC 2018


Matt Riedemann <mriedemos at gmail.com> writes:

> On 12/3/2018 3:30 PM, James E. Blair wrote:
>> Since some larger projects consume the bulk of cloud resources in our
>> system, this can be especially frustrating for smaller projects.  To be
>> sure, it impacts everyone, but while larger projects receive a
>> continuous stream of results (even if delayed) smaller projects may wait
>> hours before seeing results on a single change.
>>
>> In order to help all projects maintain a minimal velocity, we've begun
>> dynamically prioritizing node requests based on the number of changes a
>> project has in a given pipeline.
>
> FWIW, and maybe this is happening across the board right now, but it's
> taking probably ~16 hours to get results on nova changes right now,
> which becomes increasingly frustrating when they finally get a node,
> tests run and then the job times out or something because the node is
> slow (or some other known race test failure).
>
> Is there any way to determine or somehow track how long a change has
> been queued up before and take that into consideration when it's
> re-enqueued? Like take this change:
>
> https://review.openstack.org/#/c/620154/
>
> That took about 3 days to merge with constant rechecks from the time
> it was approved. It would be cool if there was a way to say, from
> within 50 queued nova changes (using the example in the original
> email), let's say zuul knew that 10 of those 50 have already gone
> through one or more times and weigh those differently so when they do
> get queued up, they are higher in the queue than maybe something that
> is just going through its first time.

This suggestion would be difficult to implement, and I think it also
runs counter to some of the ideas that have been put into place in the
past.  In particular, the idea of clean-check was to make it harder to
merge changes with gate failures (under the assumption that they are
more likely to introduce racy tests).  Giving previously-queued changes
a higher priority might make it easier to recheck-bash bad changes in
(along with good ones).

Anyway, we chatted in IRC a bit and came up with another tweak, which is
to group projects together in the check pipeline when setting this
priority.  We already do this in gate, but currently, every project in the
system gets equal footing in check for their first change.  The change
under discussion would group all tripleo projects together, and all the
integrated projects together, so that the first change for a tripleo
project had the same priority as the first change for an integrated
project, and a puppet project, etc.

The intent is to further reduce the priority "boost" that projects with
lots of repos have.
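
To make the effect of the grouping concrete, here is a rough sketch in
Python.  It is not Zuul's actual implementation: the function names, the
project names, and the group mapping are all hypothetical, and it
assumes that a request's priority is simply the number of changes
already ahead of it from the same project (or group), with lower numbers
served first.

```python
from collections import defaultdict


def relative_priorities(queue, group_of=None):
    """Assign each change a priority (lower is served sooner).

    queue: list of (project, change) tuples in enqueue order.
    group_of: optional dict mapping a project to its group name
              (hypothetical; ungrouped projects count on their own).
    """
    group_of = group_of or {}
    seen = defaultdict(int)  # changes counted so far per project/group
    result = []
    for project, change in queue:
        key = group_of.get(project, project)
        result.append((change, seen[key]))
        seen[key] += 1
    return result


def service_order(queue, group_of=None):
    """Return changes in the order they would receive nodes."""
    prios = relative_priorities(queue, group_of)
    # sorted() is stable, so equal priorities keep enqueue order.
    return [change for change, _ in sorted(prios, key=lambda item: item[1])]


# Hypothetical check queue: three repos from one multi-repo project,
# then one change each from two other projects.
queue = [
    ("tripleo-a", "T1"),
    ("tripleo-b", "T2"),
    ("tripleo-c", "T3"),
    ("nova", "N1"),
    ("puppet-x", "P1"),
]

# Hypothetical grouping: all tripleo repos share one group.
groups = {
    "tripleo-a": "tripleo",
    "tripleo-b": "tripleo",
    "tripleo-c": "tripleo",
    "nova": "integrated",
}

ungrouped = service_order(queue)
grouped = service_order(queue, groups)
```

Without the grouping, every repo's first change ties at the top, so the
three tripleo changes are all served before N1 and P1.  With the
grouping, only T1 counts as a "first" change, and the order becomes T1,
N1, P1, T2, T3.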

The idea is still to try to find a simple and automated way of more
fairly distributing our resources.  If this doesn't work, we can always
return to the previous strict FIFO method.  However, given the extreme
delays we're seeing across the board, I'm trying to avoid the necessity
of actually allocating quota to projects.  If we can't make this work,
and we aren't able to reduce utilization by improving the reliability of
tests (which, by *far*, would be the most effective thing to do -- please
work with Clark on that), we may have to start talking about quotas.

-Jim


