Matt Riedemann <mriedemos@gmail.com> writes:
On 12/3/2018 3:30 PM, James E. Blair wrote:
Since some larger projects consume the bulk of cloud resources in our system, this can be especially frustrating for smaller projects. To be sure, it impacts everyone, but while larger projects receive a continuous stream of results (even if delayed) smaller projects may wait hours before seeing results on a single change.
In order to help all projects maintain a minimal velocity, we've begun dynamically prioritizing node requests based on the number of changes a project has in a given pipeline.
FWIW, and maybe this is happening across the board, but it's currently taking probably ~16 hours to get results on nova changes, which becomes increasingly frustrating when they finally get a node, the tests run, and then the job times out or something because the node is slow (or because of some other known race test failure).
Is there any way to determine or somehow track how long a change has been queued up before and take that into consideration when it's re-enqueued? Like take this change:
https://review.openstack.org/#/c/620154/
That took about 3 days to merge with constant rechecks from the time it was approved. It would be cool if, within 50 queued nova changes (using the example in the original email), zuul knew that 10 of those 50 had already gone through one or more times and weighed those differently, so that when they do get queued up, they are higher in the queue than something that is just going through for its first time.
This suggestion would be difficult to implement, but also, I think it runs counter to some of the ideas that have been put into place in the past. In particular, the idea of clean-check was to make it harder to merge changes with gate failures (under the assumption that they are more likely to introduce racy tests). This might make it easier to recheck-bash bad changes in (along with good).
Anyway, we chatted in IRC a bit and came up with another tweak, which is to group projects together in the check pipeline when setting this priority. We already do in gate, but currently, every project in the system gets equal footing in check for their first change. The change under discussion would group all tripleo projects together, and all the integrated projects together, so that the first change for a tripleo project had the same priority as the first change for an integrated project, and a puppet project, etc. The intent is to further reduce the priority "boost" that projects with lots of repos have.
The idea is still to try to find a simple and automated way of more fairly distributing our resources. If this doesn't work, we can always return to the previous strict FIFO method. However, given the extreme delays we're seeing across the board, I'm trying to avoid the necessity of actually allocating quota to projects. If we can't make this work, and we aren't able to reduce utilization by improving the reliability of tests (which, by *far*, would be the most effective thing to do -- please work with Clark on that), we may have to start talking about that.
-Jim
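For anyone trying to picture the effect of the grouping tweak, here is a rough sketch in Python. It is not Zuul's actual code: the function names, the project names, and the rule "priority = number of changes from the same project or group already ahead in the pipeline" are all assumptions made for illustration.

```python
from collections import defaultdict


def relative_priorities(queue, group_of=None):
    """Give each queued change a relative priority (0 is served first).

    queue    -- list of (change_id, project) tuples in enqueue order
    group_of -- optional dict mapping a project to a group name;
                a project missing from the dict counts as its own group

    Hypothetical rule: a change's priority is the number of changes
    from the same project (or group) that are already ahead of it.
    """
    group_of = group_of or {}
    ahead = defaultdict(int)
    ranked = []
    for change_id, project in queue:
        key = group_of.get(project, project)
        ranked.append((change_id, ahead[key]))
        ahead[key] += 1
    return ranked


def service_order(queue, group_of=None):
    """Return change ids in the order node requests would be served."""
    ranked = relative_priorities(queue, group_of)
    # sorted() is stable, so ties keep their original FIFO order.
    return [change for change, _ in sorted(ranked, key=lambda r: r[1])]


if __name__ == "__main__":
    # Made-up project names, for illustration only.
    queue = [
        ("n1", "nova"),
        ("n2", "nova"),
        ("t1", "tripleo-a"),
        ("t2", "tripleo-b"),
        ("p1", "small-project"),
    ]
    # Per-project counting: both tripleo repos get a "first change" slot.
    print(service_order(queue))
    # -> ['n1', 't1', 't2', 'p1', 'n2']

    # Grouped counting: the tripleo repos share one counter.
    groups = {"tripleo-a": "tripleo", "tripleo-b": "tripleo"}
    print(service_order(queue, groups))
    # -> ['n1', 't1', 'p1', 'n2', 't2']
```

With per-project counting, a team spread over many repos gets one top-priority slot per repo. With grouping, the second tripleo change drops behind the small project's first change, which is the reduction in "boost" described above.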