[infra] A change to Zuul's queuing behavior
Hi,

We recently made a change to how Zuul and Nodepool prioritize node requests. Cloud resources are the major constraint in how long it takes Zuul to run test jobs on proposed changes. Because we're using more resources than ever before (but not necessarily because we're doing more work -- Clark has been helping to identify inefficiencies in other mailing list threads), the amount of time it takes to receive results on a change has been increasing.

Since some larger projects consume the bulk of cloud resources in our system, this can be especially frustrating for smaller projects. To be sure, it impacts everyone, but while larger projects receive a continuous stream of results (even if delayed) smaller projects may wait hours before seeing results on a single change.

In order to help all projects maintain a minimal velocity, we've begun dynamically prioritizing node requests based on the number of changes a project has in a given pipeline. This means that the first change for every project in the check pipeline has the same priority. The same is true for the second change of each project in the pipeline. The result is that if a project has 50 changes in check, and another project has a single change in check, the second project won't have to wait for all 50 changes ahead before it gets nodes allocated. As conditions change (requests are fulfilled, changes are added and removed) the priorities of any unfulfilled requests are adjusted accordingly.

In the gate pipeline, the grouping is by shared change queue. But the gate pipeline still has a higher overall precedence than check.

We hope that this will make for a significant improvement in the experience for smaller projects without causing undue hardship for larger ones. We will be closely observing the new behavior and make any necessary tuning adjustments over the next few weeks. Please let us know if you see any adverse impacts, but don't be surprised if you notice node requests being filled "out of order".

-Jim
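For readers who want to see the mechanics concretely, here is a minimal sketch of the relative-priority idea (illustrative only, not the actual Zuul/Nodepool code; the data structures and function below are made up):

```python
# Illustrative sketch only -- not the actual Zuul/Nodepool implementation.
# It shows one way a per-request "relative priority" could be derived: the
# Nth change a project has in a pipeline gets priority N-1, so the first
# change of every project competes on equal footing regardless of how many
# changes the largest projects have queued.
from collections import defaultdict


def assign_relative_priorities(changes):
    """changes: ordered list of (project, change_id) tuples in a pipeline.
    Returns {change_id: relative_priority}, where 0 is the highest priority."""
    seen_per_project = defaultdict(int)
    priorities = {}
    for project, change_id in changes:
        priorities[change_id] = seen_per_project[project]
        seen_per_project[project] += 1
    return priorities


# Example: one project has 3 changes queued, another has a single change.
pipeline = [("big", "C1"), ("big", "C2"), ("big", "C3"), ("small", "C4")]
print(assign_relative_priorities(pipeline))
# {'C1': 0, 'C2': 1, 'C3': 2, 'C4': 0} -- "small"'s only change is served
# alongside "big"'s first change rather than behind all three.
```

Recomputing such a mapping whenever the pipeline contents change is what keeps the priorities of unfulfilled requests up to date.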
On Mon, 2018-12-03 at 13:30 -0800, James E. Blair wrote:
Hi,
We recently made a change to how Zuul and Nodepool prioritize node requests. Cloud resources are the major constraint in how long it takes Zuul to run test jobs on proposed changes. Because we're using more resources than ever before (but not necessarily because we're doing more work -- Clark has been helping to identify inefficiencies in other mailing list threads), the amount of time it takes to receive results on a change has been increasing.
Since some larger projects consume the bulk of cloud resources in our system, this can be especially frustrating for smaller projects. To be sure, it impacts everyone, but while larger projects receive a continuous stream of results (even if delayed) smaller projects may wait hours before seeing results on a single change.
In order to help all projects maintain a minimal velocity, we've begun dynamically prioritizing node requests based on the number of changes a project has in a given pipeline.
This means that the first change for every project in the check pipeline has the same priority. The same is true for the second change of each project in the pipeline. The result is that if a project has 50 changes in check, and another project has a single change in check, the second project won't have to wait for all 50 changes ahead before it gets nodes allocated.

I could be imagining this, but isn't this how Zuul v2 used to work, or rather how the gate was configured a few cycles ago? I remember in the past, when I was working on smaller projects, it was often quicker to submit patches to those instead of nova, for example. In particular, I remember working on both os-vif and nova and finding my os-vif jobs would often get started much quicker than my nova jobs. Anyway, I think this is hopefully a good change for the majority of projects, but it triggered a feeling of deja vu: was this how the gates used to run?
As conditions change (requests are fulfilled, changes are added and removed) the priorities of any unfulfilled requests are adjusted accordingly.
In the gate pipeline, the grouping is by shared change queue. But the gate pipeline still has a higher overall precedence than check.
We hope that this will make for a significant improvement in the experience for smaller projects without causing undue hardship for larger ones. We will be closely observing the new behavior and make any necessary tuning adjustments over the next few weeks. Please let us know if you see any adverse impacts, but don't be surprised if you notice node requests being filled "out of order".
-Jim
On 12/3/2018 3:30 PM, James E. Blair wrote:
Since some larger projects consume the bulk of cloud resources in our system, this can be especially frustrating for smaller projects. To be sure, it impacts everyone, but while larger projects receive a continuous stream of results (even if delayed) smaller projects may wait hours before seeing results on a single change.
In order to help all projects maintain a minimal velocity, we've begun dynamically prioritizing node requests based on the number of changes a project has in a given pipeline.
FWIW, and maybe this is happening across the board right now, but it's taking probably ~16 hours to get results on nova changes right now, which becomes increasingly frustrating when they finally get a node, tests run, and then the job times out or something because the node is slow (or some other known race test failure).

Is there any way to determine or somehow track how long a change has been queued up before and take that into consideration when it's re-enqueued? Like take this change:

https://review.openstack.org/#/c/620154/

That took about 3 days to merge with constant rechecks from the time it was approved. It would be cool if there was a way to say, from within 50 queued nova changes (using the example in the original email), let's say zuul knew that 10 of those 50 have already gone through one or more times and weigh those differently so when they do get queued up, they are higher in the queue than maybe something that is just going through its first time.

-- 
Thanks,
Matt
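A rough sketch of the weighting being suggested here (purely hypothetical; Zuul does not track prior enqueue attempts this way, and the Change fields and helper below are invented for illustration):

```python
# Hypothetical sketch of the suggestion above -- not an existing Zuul feature.
from dataclasses import dataclass


@dataclass
class Change:
    change_id: str
    attempts: int       # invented field: how many times this change was enqueued before
    enqueued_at: float  # invented field: timestamp of the current enqueue


def prioritize(queued: list[Change]) -> list[Change]:
    """Sort so changes with more prior attempts (e.g. rechecks after known race
    failures) come first; ties fall back to how long they have been waiting."""
    return sorted(queued, key=lambda c: (-c.attempts, c.enqueued_at))


queued = [Change("first-timer", 0, 100.0), Change("rechecked-3x", 3, 105.0)]
print([c.change_id for c in prioritize(queued)])  # ['rechecked-3x', 'first-timer']
```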
Matt Riedemann <mriedemos@gmail.com> writes:
On 12/3/2018 3:30 PM, James E. Blair wrote:
Since some larger projects consume the bulk of cloud resources in our system, this can be especially frustrating for smaller projects. To be sure, it impacts everyone, but while larger projects receive a continuous stream of results (even if delayed) smaller projects may wait hours before seeing results on a single change.
In order to help all projects maintain a minimal velocity, we've begun dynamically prioritizing node requests based on the number of changes a project has in a given pipeline.
FWIW, and maybe this is happening across the board right now, but it's taking probably ~16 hours to get results on nova changes right now, which becomes increasingly frustrating when they finally get a node, tests run and then the job times out or something because the node is slow (or some other known race test failure).
Is there any way to determine or somehow track how long a change has been queued up before and take that into consideration when it's re-enqueued? Like take this change:
https://review.openstack.org/#/c/620154/
That took about 3 days to merge with constant rechecks from the time it was approved. It would be cool if there was a way to say, from within 50 queued nova changes (using the example in the original email), let's say zuul knew that 10 of those 50 have already gone through one or more times and weigh those differently so when they do get queued up, they are higher in the queue than maybe something that is just going through its first time.
This suggestion would be difficult to implement, but also, I think it runs counter to some of the ideas that have been put into place in the past. In particular, the idea of clean-check was to make it harder to merge changes with gate failures (under the assumption that they are more likely to introduce racy tests). This might make it easier to recheck-bash bad changes in (along with good).

Anyway, we chatted in IRC a bit and came up with another tweak, which is to group projects together in the check pipeline when setting this priority. We already do this in gate, but currently, every project in the system gets equal footing in check for their first change. The change under discussion would group all tripleo projects together, and all the integrated projects together, so that the first change for a tripleo project had the same priority as the first change for an integrated project, and a puppet project, etc.

The intent is to further reduce the priority "boost" that projects with lots of repos have.

The idea is still to try to find a simple and automated way of more fairly distributing our resources. If this doesn't work, we can always return to the previous strict FIFO method. However, given the extreme delays we're seeing across the board, I'm trying to avoid the necessity of actually allocating quota to projects. If we can't make this work, and we aren't able to reduce utilization by improving the reliability of tests (which, by *far* would be the most effective thing to do -- please work with Clark on that), we may have to start talking about that.

-Jim
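As a concrete illustration of the grouping tweak (again only a sketch; the project-to-group mapping below is invented and is not the real configuration), the per-project count from the earlier example simply becomes a per-group count:

```python
# Illustrative sketch only -- the group mapping and helper are made up, not the
# actual OpenStack/Zuul configuration. The Nth change belonging to a *group* of
# related projects (rather than to a single project) gets relative priority N-1.
from collections import defaultdict

# Hypothetical mapping of projects to priority groups.
PROJECT_GROUPS = {
    "openstack/nova": "integrated",
    "openstack/cinder": "integrated",
    "openstack/tripleo-common": "tripleo",
    "openstack/tripleo-heat-templates": "tripleo",
}


def assign_group_priorities(changes):
    """changes: ordered list of (project, change_id); returns {change_id: priority}."""
    seen_per_group = defaultdict(int)
    priorities = {}
    for project, change_id in changes:
        group = PROJECT_GROUPS.get(project, project)  # ungrouped projects stand alone
        priorities[change_id] = seen_per_group[group]
        seen_per_group[group] += 1
    return priorities


pipeline = [
    ("openstack/nova", "N1"),
    ("openstack/cinder", "C1"),          # second change of the "integrated" group
    ("openstack/tripleo-common", "T1"),  # still the first change of the "tripleo" group
]
print(assign_group_priorities(pipeline))  # {'N1': 0, 'C1': 1, 'T1': 0}
```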
---- On Sat, 08 Dec 2018 07:53:27 +0900 James E. Blair <corvus@inaugust.com> wrote ----
Matt Riedemann <mriedemos@gmail.com> writes:
On 12/3/2018 3:30 PM, James E. Blair wrote:
Since some larger projects consume the bulk of cloud resources in our system, this can be especially frustrating for smaller projects. To be sure, it impacts everyone, but while larger projects receive a continuous stream of results (even if delayed) smaller projects may wait hours before seeing results on a single change.
In order to help all projects maintain a minimal velocity, we've begun dynamically prioritizing node requests based on the number of changes a project has in a given pipeline.
FWIW, and maybe this is happening across the board right now, but it's taking probably ~16 hours to get results on nova changes right now, which becomes increasingly frustrating when they finally get a node, tests run and then the job times out or something because the node is slow (or some other known race test failure).
Is there any way to determine or somehow track how long a change has been queued up before and take that into consideration when it's re-enqueued? Like take this change:
https://review.openstack.org/#/c/620154/
That took about 3 days to merge with constant rechecks from the time it was approved. It would be cool if there was a way to say, from within 50 queued nova changes (using the example in the original email), let's say zuul knew that 10 of those 50 have already gone through one or more times and weigh those differently so when they do get queued up, they are higher in the queue than maybe something that is just going through its first time.
This suggestion would be difficult to implement, but also, I think it runs counter to some of the ideas that have been put into place in the past. In particular, the idea of clean-check was to make it harder to merge changes with gate failures (under the assumption that they are more likely to introduce racy tests). This might make it easier to recheck-bash bad changes in (along with good).
Anyway, we chatted in IRC a bit and came up with another tweak, which is to group projects together in the check pipeline when setting this priority. We already do this in gate, but currently, every project in the system gets equal footing in check for their first change. The change under discussion would group all tripleo projects together, and all the integrated projects together, so that the first change for a tripleo project had the same priority as the first change for an integrated project, and a puppet project, etc.
The intent is to further reduce the priority "boost" that projects with lots of repos have.
The idea is still to try to find a simple and automated way of more fairly distributing our resources. If this doesn't work, we can always return to the previous strict FIFO method. However, given the extreme delays we're seeing across the board, I'm trying to avoid the necessity of actually allocating quota to projects. If we can't make this work, and we aren't able to reduce utilization by improving the reliability of tests (which, by *far* would be the most effective thing to do -- please work with Clark on that), we may have to start talking about that.
-Jim
We could optimize node usage by removing a change's jobs from the running queue on the first failure, instead of letting them run to completion, and then releasing the nodes. This is a trade-off against getting all the failures at once and fixing them together, but I am not sure that is the case all of the time. For example, if a change has a pep8 error, there is no need to run the integration test jobs for it. This could at least save nodes to some extent.

-gmann
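A minimal sketch of that fail-fast idea (illustrative only; this is not how Zuul implements or configures anything, and the job names are just examples):

```python
# Illustrative sketch only -- not Zuul code. It models the suggestion: when the
# first job result for a change comes back failed, cancel the change's
# remaining builds so their nodes can be released early.
from dataclasses import dataclass, field


@dataclass
class Build:
    job_name: str
    state: str = "running"   # running | success | failure | canceled


@dataclass
class BuildSet:
    builds: list[Build] = field(default_factory=list)

    def report_result(self, job_name: str, success: bool) -> None:
        for build in self.builds:
            if build.job_name == job_name:
                build.state = "success" if success else "failure"
        if not success:
            self.cancel_remaining()

    def cancel_remaining(self) -> None:
        """Fail fast: stop still-running builds so their nodes are freed."""
        for build in self.builds:
            if build.state == "running":
                build.state = "canceled"


bs = BuildSet([Build("pep8"), Build("unit-py36"), Build("tempest-full")])
bs.report_result("pep8", success=False)
print([(b.job_name, b.state) for b in bs.builds])
# pep8 failed, so unit-py36 and tempest-full are canceled instead of running to completion.
```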
On 2018-12-09 23:14:37 +0900 (+0900), Ghanshyam Mann wrote: [...]
We could optimize node usage by removing a change's jobs from the running queue on the first failure, instead of letting them run to completion, and then releasing the nodes. This is a trade-off against getting all the failures at once and fixing them together, but I am not sure that is the case all of the time. For example, if a change has a pep8 error, there is no need to run the integration test jobs for it. This could at least save nodes to some extent.
I can recall plenty of times where I've pushed a change which failed pep8 on some non-semantic whitespace complaint and also had unit test or integration test failures. In those cases it's quite obvious that the pep8 failure couldn't have been the reason for the other failed jobs, so seeing them all saved me from wasting time on additional patches and waiting for more rounds of results.

For that matter, a lot of my time as a developer (or even as a reviewer) is saved by seeing which clusters of jobs fail for a given change. For example, if I see all unit test jobs fail but integration test jobs pass, I can quickly infer that there may be issues with a unit test that's being modified and spend less time fumbling around in the dark with various logs.

It's possible we can save some CI resource consumption with such a trade-off, but doing so comes at the expense of developer and reviewer time, so we have to make sure it's worthwhile. There was a point in the past where we did something similar (only run other jobs if a canary linter job passed), and there are good reasons why we didn't continue it.

-- 
Jeremy Stanley
participants (6)
- Chris Friesen
- corvus@inaugust.com
- Ghanshyam Mann
- Jeremy Stanley
- Matt Riedemann
- Sean Mooney