[all] Gate resources and performance
Hi all,
I have become increasingly concerned with CI performance lately, and have been raising those concerns with various people. Most specifically, I'm worried about our turnaround time or "time to get a result", which has been creeping up lately. Right after the beginning of the year, we had a really bad week where the turnaround time was well over 24 hours. That means if you submit a patch on Tuesday afternoon, you might not get a test result until Thursday. That is, IMHO, a real problem and massively hurts our ability to quickly merge priority fixes as well as just general velocity and morale. If people won't review my code until they see a +1 from Zuul, and that is two days after I submitted it, that's bad.
Things have gotten a little better since that week, due in part to getting past a rush of new year submissions (we think) and also due to some job trimming in various places (thanks Neutron!). However, things are still not great. Being in almost the last timezone of the day, the queue is usually so full when I wake up that it's quite often I don't get to see a result before I stop working that day.
I would like to ask that projects review their jobs for places where they can cut out redundancy, as well as turn their eyes towards optimizations that can be made. I've been looking at both Nova and Glance jobs and have found some things I think we can do less of. I also wanted to get an idea of who is "using too much" in the way of resources, so I've been working on trying to characterize the weight of the jobs we run for a project, based on the number of worker nodes required to run all the jobs, as well as the wall clock time of how long we tie those up. The results are interesting, I think, and may help us to identify where we see some gains.
The idea here is to figure out[1] how many "node hours" it takes to run all the normal jobs on a Nova patch compared to, say, a Neutron one. If the jobs were totally serialized, this is the number of hours a single computer (of the size of a CI worker) would take to do all that work. If the number is 24 hours, that means a single computer could only check *one* patch in a day, running around the clock. I chose the top five projects in terms of usage[2] to report here, as they represent 70% of the total amount of resources consumed. The next five only add up to 13%, so the "top five" seems like a good target group. Here are the results, in order of total consumption:
Project       % of total   Node Hours   Nodes
----------------------------------------------
1. TripleO        38%       31 hours      20
2. Neutron        13%       38 hours      32
3. Nova            9%       21 hours      25
4. Kolla           5%       12 hours      18
5. OSA             5%       22 hours      17
What that means is that a single computer (of the size of a CI worker) couldn't even process the jobs required to run on a single patch for Neutron or TripleO in a 24-hour period. Now, we have lots of workers in the gate, of course, but there is also other potential overhead involved in that parallelism, like waiting for nodes to be available for dependent jobs. And of course, we'd like to be able to check more than one patch per day. Most projects have smaller gate job sets than check, but assuming they are equivalent, a Neutron patch from submission to commit would undergo 76 hours of testing, not including revisions and not including rechecks. That's an enormous amount of time and resource for a single patch!
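For anyone who wants to poke at the math themselves, the node-hours figure is just the sum over a patch's jobs of (wall clock time x number of nodes held). A minimal sketch of that arithmetic (the job names and durations here are invented; the real data comes from Zuul build records via the script in [1]):

# Illustrative only: hypothetical jobs for a single patch.
jobs = [
    # (job name, wall-clock hours, nodes held for the whole run)
    ("tempest-full", 2.0, 1),
    ("grenade-multinode", 1.7, 2),
    ("unit-tests", 0.4, 1),
]

# "Node hours" = how long one CI-worker-sized machine would need if all of
# this work were fully serialized.
node_hours = sum(hours * nodes for _, hours, nodes in jobs)
total_nodes = sum(nodes for _, _, nodes in jobs)
print(f"{total_nodes} nodes, {node_hours:.1f} node hours per patch")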
Now, obviously nobody wants to run fewer tests on patches before they land, and I'm not really suggesting that we take that approach necessarily. However, I think there are probably a lot of places that we can cut down the amount of *work* we do. Some ways to do this are:
1. Evaluate whether or not you need to run all of tempest on two configurations of a devstack on each patch. Maybe having a stripped-down tempest (like just smoke) to run on unique configs, or even specific tests.
2. Revisit your "irrelevant_files" lists to see where you might be able to avoid running heavy jobs on patches that only touch something small (see the sketch after this list).
3. Consider moving some jobs to the experimental queue and run them on-demand for patches that touch particular subsystems or affect particular configurations.
4. Consider some periodic testing for things that maybe don't need to run on every single patch.
5. Re-examine tests that take a long time to run to see if something can be done to make them more efficient.
6. Consider performance improvements in the actual server projects, which also benefits the users.
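On point 2, in case it helps to reason about what a given list will and won't skip: as I understand it, Zuul treats irrelevant-files entries as regular expressions and skips the job only when every file touched by the patch matches one of them. A rough Python sketch of that matching logic (the patterns and file paths below are made up for illustration):

import re

# Hypothetical irrelevant-files list for a heavy job (regexes, as in Zuul).
irrelevant = [r"^docs/.*$", r"^.*\.rst$", r"^releasenotes/.*$"]

def job_should_run(changed_files, irrelevant_patterns):
    # The job is skipped only if *every* changed file matches some pattern.
    compiled = [re.compile(p) for p in irrelevant_patterns]
    return not all(
        any(rx.match(path) for rx in compiled) for path in changed_files
    )

print(job_should_run(["docs/source/index.rst"], irrelevant))    # False: job skipped
print(job_should_run(["nova/compute/manager.py"], irrelevant))  # True: job runs

If the heavy jobs skip cleanly for docs-only and release-note-only patches, that alone saves a surprising number of node hours.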
If you're a project that is not in the top ten then your job configuration probably doesn't matter that much, since your usage is dwarfed by the heavy projects. If the heavy projects would consider making changes to decrease their workload, even small gains have the ability to multiply into noticeable improvement. The higher you are on the above list, the more impact a small change will have on the overall picture.
Also, thanks to Neutron and TripleO, both of which have already addressed this in some respect, and have other changes on the horizon.
Thanks for listening!
--Dan
1: https://gist.github.com/kk7ds/5edbfacb2a341bb18df8f8f32d01b37c
2: http://paste.openstack.org/show/C4pwUpdgwUDrpW6V6vnC/
Hi!
For OSA, a huge issue is how Zuul clones required-projects. This single action alone takes us from 6 to 10 minutes. It's not _so_ big an amount of time, but it's a fair chunk considering that we could shave it off of every CI job. Moreover, I don't think we're the only ones who have more than several repos in required-projects.
And we may have some kind of solution, which is an Ansible module [1] for parallel git clone. It speeds up the process dramatically from what we see in our non-CI deployments. But it needs some time and resources for integration into Zuul, and I don't think we will be able to spend a lot of time on it during this cycle.
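For context, the core of the idea is just doing the clones concurrently instead of serially; a rough standalone sketch (this is not the module's actual code, and the repo list is made up):

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical repo list; in a real job this would come from required-projects.
repos = [
    "https://opendev.org/openstack/nova",
    "https://opendev.org/openstack/neutron",
    "https://opendev.org/openstack/openstack-ansible",
]

def clone(url):
    dest = url.rsplit("/", 1)[-1]
    # git clone is mostly network/IO bound, so threads overlap nicely.
    subprocess.run(["git", "clone", "--quiet", url, dest], check=True)
    return dest

with ThreadPoolExecutor(max_workers=8) as pool:
    for dest in pool.map(clone, repos):
        print("cloned", dest)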
Also, we could probably decrease coverage for some operating systems, but we already aim to test only the minimum of the required configurations and user scenarios out of the possible set. I will still try to drop something to the experimental pipeline, though.
[1] https://opendev.org/openstack/openstack-ansible/src/branch/master/playbooks/...
-- Kind Regards, Dmitriy Rabotyagov
Another thing that I think may help us save some CI time, and that affects most of the projects, is the pyenv build. There was a change made to the Zuul jobs that implements usage of stow, so we spend time building all major Python versions in the images and do an instant select of the valid binary in jobs, rather than waiting for a pyenv build in the pipelines.
Hmm, I guess I didn't realize most of the projects were spending time on this, but it looks like a good thread to chase.
I've been digging through devstack looking for opportunities to make things faster lately. We do a ton of pip invocations, most of which I think are not really necessary, and the ones that are could be batched into a single go at the front to save quite a bit of time. Just a pip install of a requirements file seems to take a while, even when there's nothing that needs installing. We do that in devstack a lot. We also rebuild the tempest venv several times for reasons that I don't understand.
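To illustrate the batching idea (the file list here is hypothetical, not an actual devstack change): pip accepts multiple -r options, so in principle the repeated invocations could be collapsed into a single resolver pass up front, something like:

import subprocess
import sys

# Hypothetical set of requirements files that would otherwise each get
# their own "pip install -r ..." run.
requirement_files = [
    "requirements.txt",
    "test-requirements.txt",
    "doc/requirements.txt",
]

cmd = [sys.executable, "-m", "pip", "install"]
for path in requirement_files:
    cmd += ["-r", path]

# One pip run means one resolver pass and one "already satisfied" check,
# instead of paying that overhead once per file.
subprocess.run(cmd, check=True)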
So yeah, these are the kinds of things I'd really like to see people spend some time on. It is an investment, but worth it because the multiplier is so large. The CI system is so awesome in that it's just a tool that is there and easy to build on. But just like anything that makes stuff easy initially, there are often shortcuts baked into quick work that need to be revisited over time.
So, thanks for bringing this up as a potential thread for improvement!
--Dan
On 2021-02-04 22:29:35 +0200 (+0200), Dmitriy Rabotyagov wrote:
For OSA, a huge issue is how Zuul clones required-projects. This single action alone takes us from 6 to 10 minutes.
[...]
I'd be curious to see some examples of this. Zuul doesn't clone required-projects, but it does push new commits from a cache on the executor to a cache on the node. The executor side caches are updated continually as new builds are scheduled, and the caches on the nodes are refreshed every time the images from which they're booted are assembled (typically daily unless something has temporarily broken our ability to rebuild a particular image). So on average, the executor is pushing only 12 hours worth of new commits for each required project. I don't recall if it performs those push operations in parallel, but I suppose that's something we could look into.
On Thu, 2021-02-04 at 22:54 +0000, Jeremy Stanley wrote:
On 2021-02-04 22:29:35 +0200 (+0200), Dmitriy Rabotyagov wrote:
For OSA, a huge issue is how Zuul clones required-projects. This single action alone takes us from 6 to 10 minutes.
[...]
I'd be curious to see some examples of this. Zuul doesn't clone required-projects, but it does push new commits from a cache on the executor to a cache on the node.
Right, originally that was done by https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/prepare-workspace... as part of the base job https://opendev.org/opendev/base-jobs/src/branch/master/playbooks/base/pre.y... which just syncs the deltas, using git, between the repos prepared on the Zuul executor and those pre-cached in the images.
That will, however, only update the repos that are needed by your job via required-projects. It ensures that you are actually testing what you think you are testing and that you never need to clone any of the repos you are testing, since they will be prepared for you.
This should be much faster than cloning or pulling the repos yourself in the job and, more importantly, it avoids networking issues that can happen if you try to clone from Gerrit in the job. It's also what makes Depends-On work, which is tricky if you don't let Zuul prepare the repos for you.
The executor side caches are updated continually as new builds are scheduled, and the caches on the nodes are refreshed every time the images from which they're booted are assembled (typically daily unless something has temporarily broken our ability to rebuild a particular image). So on average, the executor is pushing only 12 hours worth of new commits for each required project. I don't recall if it performs those push operations in parallel, but I suppose that's something we could look into.
It's not parallel, if I'm reading this right, but typically you will not need to pull a lot of repos: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/prepare-workspace... As in, it's maybe in the low tens for typical full tempest jobs.
OSA is slightly pathological in how it uses required-projects:
https://opendev.org/openstack/openstack-ansible/src/branch/master/zuul.d/job...
It's pulling in a large proportion of the OpenStack repos. It's not surprising it's slower than we would typically expect, but it should be faster than if you actually cloned them without using the cache in the image and on the executor.
Doing the clone in parallel would help in this case, but it might also make sense to reassess how OSA structures its jobs. For example, OSA supports both source and non-source installs, correct? The non-source installs don't need the OpenStack projects, just the OSA repos, since they will be using the binary packages.
So if you had a second intermediary job for the source installs with the OpenStack component repos listed, you could skip updating 50 repos in your binary jobs (I'm assuming the _distro_ jobs are binary, by the way). Currently it's updating 105 repos for every job that is based on openstack-ansible-deploy-aio.
That is actually a very good idea, thanks! Distro jobs take way less time than source ones anyway, so I wasn't thinking a lot about how to optimize them, but it's also important. So I pushed [1] to cover that.
Unfortunately, I haven't found a way to properly flatten the list of projects when YAML anchors are used :(
For example, OSA supports both source and non-source installs, correct? The non-source installs don't need the OpenStack projects, just the OSA repos, since they will be using the binary packages.
So if you had a second intermediary job for the source installs with the OpenStack component repos listed, you could skip updating 50 repos in your binary jobs (I'm assuming the _distro_ jobs are binary, by the way). Currently it's updating 105 repos for every job that is based on openstack-ansible-deploy-aio.
[1] https://review.opendev.org/c/openstack/openstack-ansible/+/774372/1/zuul.d/j...
-- Kind Regards, Dmitriy Rabotyagov
On 2021-02-07 09:28:35 +0200 (+0200), Dmitriy Rabotyagov wrote:
Once you said that, I looked through the actual code of the prepare-workspace-git role more carefully and you're right - all actions are made against already-cached repos there. However, since it mostly uses command tasks, it would still be way more efficient to make some module to replace the command/shell tasks and run things in a multiprocess way. As for an example, you can take any random task from OSA, e.g. [1] - it takes a bit more than 6 minutes. When load on the providers is high (or their volume backend IO is poor), the time increases.
[...]
Okay, so that's these tasks:
https://opendev.org/zuul/zuul-jobs/src/commit/8bdb2b538c79dd75bac14180b905a1... https://opendev.org/zuul/zuul-jobs/src/commit/8bdb2b538c79dd75bac14180b905a1...
It's doing a git clone from the cache on the node into the workspace (in theory from one path to another within the same filesystem, which should normally just result in git creating hardlinks to the original objects/packs), and that took 101 seconds to clone 106 repositories. After that, 83 seconds were spent fixing up configuration on each of those clones. The longest step does indeed seem to be the 128 seconds where it pushed updated refs from the cache on the executor over the network into the prepared workspace on the remote build node.
I wonder if combining these into a single loop could help reduce the iteration overhead, or whether processing repositories in parallel would help (if they're limited by I/O bandwidth then I expect not)? Regardless, yeah, 5m12s does seem like a good chunk of time. On the other hand, it's worth keeping in mind that's just shy of 3 seconds per required-project so like you say, it's mainly impacting jobs with a massive number of required-projects. A different approach might be to revisit the list of required-projects for that job and check whether they're all actually used.
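Spelling out that arithmetic with the same numbers as above, just for clarity:

clone_s, config_s, push_s = 101, 83, 128   # from the example build above
repos = 106

total_s = clone_s + config_s + push_s      # 312 s, i.e. the ~5m12s observed
print(total_s, "seconds total;", round(total_s / repos, 1), "s per required-project")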
On Thu, 4 Feb 2021, 17:29 Dan Smith, dms@danplanet.com wrote:
Hi all,
I have become increasingly concerned with CI performance lately, and have been raising those concerns with various people. Most specifically, I'm worried about our turnaround time or "time to get a result", which has been creeping up lately. Right after the beginning of the year, we had a really bad week where the turnaround time was well over 24 hours. That means if you submit a patch on Tuesday afternoon, you might not get a test result until Thursday. That is, IMHO, a real problem and massively hurts our ability to quickly merge priority fixes as well as just general velocity and morale. If people won't review my code until they see a +1 from Zuul, and that is two days after I submitted it, that's bad.
Thanks for looking into this Dan, it's definitely an important issue and can introduce a lot of friction into an already heavy development process.
[...]
Project       % of total   Node Hours   Nodes
----------------------------------------------
1. TripleO        38%       31 hours      20
2. Neutron        13%       38 hours      32
3. Nova            9%       21 hours      25
4. Kolla           5%       12 hours      18
5. OSA             5%       22 hours      17
Acknowledging Kolla is in the top 5. Deployment projects certainly tend to consume resources. I'll raise this at our next meeting and see what we can come up with.
[...]
Now, obviously nobody wants to run fewer tests on patches before they land, and I'm not really suggesting that we take that approach necessarily. However, I think there are probably a lot of places that we can cut down the amount of *work* we do. Some ways to do this are:
- Evaluate whether or not you need to run all of tempest on two configurations of a devstack on each patch. Maybe having a stripped-down tempest (like just smoke) to run on unique configs, or even specific tests.
- Revisit your "irrelevant_files" lists to see where you might be able to avoid running heavy jobs on patches that only touch something small.
- Consider moving some jobs to the experimental queue and run them on-demand for patches that touch particular subsystems or affect particular configurations.
- Consider some periodic testing for things that maybe don't need to run on every single patch.
- Re-examine tests that take a long time to run to see if something can be done to make them more efficient.
- Consider performance improvements in the actual server projects, which also benefits the users.
7. Improve the reliability of jobs. Especially voting and gating ones. Rechecks increase resource usage and time to results/merge. I found querying the zuul API for failed jobs in the gate pipeline is a good way to find unexpected failures (see the sketch after this list).
8. Reduce the node count in multi node jobs.
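On point 7, a rough sketch of the kind of Zuul API query I mean (the endpoint and parameters are written from memory, so double-check them against the Zuul API docs for your tenant; the project filter is just an example):

import requests

# Recent failed gate builds for one project.
url = "https://zuul.opendev.org/api/tenant/openstack/builds"
params = {
    "pipeline": "gate",
    "result": "FAILURE",
    "project": "openstack/kolla-ansible",  # example project
    "limit": 50,
}

for build in requests.get(url, params=params, timeout=30).json():
    print(build["end_time"], build["job_name"], build["log_url"])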
Acknowledging Kolla is in the top 5. Deployment projects certainly tend to consume resources. I'll raise this at our next meeting and see what we can come up with.
Thanks - at least knowing and acknowledging is a great first step :)
- Improve the reliability of jobs. Especially voting and gating
ones. Rechecks increase resource usage and time to results/merge. I found querying the zuul API for failed jobs in the gate pipeline is a good way to find unexpected failures.
For sure, and thanks for pointing this out. As mentioned in the Neutron example, 70some hours becomes 140some hours if the patch needs a couple rechecks. Rechecks due to spurious job failures reduce capacity and increase latency for everyone.
- Reduce the node count in multi node jobs.
Yeah, I hope that people with three or more nodes in a job are doing so with lots of good reasoning, but this is an important point. Multi-node jobs consume N nodes for the full job runtime, and potentially longer. If only some of the nodes are initially available, I believe zuul will spin those workers up and then wait for more, which means you are just burning node time not doing anything. I'm sure job configuration and other zuul details cause this to vary a lot (and I'm not an expert here), but it's good to note that lower node counts reduce the likelihood of the problem.
--Dan
On 2021-02-04 12:49:02 -0800 (-0800), Dan Smith wrote: [...]
If only some of the nodes are initially available, I believe zuul will spin those workers up and then wait for more, which means you are just burning node time not doing anything.
[...]
I can imagine some pathological situations where this might be the case occasionally, but for the most part they come up around the same time. At risk of diving into too much internal implementation detail, here's the typical process at work:
1. The Zuul scheduler determines that it needs to schedule a build of your job, checks the definition to determine how many of which sorts of nodes that will require, and then puts a node request into Zookeeper with those details.
2. A Nodepool launcher checks for pending requests in Zookeeper, sees the one for your queued build, and evaluates whether it has a provider with the right labels and sufficient available quota to satisfy this request (and if not, skips it in hopes another launcher can instead).
3. If that launcher decides to attempt to fulfil the request, it issues parallel server create calls in the provider it chose, then waits for them to become available and reachable over the Internet.
4. Once the booted nodes are reachable, the launcher returns the request in Zookeeper and the node records are locked for use in the assigned build until it completes.
Even our smallest providers have dozens of instances worth of capacity, and most multi-node jobs use only two or maybe three nodes for a build (granted I've seen some using five); so with the constant churn in builds completing and releasing spent nodes for deletion, there shouldn't be a significant amount of time spent where quota is consumed by some already active instances awaiting their compatriots for the same node request to also reach a ready state (though if the provider has a high incidence of boot failures, this becomes increasingly likely because some server create calls will need to be reissued).
Where this gets a little more complicated is with dependent jobs, as Zuul requires they all be satisfied from the same provider. Certainly a large set of interdependent multi-node jobs becomes harder to choose a provider for and needs to wait longer for enough capacity to be freed there.
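As a purely illustrative toy model of the scenario where this does hurt (all numbers made up): if one server create call has to be reissued after a boot failure, the instances which came up promptly sit allocated and idle until the whole request is ready:

# Made-up readiness times (seconds after the node request) for a 3-node job,
# where one instance had to be relaunched after a boot failure.
ready_at = [20, 35, 310]

start = max(ready_at)                    # the build can't start until all are ready
idle = sum(start - t for t in ready_at)  # node-seconds spent allocated but idle
print(f"build starts at t={start}s with {idle} node-seconds of idle wait")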
On Thu, Feb 4, 2021 at 9:41 PM Mark Goddard mark@stackhpc.com wrote:
On Thu, 4 Feb 2021, 17:29 Dan Smith, dms@danplanet.com wrote:
[...]
- Improve the reliability of jobs. Especially voting and gating ones.
Rechecks increase resource usage and time to results/merge. I found querying the zuul API for failed jobs in the gate pipeline is a good way to find unexpected failures.
7.1. Stop marking dependent patches with Verified-2 if their parent fails in the gate, keep them at Verified+1 (their previous state). This is a common source of unnecessary rechecks in the ironic land.
7.1. Stop marking dependent patches with Verified-2 if their parent fails in the gate, keep them at Verified+1 (their previous state). This is a common source of unnecessary rechecks in the ironic land.
Ooh, that's a good one. I'm guessing that may require more state in zuul. Although, maybe it could check to see if it has +1d that patchset before it -2s just for a parent fail.
--Dan
On 2021-02-05 22:52:15 +0100 (+0100), Dmitry Tantsur wrote: [...]
7.1. Stop marking dependent patches with Verified-2 if their parent fails in the gate, keep them at Verified+1 (their previous state). This is a common source of unnecessary rechecks in the ironic land.
[...]
Zuul generally assumes that if a change fails tests, it's going to need to be revised. Gerrit will absolutely refuse to allow a change to merge if its parent has been revised and the child has not been rebased onto that new revision. Revising or rebasing a change clears the Verified label and will require new test results. Which one or more of these conditions should be considered faulty? I'm guessing you're going to say it's the first one, that we shouldn't assume just because a change fails tests that means it needs to be fixed. This takes us back to the other subthread, wherein we entertain the notion that if changes have failing jobs and the changes themselves aren't at fault, then we should accept this as commonplace and lower our expectations.
Keep in mind that the primary source of pain here is one OpenStack has chosen. That is, the "clean check" requirement that a change get a +1 test result in the check pipeline before it can enter the gate pipeline. This is an arbitrary pipeline criterion, chosen to keep problematic changes from getting approved and making their way through the gate queue like a wrecking-ball, causing repeated test resets for the changes after them until they reach the front and Zuul is finally able to determine they're not just conflicting with other changes ahead. If a major pain for Ironic and other OpenStack projects is the need to revisit the check pipeline after a gate failure, that can be alleviated by dropping the clean check requirement.
Without clean check, a change which got a -2 in the gate could simply be enqueued directly back to the gate again. This is how it works in our other Zuul tenants. But the reason OpenStack started enforcing it is that reviewers couldn't be bothered to confirm changes really were reasonable, had *recent* passing check results, and confirmed that observed job failures were truly unrelated to the changes themselves.
On Sat, Feb 6, 2021 at 12:10 AM Jeremy Stanley fungi@yuggoth.org wrote:
On 2021-02-05 22:52:15 +0100 (+0100), Dmitry Tantsur wrote: [...]
7.1. Stop marking dependent patches with Verified-2 if their parent fails in the gate, keep them at Verified+1 (their previous state). This is a common source of unnecessary rechecks in the ironic land.
[...]
Zuul generally assumes that if a change fails tests, it's going to need to be revised.
Very unfortunately, it's far from being the case in the ironic world.
Gerrit will absolutely refuse to allow a change to merge if its parent has been revised and the child has not been rebased onto that new revision. Revising or rebasing a change clears the Verified label and will require new test results.
This is fair, I'm only referring to the case where the parent has to be rechecked because of a transient problem.
Which one or more of these conditions should be considered faulty? I'm guessing you're going to say it's the first one, that we shouldn't assume just because a change fails tests that means it needs to be fixed.
Unfortunately, yes.
A parallel proposal, which has been rejected numerous times, is to allow rechecking only the failed jobs.
Dmitry
Hi,
On Saturday, 6 February 2021 10:33:17 CET, Dmitry Tantsur wrote:
[...]
A parallel proposal, which has been rejected numerous times, is to allow rechecking only the failed jobs.
Even if I totally understand the cons of that, I would also be for such a possibility. Maybe it would be a good trade-off if, e.g., only cores had such a possibility?
On Sat, Feb 6, 2021 at 8:52 PM Slawek Kaplonski skaplons@redhat.com wrote:
[...]
A parallel proposal, which has been rejected numerous times, is to allow rechecking only the failed jobs.
Even if I totally understand the cons of that, I would also be for such a possibility. Maybe it would be a good trade-off if, e.g., only cores had such a possibility?
That would work for me.
Although currently there is an unfortunate tendency among newcomers to blindly recheck their patches despite clearly not passing some checks. If they could recheck only some jobs, it would limit their negative impact on the whole CI (and maybe make them realize that it's always the same jobs that fail).
Dmitry
On 2021-02-07 14:58:14 +0100 (+0100), Dmitry Tantsur wrote: [...]
Although currently there is an unfortunate tendency among newcomers to blindly recheck their patches despite clearly not passing some checks. If they could recheck only some jobs, it would limit their negative impact on the whole CI (and maybe make them realize that it's always the same jobs that fail).
[...]
Put differently, there is a strong tendency for newcomers and long-timers alike to just keep blindly rechecking their buggy changes until they merge and introduce new nondeterministic behaviors into the software. If they only needed to recheck the specific jobs which failed on those bugs they're introducing, one build at a time, it would become far easier for them to accomplish their apparent (judging from this habit) goal of making the software essentially unusable and untestable.
On Sun, Feb 7, 2021 at 3:12 PM Jeremy Stanley fungi@yuggoth.org wrote:
On 2021-02-07 14:58:14 +0100 (+0100), Dmitry Tantsur wrote: [...]
Although currently there is an unfortunate tendency among newcomers to blindly recheck their patches despite clearly not passing some checks. If they could recheck only some jobs, it would limit their negative impact on the whole CI (and maybe make them realize that it's always the same jobs that fail).
[...]
Put differently, there is a strong tendency for newcomers and long-timers alike to just keep blindly rechecking their buggy changes until they merge and introduce new nondeterministic behaviors into the software. If they only needed to recheck the specific jobs which failed on those bugs they're introducing, one build at a time, it would become far easier for them to accomplish their apparent (judging from this habit) goal of making the software essentially unusable and untestable.
I cannot confirm your observation. In the cases I've seen it's a hard failure, completely deterministic, they just fail to recognize it.
In any case, leaving this right to cores only more or less fixes this concern.
Dmitry
On 2021-02-07 17:25:29 +0100 (+0100), Dmitry Tantsur wrote:
[...]
I cannot confirm your observation. In the cases I've seen it's a hard failure, completely deterministic, they just fail to recognize it.
In any case, leaving this right to cores only more or less fixes this concern.
Before we began enforcing a "clean check" rule with Zuul, there were many occasions where a ~50% failure condition was merged in some project, and upon digging into the origin it was discovered that the patch which introduced it actually failed at least once and was rechecked until it passed, then the prior rechecks were ignored by core reviewers who went on to approve the patch, and the author proceeded to recheck-spam it until it merged, repeatedly tripping tests on the same bug in the process.
Once changes were required to pass their full battery of tests twice in a row to merge, situations like this were drastically reduced in frequency.
On Sat, 2021-02-06 at 20:51 +0100, Slawek Kaplonski wrote:
[...]
A parallel proposal, which has been rejected numerous times, is to allow rechecking only the failed jobs.
Even if I totally understand the cons of that, I would also be for such a possibility. Maybe it would be a good trade-off if, e.g., only cores had such a possibility?
It would require Zuul to be fundamentally altered: currently triggers are defined at the pipeline level, so we would have to define them per job instead. And I'm not sure restricting it to cores would really help. It might, but unless we force the same commit hashes to be reused so that all jobs use the same exact version of the code, I don't think it's safe.
On Thu, Feb 4, 2021 at 7:30 PM Dan Smith dms@danplanet.com wrote:
[...]
First, thanks for bringing this topic up - fully agreed that 24 hours before Zuul reports back on a patch is unacceptable. The tripleo-ci team is *always* looking at improving CI efficiency, if nothing else for the very reason you started this thread, i.e. we don't want so many jobs (or too many long jobs) that it takes 24 or more hours for Zuul to report (this obviously affects us, too). We have been called out as a community on resource usage in the past, so we are of course aware of the issue, acknowledge it, and are trying to address it.
[...]
The idea here is to figure out[1] how many "node hours" it takes to run all the normal jobs on a Nova patch compared to, say, a Neutron one. [...]
Just wanted to point out that the 'node hours' comparison may not be fair, because what is a typical nova patch or a typical tripleo patch? The number of jobs matched & executed by Zuul on a given review will be different from another tripleo patch in the same repo, depending on the files touched or the branch (etc.), and will vary even more compared to other tripleo repos; I think this is the same for nova or any other project with multiple repos.
the jobs were totally serialized, this is the number of hours a single computer (of the size of a CI worker) would take to do all that work. If the number is 24 hours, that means a single computer could only check *one* patch in a day, running around the clock. I chose the top five projects in terms of usage[2] to report here, as they represent 70% of the total amount of resources consumed. The next five only add up to 13%, so the "top five" seems like a good target group. Here are the results, in order of total consumption:
Project      % of total   Node Hours   Nodes
------------------------------------------
1. TripleO       38%       31 hours     20
2. Neutron       13%       38 hours     32
3. Nova           9%       21 hours     25
4. Kolla          5%       12 hours     18
5. OSA            5%       22 hours     17
What that means is that a single computer (of the size of a CI worker) couldn't even process the jobs required to run on a single patch for Neutron or TripleO in a 24-hour period. Now, we have lots of workers in the gate, of course, but there is also other potential overhead involved in that parallelism, like waiting for nodes to be available for dependent jobs. And of course, we'd like to be able to check more than one patch per day. Most projects have smaller gate job sets than check, but assuming they are equivalent, a Neutron patch from submission to commit would undergo 76 hours of testing, not including revisions and not including rechecks. That's an enormous amount of time and resource for a single patch!
Now, obviously nobody wants to run fewer tests on patches before they land, and I'm not really suggesting that we take that approach necessarily. However, I think there are probably a lot of places that we can cut down the amount of *work* we do. Some ways to do this are:
- Evaluate whether or not you need to run all of tempest on two configurations of a devstack on each patch. Maybe having a stripped-down tempest (like just smoke) to run on unique configs, or even specific tests.
- Revisit your "irrelevant_files" lists to see where you might be able to avoid running heavy jobs on patches that only touch something small (see the config sketch after this list).
- Consider moving some jobs to the experimental queue and run them on-demand for patches that touch particular subsystems or affect particular configurations.
- Consider some periodic testing for things that maybe don't need to run on every single patch.
- Re-examine tests that take a long time to run to see if something can be done to make them more efficient.
- Consider performance improvements in the actual server projects, which also benefits the users.
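To make a few of the items above more concrete, here is a rough sketch of what they can look like in a project's Zuul config. The job names, file patterns and the tempest variable are invented/assumed for illustration and would need adjusting per project (note that Zuul spells the attribute irrelevant-files):

    # Sketch only: hypothetical job names and patterns.
    - project:
        check:
          jobs:
            # Skip a heavy job entirely for docs- or release-note-only changes.
            - my-heavy-tempest-job:
                irrelevant-files:
                  - ^doc/.*$
                  - ^releasenotes/.*$
                  - ^.*\.rst$
            # Run only the smoke subset on a secondary configuration
            # (assumes the tempest job's tox_envlist variable).
            - my-alternate-config-job:
                vars:
                  tox_envlist: smoke
        experimental:
          jobs:
            # Run on demand by leaving a "check experimental" review comment.
            - my-rarely-needed-scenario-job
        periodic:
          jobs:
            # Run nightly instead of on every patch.
            - my-slow-full-coverage-job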
ACK. We have recently completed some work (as I said, this is an ongoing issue/process for us) at [1][2] to remove some redundant jobs, which should start to help. Mohamed (mnaser o/) has reached out about this and joined our most recent irc meeting [3]. We've already prioritized some more cleanup work for this sprint, including checking file patterns (e.g. started at [4]), tempest tests, and removing many/all of our non-voting jobs as a first pass. Hope that at least starts to address your concerns,
regards, marios
[1] https://review.opendev.org/q/topic:reduce-content-providers [2] https://review.opendev.org/q/topic:tripleo-c7-update-upgrade-removal [3] http://eavesdrop.openstack.org/meetings/tripleo/2021/tripleo.2021-02-02-14.0... [4] https://review.opendev.org/c/openstack/tripleo-ci/+/773692
If you're a project that is not in the top ten then your job configuration probably doesn't matter that much, since your usage is dwarfed by the heavy projects. If the heavy projects would consider making changes to decrease their workload, even small gains have the ability to multiply into noticeable improvement. The higher you are on the above list, the more impact a small change will have on the overall picture.
Also, thanks to Neutron and TripleO, both of which have already addressed this in some respect, and have other changes on the horizon.
Thanks for listening!
--Dan
1: https://gist.github.com/kk7ds/5edbfacb2a341bb18df8f8f32d01b37c
2: http://paste.openstack.org/show/C4pwUpdgwUDrpW6V6vnC/
Just wanted to point out that the 'node hours' comparison may not be fair, because what is a typical nova patch or a typical tripleo patch? The number of jobs matched & executed by zuul on a given review will differ from those on another tripleo patch in the same repo depending on the files touched, the branch, etc., and will vary even more compared to other tripleo repos; I think this is the same for nova or any other project with multiple repos.
It is indeed important to note that some projects may have wildly different numbers depending on what is touched in the patch. Speaking from experience with Nova, Glance, and QA, most job runs are going to be the same for anything that touches code. Nova will only run unit or functional tests if those are the only files you touched (or just the docs jobs for docs-only changes), but otherwise we're pretty much running everything all the time, AFAIK.
That could be an area for improvement for us, although I think that determining the scope by the files changed is hard for us just because of how intertwined things are, so we probably need to figure out how to target our tests another way. And basically all of Nova is in a single repo. But yes, totally fair point. I picked a couple of test runs at random to generate these numbers, based on them looking like they were running most/all of what is configured. The first time I did that, I picked a stable Neutron patch from before they dropped some testing and got a sky-high number of 54h for a single patch run. So clearly it can vary :)
ACK. We have recently completed some work (as I said, this is an ongoing issue/process for us) at [1][2] to remove some redundant jobs, which should start to help. Mohamed (mnaser o/) has reached out about this and joined our most recent irc meeting [3]. We've already prioritized some more cleanup work for this sprint, including checking file patterns (e.g. started at [4]), tempest tests, and removing many/all of our non-voting jobs as a first pass. Hope that at least starts to address your concerns,
Yep, and thanks a lot for what you've done and continue to do. Obviously looking at the "tripleo is ~40%" report, I expected my script to show tripleo as having some insanely high test load. Looking at the actual numbers, it's clear that you're not only not the heaviest, but given what we know to be a super heavy process of deploying nodes like you do, seemingly relatively efficient. I'm sure there's still improvement that could be made on top of your current list, but I think the lesson in these numbers is that we definitely need to look elsewhere than the traditional openstack pastime of blaming tripleo ;)
For my part so far, I've got a stack of patches proposed to make devstack run quite a bit faster for jobs that use it:
https://review.opendev.org/q/topic:%2522async%2522+status:open+project:opens...
and I've also proposed that nova stop running two grenades which almost 100% overlap (which strangely has to be a change in the tempest repo):
https://review.opendev.org/c/openstack/tempest/+/771499
Both of these have barriers to approval at the moment, but both have big multipliers capable of making a difference.
--Dan
On 2/4/21 12:28 PM, Dan Smith wrote:
Hi all,
I have become increasingly concerned with CI performance lately, and have been raising those concerns with various people. Most specifically, I'm worried about our turnaround time or "time to get a result", which has been creeping up lately. Right after the beginning of the year, we had a really bad week where the turnaround time was well over 24 hours. That means if you submit a patch on Tuesday afternoon, you might not get a test result until Thursday. That is, IMHO, a real problem and massively hurts our ability to quickly merge priority fixes as well as just general velocity and morale. If people won't review my code until they see a +1 from Zuul, and that is two days after I submitted it, that's bad.
Thanks for raising the issue Dan, I've definitely been hit by this issue myself.
Now, obviously nobody wants to run fewer tests on patches before they land, and I'm not really suggesting that we take that approach necessarily. However, I think there are probably a lot of places that we can cut down the amount of *work* we do. Some ways to do this are:
- Evaluate whether or not you need to run all of tempest on two configurations of a devstack on each patch. Maybe having a stripped-down tempest (like just smoke) to run on unique configs, or even specific tests.
- Revisit your "irrelevant_files" lists to see where you might be able to avoid running heavy jobs on patches that only touch something small.
- Consider moving some jobs to the experimental queue and run them on-demand for patches that touch particular subsystems or affect particular configurations.
- Consider some periodic testing for things that maybe don't need to run on every single patch.
- Re-examine tests that take a long time to run to see if something can be done to make them more efficient.
- Consider performance improvements in the actual server projects, which also benefits the users.
There's another little used feature of Zuul called "fail fast", it's something used in the Octavia* repos in our gate jobs:
    project:
      gate:
        fail-fast: true
Description is:
Zuul now supports :attr:`project.<pipeline>.fail-fast` to immediately report and cancel builds on the first failure in a buildset.
I feel it's useful for gate jobs since they've already gone through the check queue and typically shouldn't fail. For example, a mirror failure should stop things quickly, since the next action will most likely be a 'recheck' anyways.
And thinking along those lines, I remember a discussion years ago about having a 'canary' job [0] (credit to Gmann and Jeremy). Is a multi-stage pipeline, where the 'low impact' jobs (pep8, unit, functional, docs) run first and things like Tempest only run if they pass, more palatable now? I realize there are some downsides, but it mostly penalizes those who have failed to run the simple checks locally before pushing out a review. Just wanted to throw it out there.
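Zuul can already express something like that through job dependencies within a pipeline. A minimal sketch, assuming job names along the lines of what most projects already run (adjust to taste):

    # Sketch only: the heavier job starts only after the quick checks pass.
    - project:
        check:
          jobs:
            - openstack-tox-pep8
            - openstack-tox-py38
            - openstack-tox-docs
            - tempest-full-py3:
                dependencies:
                  - openstack-tox-pep8
                  - openstack-tox-py38

The trade-off is the turnaround concern that started this thread: patches that would have passed everything now see their Tempest results later, since the heavy jobs no longer start in parallel with the quick ones.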
-Brian
[0] http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000755....
On 2021-02-05 11:02:46 -0500 (-0500), Brian Haley wrote: [...]
The fundamental downside to these sorts of defensive approaches is that they make it easier to avoid solving the underlying issues. We've designed Zuul to perform most efficiently when it's running tests which are deterministic and mostly free of "false negative" failures. Making your tests and the software being tested efficient and predictable maximizes CI throughput under such an optimistic model. Sinking engineering effort into workarounds for unstable tests and buggy software is time which could have been invested in improving things instead, but also to a great extent removes a lot of the incentive to bother.
Sure it could be seen as a pragmatic approach, accepting that in a large software ecosystem such seemingly pathological problems are actually inevitable, but that strikes me as a bit defeatist. There will of course always be temporary problems resulting from outages/incidents in donated resources or regressions in external dependencies outside our control, but if our background failure rate was significantly reduced it would also be far easier to spot and mitigate an order of magnitude failure increase quickly, rather than trying to find the cause of a sudden 25% uptick in failures.
The fundamental downside to these sorts of defensive approaches is that they make it easier to avoid solving the underlying issues.
We certainly don't want to incentivize relying on aggregate throughput in place of actually making things faster and better. That's why I started this thread. However...
We've designed Zuul to perform most efficiently when it's running tests which are deterministic and mostly free of "false negative" failures. Making your tests and the software being tested efficient and predictable maximizes CI throughput under such an optimistic model.
This is a nice ideal and definitely what we should strive for, no doubt. But I think it's pretty clear that what we're doing here is hard, with potential failures at all layers above and below a thing you're working on at any given point. Striving to get there and expecting we ever will are very different.
I remember back when we moved from serialized tests to parallel ones, there was a lot of concern over being able to reproduce a test failure that only occasionally happens due to ordering. The benefit of running in parallel greatly outweighs the cost of doing so. Still today, it is incredibly time-consuming to reproduce, debug and fix issues that come from running in parallel. Our tests are more complicated (but better, of course) because of it, and just yesterday I -1'd a patch because I could spot some non-reentrant behavior it was proposing to add. In terms of aggregate performance, I'm sure we get far more done with parallelized tests and a somewhat increased spurious failure rate than we would with a very low failure rate and serialized tests.
Sinking engineering effort into workarounds for unstable tests and buggy software is time which could have been invested in improving things instead, but also to a great extent removes a lot of the incentive to bother.
Like everything, it's a tradeoff. If we didn't run in parallel, we'd waste a lot more gate resources in serial, but we would almost definitely have to recheck less, our tests could be a lot simpler and we could spend time (and be rewarded in test execution) by making the actual servers faster instead of debugging failures. You might even argue that such an arrangement would benefit the users more than making our tests capable of running in parallel ;)
Sure it could be seen as a pragmatic approach, accepting that in a large software ecosystem such seemingly pathological problems are actually inevitable, but that strikes me as a bit defeatist. There will of course always be temporary problems resulting from outages/incidents in donated resources or regressions in external dependencies outside our control, but if our background failure rate was significantly reduced it would also be far easier to spot and mitigate an order of magnitude failure increase quickly, rather than trying to find the cause of a sudden 25% uptick in failures.
Looking back on the eight years I've been doing this, I really don't think that zero fails is realistic or even useful as a goal, unless it's your only goal. Thus, I expect we're always going to be ticking up or down over time. Debugging and fixing the non-trivial things that plague us is some of the harder work we do, more so in almost all cases than the work we did that introduced the problem in the first place. We definitely need to be constantly trying to increase stability, but let's be clear that it is likely the _most_ difficult thing a stacker can do with their time.
--Dan
This seemed like a good time to finally revisit https://review.opendev.org/c/openstack/devstack/+/676016 (the OSC as a service patch). Turns out it wasn't as much work to reimplement as I had expected, but hopefully this version addresses the concerns with the old one.
In my local env it takes about 3:45 off my devstack run. Not a huge amount by itself, but multiplied by thousands of jobs it could be significant.
This seemed like a good time to finally revisit https://review.opendev.org/c/openstack/devstack/+/676016 (the OSC as a service patch). Turns out it wasn't as much work to reimplement as I had expected, but hopefully this version addresses the concerns with the old one.
In my local env it takes about 3:45 off my devstack run. Not a huge amount by itself, but multiplied by thousands of jobs it could be significant.
I messed with doing this myself; I wish I had seen yours first. I never really got it to be stable enough to consider it usable because of how many places in devstack we use the return code of an osc command. I could get it to trivially work, but re-stacks and other behaviors weren't quite right. Looks like maybe your version handles that properly?
Anyway, I moved on to a full parallelization of devstack, which largely lets me run all the non-dependent osc commands in parallel, in addition to all kinds of other stuff (like db syncs and various project setup). So far, that effort is giving me about a 60% performance improvement over baseline, and I can do a minimal stack on my local machine in about five minutes:
https://review.opendev.org/c/openstack/devstack/+/771505/
I think we've largely got agreement to get that merged at this point, which as you say, will definitely make some significant improvements purely because of how many times we do that in a day. If your OaaS can support parallel requests, I'd definitely be interested in pursuing that on top, although I think I've largely squeezed out the startup delay we see when we run like eight osc instances in parallel during keystone setup :)
--Dan
On 2/9/21 6:59 PM, Dan Smith wrote:
This seemed like a good time to finally revisit https://review.opendev.org/c/openstack/devstack/+/676016 (the OSC as a service patch). Turns out it wasn't as much work to reimplement as I had expected, but hopefully this version addresses the concerns with the old one.
In my local env it takes about 3:45 off my devstack run. Not a huge amount by itself, but multiplied by thousands of jobs it could be significant.
I messed with doing this myself; I wish I had seen yours first. I never really got it to be stable enough to consider it usable because of how many places in devstack we use the return code of an osc command. I could get it to trivially work, but re-stacks and other behaviors weren't quite right. Looks like maybe your version handles that properly?
It seems to. I had an issue at one point when I wasn't shutting down the systemd service during unstack, but I haven't seen any problems since I fixed that. I've done quite a few devstack runs on the same node with no failures.
Anyway, I moved on to a full parallelization of devstack, which largely lets me run all the non-dependent osc commands in parallel, in addition to all kinds of other stuff (like db syncs and various project setup). So far, that effort is giving me about a 60% performance improvement over baseline, and I can do a minimal stack on my local machine in about five minutes:
Ah, that's nice. The speedup from the parallel execution series alone was pretty comparable to just the client service in my (old and slow) env.
I think we've largely got agreement to get that merged at this point, which as you say, will definitely make some significant improvements purely because of how many times we do that in a day. If your OaaS can support parallel requests, I'd definitely be interested in pursuing that on top, although I think I've largely squeezed out the startup delay we see when we run like eight osc instances in parallel during keystone setup :)
Surprisingly, it does seem to work. I suspect it serializes handling the multiple client calls, but it works and is still faster than just the parallel patch alone (again, in my env). The client service took about a minute off the parallel runtime.
Here's the timing I see locally (in seconds):

  Vanilla devstack:        775
  Client service alone:    529
  Parallel execution:      527
  Parallel client service: 465
Most of the difference between the last two is shorter async_wait times because the deployment steps are taking less time. So not quite as much as before, but still a decent increase in speed.
Here's the timing I see locally: Vanilla devstack: 775 Client service alone: 529 Parallel execution: 527 Parallel client service: 465
Most of the difference between the last two is shorter async_wait times because the deployment steps are taking less time. So not quite as much as before, but still a decent increase in speed.
Yeah, cool, I think you're right that we'll just serialize the calls. It may not be worth the complexity, but if we make the OaaS server able to do a few things in parallel, then we'll re-gain a little more perf because we'll go back to overlapping the *server* side of things. Creating flavors, volume types, networks and uploading the image to glance are all things that should be doable in parallel in the server projects.
465s for a devstack is awesome. Think of all the developer time in $local_fiat_currency we could have saved if we did this four years ago... :)
--Dan
On Wed, Feb 10, 2021 at 1:05 PM Dan Smith dms@danplanet.com wrote:
Here's the timing I see locally: Vanilla devstack: 775 Client service alone: 529 Parallel execution: 527 Parallel client service: 465
Most of the difference between the last two is shorter async_wait times because the deployment steps are taking less time. So not quite as much as before, but still a decent increase in speed.
Yeah, cool, I think you're right that we'll just serialize the calls. It may not be worth the complexity, but if we make the OaaS server able to do a few things in parallel, then we'll re-gain a little more perf because we'll go back to overlapping the *server* side of things. Creating flavors, volume types, networks and uploading the image to glance are all things that should be doable in parallel in the server projects.
465s for a devstack is awesome. Think of all the developer time in $local_fiat_currency we could have saved if we did this four years ago... :)
--Dan
Hey folks, just wanted to check back in on the resource consumption topic. Looking at my measurements, the TripleO group has made quite a bit of progress keeping our enqueued zuul time lower than our historical average. Do you think we can measure where things stand now and have some new numbers available at the PTG?
/me notes we had a blip on 3/25, but there was a one-off issue w/ nodepool in our gate.
Marios Andreou has put a lot of time into this, and others have as well. Kudos, Marios! Thanks all!
Hi Wes,
Just wanted to check back in on the resource consumption topic. Looking at my measurements, the TripleO group has made quite a bit of progress keeping our enqueued zuul time lower than our historical average. Do you think we can measure where things stand now and have some new numbers available at the PTG?
Yeah, in the last few TC meetings I've been saying things like "let's not sample right now because we're in such a weird high-load situation with the release" and "...but we seem to be chewing through a lot of patches, so things seem better." I definitely think the changes made by tripleo and others are helping. Life definitely "feels" better lately. I'll try to circle back and generate a new set of numbers with my script, and also see if I can get updated numbers from Clark on the overall percentages.
Thanks!
--Dan
On Wed, Mar 31, 2021 at 6:00 PM Dan Smith dms@danplanet.com wrote:
Hi Wes,
Just wanted to check back in on the resource consumption topic. Looking at my measurements, the TripleO group has made quite a bit of progress keeping our enqueued zuul time lower than our historical average. Do you think we can measure where things stand now and have some new numbers available at the PTG?
Yeah, in the last few TC meetings I've been saying things like "let's not sample right now because we're in such a weird high-load situation with the release" and "...but we seem to be chewing through a lot of patches, so things seem better." I definitely think the changes made by tripleo and others are helping. Life definitely "feels" better lately. I'll try to circle back and generate a new set of numbers with my script, and also see if I can get updated numbers from Clark on the overall percentages.
Thanks!
--Dan
Sounds good. I'm keeping an eye on it in the meantime w/
http://dashboard-ci.tripleo.org/d/Z4vLSmOGk/cockpit?viewPanel=71&orgId=1
SELECT max("enqueued_time") FROM "zuul-queue-status" WHERE ("pipeline" = 'gate' AND "queue" = 'tripleo')
and http://dashboard-ci.tripleo.org/d/Z4vLSmOGk/cockpit?viewPanel=398&orgId=...
SELECT max("enqueued_time") FROM "zuul-queue-status" WHERE ("pipeline" = 'gate') AND time >= 1601514835817ms GROUP BY time(10m) fill(0);SELECT mean("enqueued_time") FROM "zuul-queue-status" WHERE ("pipeline" = 'check') AND time >= 1601514835817ms GROUP BY time(10m) fill(0)
o/
I'll try to circle back and generate a new set of numbers with my script, and also see if I can get updated numbers from Clark on the overall percentages.
Okay, I re-ran the numbers this morning and got updated 30-day stats from Clark. Here's what I've got (delta from the last report in parens):
Project      % of total   Node Hours   Nodes
----------------------------------------------
1. Neutron       23%       34h (-4)    30 (-2)
2. TripleO       18%       17h (-14)   14 (-6)
3. Nova           7%       22h (+1)    25 (-0)
4. Kolla          6%       10h (-2)    18 (-0)
5. OSA            6%       19h (-3)    16 (-1)
Definitely a lot of improvement from tripleo, so thanks for that! Neutron rose to the top and is still very hefty. I think Nova's 1-hr rise is probably just noise given the node count didn't change. I think we're still waiting on zuulv3 conversion of the grenade multinode job so we can drop the base grenade job, which will make things go down.
I've also got a proposal to make devstack parallel mode be the default, but we're waiting until after devstack cuts wallaby to proceed with that. Hopefully that will result in some across-the-board reduction.
Anyway, definitely moving in the right direction on all fronts, so thanks a lot to everyone who has made efforts in this area. I think once things really kick back up around/after PTG we should measure again and see if the "quality of life" is reasonable, and if not, revisit the numbers in terms of who to lean on to reduce further.
--Dan
On Thu, 1 Apr 2021 at 18:53, Dan Smith dms@danplanet.com wrote:
I'll try to circle back and generate a new set of numbers with my script, and also see if I can get updated numbers from Clark on the overall percentages.
Okay, I re-ran the numbers this morning and got updated 30-day stats from Clark. Here's what I've got (delta from the last report in parens):
Project      % of total   Node Hours   Nodes
----------------------------------------------
1. Neutron       23%       34h (-4)    30 (-2)
2. TripleO       18%       17h (-14)   14 (-6)
3. Nova           7%       22h (+1)    25 (-0)
4. Kolla          6%       10h (-2)    18 (-0)
5. OSA            6%       19h (-3)    16 (-1)
Definitely a lot of improvement from tripleo, so thanks for that! Neutron rose to the top and is still very hefty. I think Nova's 1-hr rise is probably just noise given the node count didn't change. I think we're still waiting on zuulv3 conversion of the grenade multinode job so we can drop the base grenade job, which will make things go down.
Thanks Dan,
I've recently introduced a standalone nova-live-migration-ceph job that might be the cause of the additional hour for Nova.
zuul: Add nova-live-migration-ceph job https://review.opendev.org/c/openstack/nova/+/768466
While this adds extra load, it should be easier to maintain than the previous all-in-one live migration job, which restacked the environment by making direct calls into various devstack plugins.
Regarding the switch to the multinode grenade job, I'm still working through that below and want to land it once Xena is formally open:
zuul: Replace grenade and nova-grenade-multinode with grenade-multinode https://review.opendev.org/c/openstack/nova/+/778885/
This series also includes some attempted cleanup of our irrelevant-files for a few jobs that will hopefully reduce our numbers further. Plenty of work left to do here throughout Xena but it's a start.
Cheers,
Lee
On Wed, Mar 31, 2021 at 8:04 PM Wesley Hayutin whayutin@redhat.com wrote:
Hey folks, just wanted to check back in on the resource consumption topic. Looking at my measurements, the TripleO group has made quite a bit of progress keeping our enqueued zuul time lower than our historical average. Do you think we can measure where things stand now and have some new numbers available at the PTG?
/me notes we had a blip on 3/25, but there was a one-off issue w/ nodepool in our gate.
Marios Andreou has put a lot of time into this, and others have as well. Kudos, Marios! Thanks all!
o/ thanks for the shout out ;)
Big thanks to Sagi (sshnaidm), Chandan (chkumar), Wes (weshay), Alex (mwhahaha) and everyone else who helped us merge those things (https://review.opendev.org/q/topic:tripleo-ci-reduce) - things like tightening files/irrelevant_files matches, removing older/non-voting jobs, removing the upgrade master jobs, and removing layout overrides across tripleo repos (using the centralised tripleo-ci repo templates everywhere instead). That makes maintenance easier, so it is more likely that we will notice and fix new issues moving forward.
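For anyone curious, the files/irrelevant_files tightening mentioned above boils down to stanzas roughly like the following in the centralised templates. The job name, template name and patterns here are invented for illustration; files means "only run this job when these paths are touched", the inverse of irrelevant-files:

    # Sketch only: hypothetical template, job name and file patterns.
    - project-template:
        name: my-tripleo-ci-template
        check:
          jobs:
            - my-heavy-standalone-job:
                files:
                  - ^deployment/.*$
                  - ^environments/.*$
                  - ^zuul.d/.*$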
regards, marios
participants (12)
- Ben Nemec
- Brian Haley
- Dan Smith
- Dmitriy Rabotyagov
- Dmitry Tantsur
- Jeremy Stanley
- Lee Yarwood
- Marios Andreou
- Mark Goddard
- Sean Mooney
- Slawek Kaplonski
- Wesley Hayutin