[all] Gate resources and performance

Ben Nemec openstack at nemebean.com
Wed Feb 10 00:08:29 UTC 2021

This seemed like a good time to finally revisit 
https://review.opendev.org/c/openstack/devstack/+/676016 (the OSC as a 
service patch). Turns out it wasn't as much work to reimplement as I had 
expected, but hopefully this version addresses the concerns with the old one.

In my local env it takes about 3:45 off my devstack run. Not a huge 
amount by itself, but multiplied by thousands of jobs it could be 

On 2/4/21 11:28 AM, Dan Smith wrote:
> Hi all,
> I have become increasingly concerned with CI performance lately, and
> have been raising those concerns with various people. Most specifically,
> I'm worried about our turnaround time or "time to get a result", which
> has been creeping up lately. Right after the beginning of the year, we
> had a really bad week where the turnaround time was well over 24
> hours. That means if you submit a patch on Tuesday afternoon, you might
> not get a test result until Thursday. That is, IMHO, a real problem and
> massively hurts our ability to quickly merge priority fixes as well as
> just general velocity and morale. If people won't review my code until
> they see a +1 from Zuul, and that is two days after I submitted it,
> that's bad.
> Things have gotten a little better since that week, due in part to
> getting past a rush of new year submissions (we think) and also due to
> some job trimming in various places (thanks Neutron!). However, things
> are still not great. Being in almost the last timezone of the day, the
> queue is usually so full when I wake up that it's quite often I don't
> get to see a result before I stop working that day.
> I would like to ask that projects review their jobs for places where
> they can cut out redundancy, as well as turn their eyes towards
> optimizations that can be made. I've been looking at both Nova and
> Glance jobs and have found some things I think we can do less of. I also
> wanted to get an idea of who is "using too much" in the way of
> resources, so I've been working on trying to characterize the weight of
> the jobs we run for a project, based on the number of worker nodes
> required to run all the jobs, as well as the wall clock time of how long
> we tie those up. The results are interesting, I think, and may help us
> to identify where we can see some gains.
> The idea here is to figure out[1] how many "node hours" it takes to run
> all the normal jobs on a Nova patch compared to, say, a Neutron one. If
> the jobs were totally serialized, this is the number of hours a single
> computer (of the size of a CI worker) would take to do all that work. If
> the number is 24 hours, that means a single computer could only check
> *one* patch in a day, running around the clock. I chose the top five
> projects in terms of usage[2] to report here, as they represent 70% of
> the total amount of resources consumed. The next five only add up to
> 13%, so the "top five" seems like a good target group. Here are the
> results, in order of total consumption:
>      Project     % of total  Node Hours  Nodes
>      ------------------------------------------
>      1. TripleO    38%       31 hours     20
>      2. Neutron    13%       38 hours     32
>      3. Nova       9%        21 hours     25
>      4. Kolla      5%        12 hours     18
>      5. OSA        5%        22 hours     17
> What that means is that a single computer (of the size of a CI worker)
> couldn't even process the jobs required to run on a single patch for
> Neutron or TripleO in a 24-hour period. Now, we have lots of workers in
> the gate, of course, but there is also other potential overhead involved
> in that parallelism, like waiting for nodes to be available for
> dependent jobs. And of course, we'd like to be able to check more than
> one patch per day. Most projects have smaller gate job sets than check, but
> assuming they are equivalent, a Neutron patch from submission to commit
> would undergo 76 hours of testing, not including revisions and not
> including rechecks. That's an enormous amount of time and resource for a
> single patch!
> Now, obviously nobody wants to run fewer tests on patches before they
> land, and I'm not really suggesting that we take that approach
> necessarily. However, I think there are probably a lot of places that we
> can cut down the amount of *work* we do. Some ways to do this are:
> 1. Evaluate whether or not you need to run all of tempest on two
>     configurations of a devstack on each patch. Maybe having a
>     stripped-down tempest (like just smoke) to run on unique configs, or
>     even specific tests.
> 2. Revisit your "irrelevant_files" lists to see where you might be able
>     to avoid running heavy jobs on patches that only touch something
>     small.
> 3. Consider moving some jobs to the experimental queue and run them
>     on-demand for patches that touch particular subsystems or affect
>     particular configurations.
> 4. Consider some periodic testing for things that maybe don't need to
>     run on every single patch.
> 5. Re-examine tests that take a long time to run to see if something can
>     be done to make them more efficient.
> 6. Consider performance improvements in the actual server projects,
>     which also benefits the users.
> If you're a project that is not in the top ten then your job
> configuration probably doesn't matter that much, since your usage is
> dwarfed by the heavy projects. If the heavy projects would consider
> making changes to decrease their workload, even small gains have the
> ability to multiply into noticeable improvement. The higher you are on
> the above list, the more impact a small change will have on the overall
> picture.
> Also, thanks to Neutron and TripleO, both of which have already
> addressed this in some respect, and have other changes on the horizon.
> Thanks for listening!
> --Dan
> 1: https://gist.github.com/kk7ds/5edbfacb2a341bb18df8f8f32d01b37c
> 2: http://paste.openstack.org/show/C4pwUpdgwUDrpW6V6vnC/
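
For anyone who wants to sanity-check the arithmetic in the quoted table, here is a minimal Python sketch. It is not Dan's script (that is the gist linked as [1]). It assumes "node hours" means the number of nodes a job holds multiplied by its wall-clock time, summed over all jobs, which is only my reading of the description above. The Job class, the job names, and their durations are made up for illustration. The TABLE values are copied from the quoted message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str          # hypothetical job name, not a real Zuul job
    nodes: int         # worker nodes the job holds while it runs
    wall_hours: float  # wall-clock hours those nodes are tied up


def node_hours(jobs):
    """Hours one CI-worker-sized machine would need to run the jobs serially.

    Assumption: each job costs nodes * wall_hours.
    """
    return sum(job.nodes * job.wall_hours for job in jobs)


# Made-up job set, only to show the calculation.
EXAMPLE_JOBS = [
    Job("example-single-node", nodes=1, wall_hours=2.0),
    Job("example-multinode", nodes=2, wall_hours=1.5),
]

# Figures copied from the quoted table: project -> (% of total, node hours).
TABLE = {
    "TripleO": (38, 31),
    "Neutron": (13, 38),
    "Nova": (9, 21),
    "Kolla": (5, 12),
    "OSA": (5, 22),
}

# Combined share of the five projects, in percent.
top_five_share = sum(pct for pct, _ in TABLE.values())


def check_plus_gate(project):
    """Node hours from submission to commit, assuming gate costs the same as check."""
    return 2 * TABLE[project][1]


def exceeds_one_day(project):
    """True if one machine could not finish a single patch's jobs in 24 hours."""
    return TABLE[project][1] > 24


if __name__ == "__main__":
    print("example job set:", node_hours(EXAMPLE_JOBS), "node hours")
    print("top five share:", top_five_share, "%")
    print("Neutron check + gate:", check_plus_gate("Neutron"), "node hours")
    print("over 24h per patch:", [p for p in TABLE if exceeds_one_day(p)])
```

With the table's numbers this reproduces the 70% combined share and the 76 hours for a Neutron patch, and it flags TripleO and Neutron as the two projects above 24 node hours per patch.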