Another thing that I think may help us save some CI time, and that affects most of the projects, is the pyenv build. There was a change made to the zuul jobs that implements usage of stow: we spend the time building all major Python versions into the images and instantly select the right binary in jobs, rather than waiting for pyenv to build Python during the pipelines.
 
Both the element [1] and the job [2] have landed, but I just haven't had time to propose patches to switch to their usage. It should be pretty straightforward to switch to this, as it has worked nicely internally for me.
 
[1] https://review.opendev.org/c/openstack/diskimage-builder/+/713692
[2] https://review.opendev.org/c/zuul/zuul-jobs/+/751611
 
04.02.2021, 22:39, "Dmitriy Rabotyagov" <noonedeadpunk@ya.ru>:

Hi!

For OSA, a huge issue is how zuul clones required-projects. This single action alone takes us from 6 to 10 minutes. It's not _so_ big an amount of time, but it's a fair one considering that we could save it on every CI job. Moreover, I don't think we're alone in having more than a few repos in required-projects.

And maybe we already have some kind of solution: an ansible module [1] for parallel git cloning. It speeds up the process dramatically from what we see in our non-CI deployments. But it needs some time and resources for integration into zuul, and I don't think we will be able to spend a lot of time on it during this cycle.
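For illustration, the basic idea behind parallel cloning can be sketched in a few lines of Python. This is a minimal sketch in the spirit of the module linked below, not its actual implementation; the function names and the destination path are made up:

```python
# Minimal sketch of parallel repository cloning: run several "git clone"
# processes concurrently instead of one after another. This is NOT the
# git_requirements module itself, just an illustration of the approach.
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_parallel(commands, max_workers=8):
    """Run external commands concurrently; return their exit codes in order."""
    def run(cmd):
        return subprocess.run(cmd, capture_output=True).returncode
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


def clone_all(repo_urls, dest="/opt/src"):
    """Build one 'git clone' command per repo URL and run them in parallel."""
    cmds = [["git", "clone", url, f"{dest}/{url.rsplit('/', 1)[-1]}"]
            for url in repo_urls]
    return run_parallel(cmds)
```

The win comes from overlapping network latency across clones, which is why the speedup is most visible when there are many required-projects.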

Also, we could probably decrease coverage for some operating systems, but we already test only the minimum of the required configurations and user scenarios out of the possible set. I will still try to move something to the experimental pipeline, though.

[1] https://opendev.org/openstack/openstack-ansible/src/branch/master/playbooks/library/git_requirements.py

04.02.2021, 19:35, "Dan Smith" <dms@danplanet.com>:

 Hi all,

 I have become increasingly concerned with CI performance lately, and
 have been raising those concerns with various people. Most specifically,
 I'm worried about our turnaround time or "time to get a result", which
 has been creeping up lately. Right after the beginning of the year, we
 had a really bad week where the turnaround time was well over 24
 hours. That means if you submit a patch on Tuesday afternoon, you might
 not get a test result until Thursday. That is, IMHO, a real problem and
 massively hurts our ability to quickly merge priority fixes as well as
 just general velocity and morale. If people won't review my code until
 they see a +1 from Zuul, and that is two days after I submitted it,
 that's bad.

 Things have gotten a little better since that week, due in part to
 getting past a rush of new year submissions (we think) and also due to
 some job trimming in various places (thanks Neutron!). However, things
 are still not great. Being in almost the last timezone of the day, the
 queue is usually so full when I wake up that it's quite often I don't
 get to see a result before I stop working that day.

 I would like to ask that projects review their jobs for places where
 they can cut out redundancy, as well as turn their eyes towards
 optimizations that can be made. I've been looking at both Nova and
 Glance jobs and have found some things I think we can do less of. I also
 wanted to get an idea of who is "using too much" in the way of
 resources, so I've been working on trying to characterize the weight of
 the jobs we run for a project, based on the number of worker nodes
 required to run all the jobs, as well as the wall clock time of how long
 we tie those up. The results are interesting, I think, and may help us
 to identify where we see some gains.

 The idea here is to figure out[1] how many "node hours" it takes to run
 all the normal jobs on a Nova patch compared to, say, a Neutron one. If
 the jobs were totally serialized, this is the number of hours a single
 computer (of the size of a CI worker) would take to do all that work. If
 the number is 24 hours, that means a single computer could only check
 *one* patch in a day, running around the clock. I chose the top five
 projects in terms of usage[2] to report here, as they represent 70% of
 the total amount of resources consumed. The next five only add up to
 13%, so the "top five" seems like a good target group. Here are the
 results, in order of total consumption:

     Project      % of total   Node Hours   Nodes
     ---------------------------------------------
     1. TripleO      38%        31 hours     20
     2. Neutron      13%        38 hours     32
     3. Nova          9%        21 hours     25
     4. Kolla         5%        12 hours     18
     5. OSA           5%        22 hours     17

 What that means is that a single computer (of the size of a CI worker)
 couldn't even process the jobs required to run on a single patch for
 Neutron or TripleO in a 24-hour period. Now, we have lots of workers in
 the gate, of course, but there is also other potential overhead involved
 in that parallelism, like waiting for nodes to be available for
 dependent jobs. And of course, we'd like to be able to check more than
 one patch per day. Most projects have smaller gate job sets than check, but
 assuming they are equivalent, a Neutron patch from submission to commit
 would undergo 76 hours of testing, not including revisions and not
 including rechecks. That's an enormous amount of time and resource for a
 single patch!
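
 The node-hour arithmetic described above can be sketched as follows. The job names and numbers here are made up for illustration; they are not the real figures behind the table:

```python
# Toy illustration of the "node hours" metric: for each job that runs on a
# patch, multiply its node count by its wall-clock hours, then sum. A total
# of 24 node-hours would mean one CI-worker-sized machine running around
# the clock could check only one patch per day.
jobs = [
    # (job name, nodes, wall-clock hours) -- hypothetical values
    ("tempest-full",      1, 2.0),
    ("tempest-multinode", 2, 2.5),
    ("grenade",           1, 1.5),
]

node_hours = sum(nodes * hours for _, nodes, hours in jobs)
total_nodes = sum(nodes for _, nodes, _ in jobs)
print(node_hours, total_nodes)  # 8.5 node-hours across 4 nodes
```

 Doubling this for a roughly equivalent gate run is how a check figure like Neutron's 38 hours becomes the 76 hours quoted above.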

 Now, obviously nobody wants to run fewer tests on patches before they
 land, and I'm not really suggesting that we take that approach
 necessarily. However, I think there are probably a lot of places that we
 can cut down the amount of *work* we do. Some ways to do this are:

 1. Evaluate whether or not you need to run all of tempest on two
    configurations of a devstack on each patch. Maybe having a
    stripped-down tempest (like just smoke) to run on unique configs, or
    even specific tests.
 2. Revisit your "irrelevant_files" lists to see where you might be able
    to avoid running heavy jobs on patches that only touch something
    small.
 3. Consider moving some jobs to the experimental queue and run them
    on-demand for patches that touch particular subsystems or affect
    particular configurations.
 4. Consider some periodic testing for things that maybe don't need to
    run on every single patch.
 5. Re-examine tests that take a long time to run to see if something can
    be done to make them more efficient.
 6. Consider performance improvements in the actual server projects,
    which also benefits the users.
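
 For item 2, an "irrelevant_files" list lives in the Zuul job definition. The job name and patterns below are purely illustrative:

```yaml
# Illustrative only: skip this (hypothetical) heavy job when a patch
# touches nothing but docs or release notes.
- job:
    name: my-heavy-tempest-job
    parent: devstack-tempest
    irrelevant-files:
      - ^doc/.*$
      - ^releasenotes/.*$
      - ^.*\.rst$
```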

 If you're a project that is not in the top ten then your job
 configuration probably doesn't matter that much, since your usage is
 dwarfed by the heavy projects. If the heavy projects would consider
 making changes to decrease their workload, even small gains have the
 ability to multiply into noticeable improvement. The higher you are on
 the above list, the more impact a small change will have on the overall
 picture.

 Also, thanks to Neutron and TripleO, both of which have already
 addressed this in some respect, and have other changes on the horizon.

 Thanks for listening!

 --Dan

 1: https://gist.github.com/kk7ds/5edbfacb2a341bb18df8f8f32d01b37c
 2: http://paste.openstack.org/show/C4pwUpdgwUDrpW6V6vnC/



-- 
Kind Regards,
Dmitriy Rabotyagov
 

 
 
-- 
Kind Regards,
Dmitriy Rabotyagov