<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 4 Feb 2021, 17:29 Dan Smith, &lt;<a href="mailto:dms@danplanet.com">dms@danplanet.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>

<br>

I have become increasingly concerned with CI performance lately, and<br>

have been raising those concerns with various people. Most specifically,<br>

I'm worried about our turnaround time or "time to get a result", which<br>

has been creeping up lately. Right after the beginning of the year, we<br>

had a really bad week where the turnaround time was well over 24<br>

hours. That means if you submit a patch on Tuesday afternoon, you might<br>

not get a test result until Thursday. That is, IMHO, a real problem and<br>

massively hurts our ability to quickly merge priority fixes as well as<br>

just general velocity and morale. If people won't review my code until<br>

they see a +1 from Zuul, and that is two days after I submitted it,<br>

that's bad.<br></blockquote></div></div><div dir="auto">Thanks for looking into this, Dan. It's definitely an important issue and can introduce a lot of friction into an already heavy development process.</div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Things have gotten a little better since that week, due in part to<br>

getting past a rush of new year submissions (we think) and also due to<br>

some job trimming in various places (thanks Neutron!). However, things<br>

are still not great. Because I am in almost the last timezone of the day, the<br>

queue is usually so full when I wake up that I quite often don't<br>

get to see a result before I stop working that day.<br>

<br>

I would like to ask that projects review their jobs for places where<br>

they can cut out redundancy, as well as turn their eyes towards<br>

optimizations that can be made. I've been looking at both Nova and<br>

Glance jobs and have found some things I think we can do less of. I also<br>

wanted to get an idea of who is "using too much" in the way of<br>

resources, so I've been working on trying to characterize the weight of<br>

the jobs we run for a project, based on the number of worker nodes<br>

required to run all the jobs, as well as the wall clock time of how long<br>

we tie those up. The results are interesting, I think, and may help us<br>

to identify where we can make some gains.<br>

<br>

The idea here is to figure out[1] how many "node hours" it takes to run<br>

all the normal jobs on a Nova patch compared to, say, a Neutron one. If<br>

the jobs were totally serialized, this is the number of hours a single<br>

computer (of the size of a CI worker) would take to do all that work. If<br>

the number is 24 hours, that means a single computer could only check<br>

*one* patch in a day, running around the clock. I chose the top five<br>

projects in terms of usage[2] to report here, as they represent 70% of<br>

the total amount of resources consumed. The next five only add up to<br>

13%, so the "top five" seems like a good target group. Here are the<br>

results, in order of total consumption:<br>

<br>

    Project     % of total  Node Hours  Nodes<br>

    ------------------------------------------<br>

    1. TripleO    38%       31 hours     20<br>

    2. Neutron    13%       38 hours     32<br>

    3. Nova       9%        21 hours     25<br>

    4. Kolla      5%        12 hours     18<br>

    5. OSA        5%        22 hours     17<br></blockquote></div></div><div dir="auto"><br></div><div dir="auto">Acknowledging Kolla is in the top 5. Deployment projects certainly tend to consume resources. I'll raise this at our next meeting and see what we can come up with.</div><div dir="auto"><br></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

What that means is that a single computer (of the size of a CI worker)<br>

couldn't even process the jobs required to run on a single patch for<br>

Neutron or TripleO in a 24-hour period. Now, we have lots of workers in<br>

the gate, of course, but there is also other potential overhead involved<br>

in that parallelism, like waiting for nodes to be available for<br>

dependent jobs. And of course, we'd like to be able to check more than<br>

one patch per day. Most projects have smaller gate job sets than check, but<br>

assuming they are equivalent, a Neutron patch from submission to commit<br>

would undergo 76 hours of testing, not including revisions and not<br>

including rechecks. That's an enormous amount of time and resource for a<br>

single patch!<br>
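<br>
To make the node-hours arithmetic above concrete, here is a minimal Python<br>
sketch. It assumes that the node hours of a job are its wall-clock hours<br>
multiplied by the number of nodes it holds. The helper names and the sample<br>
job records are hypothetical and are not taken from the script in [1]; only<br>
the 38-hour Neutron figure comes from the table above.<br>
<pre>
```python
def node_hours(jobs):
    """Sum wall-clock hours times nodes held over all jobs of one patch.

    `jobs` is a list of (name, hours, nodes) tuples. This layout is an
    assumption made for illustration only.
    """
    return sum(hours * nodes for _name, hours, nodes in jobs)


def patches_per_day(total_node_hours, workers=1):
    """Patches that `workers` nodes could check in 24 hours.

    Ignores scheduling overhead such as waiting for nodes.
    """
    return 24 * workers / total_node_hours


def check_plus_gate(check_node_hours, gate_node_hours=None):
    """Total testing cost from submission to commit, without rechecks.

    If the gate cost is unknown, assume it equals the check cost, as the
    text above does.
    """
    if gate_node_hours is None:
        gate_node_hours = check_node_hours
    return check_node_hours + gate_node_hours


if __name__ == "__main__":
    # Hypothetical job set: one 2-node job and one single-node job.
    sample = [("multinode-job", 1.5, 2), ("unit-tests", 2.0, 1)]
    print(node_hours(sample))                # 5.0
    # Neutron figure from the table: 38 node hours per check run.
    print(check_plus_gate(38))               # 76
    print(round(patches_per_day(38), 2))     # 0.63
```
</pre>
With 38 node hours per run, a single worker gets through less than one<br>
patch per day, which is the point made above.<br>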

<br>

Now, obviously nobody wants to run fewer tests on patches before they<br>

land, and I'm not really suggesting that we take that approach<br>

necessarily. However, I think there are probably a lot of places that we<br>

can cut down the amount of *work* we do. Some ways to do this are:<br>

<br>

1. Evaluate whether or not you need to run all of tempest on two<br>

   configurations of a devstack on each patch. Maybe having a<br>

   stripped-down tempest (like just smoke) to run on unique configs, or<br>

   even specific tests.<br>

2. Revisit your "irrelevant_files" lists to see where you might be able<br>

   to avoid running heavy jobs on patches that only touch something<br>

   small.<br>

3. Consider moving some jobs to the experimental queue and run them<br>

   on-demand for patches that touch particular subsystems or affect<br>

   particular configurations.<br>

4. Consider some periodic testing for things that maybe don't need to<br>

   run on every single patch.<br>

5. Re-examine tests that take a long time to run to see if something can<br>

   be done to make them more efficient.<br>

6. Consider performance improvements in the actual server projects,<br>

   which also benefits the users.</blockquote></div></div><div dir="auto"><br></div><div dir="auto"></div><div dir="auto">7. Improve the reliability of jobs, especially voting and gating ones. Rechecks increase resource usage and time to results/merge. I found that querying the Zuul API for failed jobs in the gate pipeline is a good way to find unexpected failures.</div><div dir="auto"><br></div><div dir="auto">8. Reduce the node count in multi-node jobs.</div><div dir="auto"><br></div><div dir="auto"></div><div dir="auto"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

If you're a project that is not in the top ten then your job<br>

configuration probably doesn't matter that much, since your usage is<br>

dwarfed by the heavy projects. If the heavy projects would consider<br>

making changes to decrease their workload, even small gains have the<br>

ability to multiply into noticeable improvement. The higher you are on<br>

the above list, the more impact a small change will have on the overall<br>

picture.<br>

<br>

Also, thanks to Neutron and TripleO, both of which have already<br>

addressed this in some respect, and have other changes on the horizon.<br>

<br>

Thanks for listening!<br>

<br>

--Dan<br>

<br>

1: <a href="https://gist.github.com/kk7ds/5edbfacb2a341bb18df8f8f32d01b37c" rel="noreferrer noreferrer" target="_blank">https://gist.github.com/kk7ds/5edbfacb2a341bb18df8f8f32d01b37c</a><br>

2: <a href="http://paste.openstack.org/show/C4pwUpdgwUDrpW6V6vnC/" rel="noreferrer noreferrer" target="_blank">http://paste.openstack.org/show/C4pwUpdgwUDrpW6V6vnC/</a><br>

<br>

</blockquote></div></div></div>
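<div dir="auto">On point 7 above (querying the Zuul API for failed gate jobs), a rough Python sketch of one way to do it follows. It assumes the OpenDev Zuul deployment at zuul.opendev.org with the "openstack" tenant, and that each build record carries a "job_name" field; the function names are hypothetical, and the base URL needs adjusting for other deployments.</div>
<pre>
```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the Zuul REST API for the "openstack" tenant.
ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack"


def failed_gate_builds_url(project, limit=50):
    """Build a query URL for recent failed builds in the gate pipeline."""
    query = urlencode({
        "project": project,
        "pipeline": "gate",
        "result": "FAILURE",
        "limit": limit,
    })
    return f"{ZUUL_API}/builds?{query}"


def fetch_builds(url):
    """Fetch and decode the JSON list of build records (needs network)."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def failures_by_job(builds):
    """Count failures per job name, most frequent first."""
    return Counter(b["job_name"] for b in builds).most_common()


if __name__ == "__main__":
    url = failed_gate_builds_url("openstack/kolla")
    for job, count in failures_by_job(fetch_builds(url)):
        print(f"{count:3d}  {job}")
```
</pre>
<div dir="auto">Jobs near the top of that list are candidates for a closer look, since a failure in the gate pipeline is usually unexpected.</div>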