[all] Gate resources and performance

Dmitry Tantsur dtantsur at redhat.com
Fri Feb 5 21:52:15 UTC 2021

On Thu, Feb 4, 2021 at 9:41 PM Mark Goddard <mark at stackhpc.com> wrote:

> On Thu, 4 Feb 2021, 17:29 Dan Smith, <dms at danplanet.com> wrote:
>> Hi all,
>> I have become increasingly concerned with CI performance lately, and
>> have been raising those concerns with various people. Most specifically,
>> I'm worried about our turnaround time or "time to get a result", which
>> has been creeping up lately. Right after the beginning of the year, we
>> had a really bad week where the turnaround time was well over 24
>> hours. That means if you submit a patch on Tuesday afternoon, you might
>> not get a test result until Thursday. That is, IMHO, a real problem and
>> massively hurts our ability to quickly merge priority fixes as well as
>> just general velocity and morale. If people won't review my code until
>> they see a +1 from Zuul, and that is two days after I submitted it,
>> that's bad.
> Thanks for looking into this Dan, it's definitely an important issue and
> can introduce a lot of friction into an already heavy development process.
>> Things have gotten a little better since that week, due in part to
>> getting past a rush of new year submissions (we think) and also due to
>> some job trimming in various places (thanks Neutron!). However, things
>> are still not great. Being in almost the last timezone of the day, the
>> queue is usually so full when I wake up that I quite often don't get to
>> see a result before I stop working that day.
>> I would like to ask that projects review their jobs for places where
>> they can cut out redundancy, as well as turn their eyes towards
>> optimizations that can be made. I've been looking at both Nova and
>> Glance jobs and have found some things I think we can do less of. I also
>> wanted to get an idea of who is "using too much" in the way of
>> resources, so I've been working on trying to characterize the weight of
>> the jobs we run for a project, based on the number of worker nodes
>> required to run all the jobs, as well as the wall clock time of how long
>> we tie those up. The results are interesting, I think, and may help us
>> to identify where we might see some gains.
>> The idea here is to figure out[1] how many "node hours" it takes to run
>> all the normal jobs on a Nova patch compared to, say, a Neutron one. If
>> the jobs were totally serialized, this is the number of hours a single
>> computer (of the size of a CI worker) would take to do all that work. If
>> the number is 24 hours, that means a single computer could only check
>> *one* patch in a day, running around the clock. I chose the top five
>> projects in terms of usage[2] to report here, as they represent 70% of
>> the total amount of resources consumed. The next five only add up to
>> 13%, so the "top five" seems like a good target group. Here are the
>> results, in order of total consumption:
>>     Project     % of total  Node Hours  Nodes
>>     ------------------------------------------
>>     1. TripleO    38%       31 hours     20
>>     2. Neutron    13%       38 hours     32
>>     3. Nova       9%        21 hours     25
>>     4. Kolla      5%        12 hours     18
>>     5. OSA        5%        22 hours     17
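
For the curious, the full accounting script is in [1]; the idea can also be
approximated directly from the public Zuul builds API. A minimal sketch in
Python, assuming one node per job (the builds API does not report nodeset
sizes, so multinode jobs are undercounted, and the change number below is a
placeholder):

    import requests

    # Public Zuul API for the OpenStack tenant.
    ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack/builds"

    def node_hours(change, pipeline="check"):
        """Sum the wall-clock hours of all builds Zuul ran for a change."""
        builds = requests.get(
            ZUUL_API,
            params={"change": change, "pipeline": pipeline, "limit": 200},
        ).json()
        # Each build reports its duration in seconds; builds that never
        # started report none at all.
        seconds = sum(b["duration"] for b in builds if b.get("duration"))
        return seconds / 3600.0

    print("%.1f node hours" % node_hours(123456))  # placeholder change
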
> Acknowledging Kolla is in the top 5. Deployment projects certainly tend to
> consume resources. I'll raise this at our next meeting and see what we can
> come up with.
>> What that means is that a single computer (of the size of a CI worker)
>> couldn't even process the jobs required to run on a single patch for
>> Neutron or TripleO in a 24-hour period. Now, we have lots of workers in
>> the gate, of course, but there is also other potential overhead involved
>> in that parallelism, like waiting for nodes to be available for
>> dependent jobs. And of course, we'd like to be able to check more than
>> one patch per day. Most projects have smaller gate job sets than check, but
>> assuming they are equivalent, a Neutron patch from submission to commit
>> would undergo 76 hours of testing (38 in check plus another 38 in gate),
>> not including revisions and not including rechecks. That's an enormous
>> amount of time and resources for a single patch!
>> Now, obviously nobody wants to run fewer tests on patches before they
>> land, and I'm not really suggesting that we take that approach
>> necessarily. However, I think there are probably a lot of places that we
>> can cut down the amount of *work* we do. Some ways to do this are:
>> 1. Evaluate whether or not you need to run all of tempest on two
>>    configurations of a devstack on each patch. Maybe having a
>>    stripped-down tempest (like just smoke) to run on unique configs, or
>>    even specific tests.
>> 2. Revisit your "irrelevant_files" lists to see where you might be able
>>    to avoid running heavy jobs on patches that only touch something
>>    small.
>> 3. Consider moving some jobs to the experimental queue and run them
>>    on-demand for patches that touch particular subsystems or affect
>>    particular configurations.
>> 4. Consider some periodic testing for things that maybe don't need to
>>    run on every single patch.
>> 5. Re-examine tests that take a long time to run to see if something can
>>    be done to make them more efficient.
>> 6. Consider performance improvements in the actual server projects,
>>    which also benefits the users.
> 7. Improve the reliability of jobs, especially voting and gating ones.
> Rechecks increase resource usage and time to results/merge. I've found
> that querying the Zuul API for failed jobs in the gate pipeline is a good
> way to find unexpected failures (see the sketch after this list).

7.1. Stop marking dependent patches with Verified-2 if their parent fails
in the gate; keep them at Verified+1 (their previous state). This is a
common source of unnecessary rechecks in ironic land.

> 8. Reduce the node count in multi node jobs.
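
On point 7, the Zuul API query Mark describes fits in a few lines. A
minimal sketch against the same builds endpoint as above (the project name
is just an example):

    import requests

    ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack/builds"

    def recent_gate_failures(project, limit=100):
        """Print recent failed gate builds for a project."""
        builds = requests.get(
            ZUUL_API,
            params={"project": project, "pipeline": "gate",
                    "result": "FAILURE", "limit": limit},
        ).json()
        for build in builds:
            print(build["end_time"], build["job_name"], build["log_url"])

    recent_gate_failures("openstack/ironic")

Dropping the result filter and sorting by duration is an equally quick way
to spot the long-running jobs from point 5.
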
>> If you're a project that is not in the top ten, then your job
>> configuration probably doesn't matter that much, since your usage is
>> dwarfed by the heavy projects. If the heavy projects would consider
>> making changes to decrease their workload, even small gains have the
>> ability to multiply into noticeable improvement. The higher you are on
>> the above list, the more impact a small change will have on the overall
>> picture.
>> Also, thanks to Neutron and TripleO, both of which have already
>> addressed this in some respect, and have other changes on the horizon.
>> Thanks for listening!
>> --Dan
>> 1: https://gist.github.com/kk7ds/5edbfacb2a341bb18df8f8f32d01b37c
>> 2: http://paste.openstack.org/show/C4pwUpdgwUDrpW6V6vnC/
