[all] Gate resources and performance

Marios Andreou marios at redhat.com
Fri Feb 5 08:33:40 UTC 2021

On Thu, Feb 4, 2021 at 7:30 PM Dan Smith <dms at danplanet.com> wrote:

> Hi all,
> I have become increasingly concerned with CI performance lately, and
> have been raising those concerns with various people. Most specifically,
> I'm worried about our turnaround time or "time to get a result", which
> has been creeping up lately. Right after the beginning of the year, we
> had a really bad week where the turnaround time was well over 24
> hours. That means if you submit a patch on Tuesday afternoon, you might
> not get a test result until Thursday. That is, IMHO, a real problem and
> massively hurts our ability to quickly merge priority fixes as well as
> just general velocity and morale. If people won't review my code until
> they see a +1 from Zuul, and that is two days after I submitted it,
> that's bad.
> Things have gotten a little better since that week, due in part to
> getting past a rush of new year submissions (we think) and also due to
> some job trimming in various places (thanks Neutron!). However, things
> are still not great. Being in almost the last timezone of the day, the
> queue is usually so full when I wake up that it's quite often I don't
> get to see a result before I stop working that day.

First, thanks for raising this topic. Fully agreed that waiting 24 hours
for zuul to report back on a patch is unacceptable. The tripleo-ci team is
*always* looking at improving CI efficiency, if for no other reason than the
one that prompted this thread: we don't want so many jobs (or so many long
jobs) that it takes 24 or more hours for zuul to report, since this
obviously affects us too. We have been called out as a community on
resource usage in the past, so we are aware of the issue, acknowledge it,
and are trying to address it.

> I would like to ask that projects review their jobs for places where
> they can cut out redundancy, as well as turn their eyes towards
> optimizations that can be made. I've been looking at both Nova and
> Glance jobs and have found some things I think we can do less of. I also
> wanted to get an idea of who is "using too much" in the way of
> resources, so I've been working on trying to characterize the weight of
> the jobs we run for a project, based on the number of worker nodes
> required to run all the jobs, as well as the wall clock time of how long
> we tie those up. The results are interesting, I think, and may help us
> to identify where we see some gains.
> The idea here is to figure out[1] how many "node hours" it takes to run
> all the normal jobs on a Nova patch compared to, say, a Neutron one. If

Just wanted to point out that the 'node hours' comparison may not be fair,
because what is a typical nova patch or a typical tripleo patch? The set of
jobs matched and executed by zuul differs from one tripleo patch to another
in the same repo, depending on the files touched, the branch, etc., and
varies even more across the other tripleo repos. I think the same holds for
nova or any other project with multiple repos.

> the jobs were totally serialized, this is the number of hours a single
> computer (of the size of a CI worker) would take to do all that work. If
> the number is 24 hours, that means a single computer could only check
> *one* patch in a day, running around the clock. I chose the top five
> projects in terms of usage[2] to report here, as they represent 70% of
> the total amount of resources consumed. The next five only add up to
> 13%, so the "top five" seems like a good target group. Here are the
> results, in order of total consumption:
>     Project     % of total  Node Hours  Nodes
>     ------------------------------------------
>     1. TripleO    38%       31 hours     20
>     2. Neutron    13%       38 hours     32
>     3. Nova       9%        21 hours     25
>     4. Kolla      5%        12 hours     18
>     5. OSA        5%        22 hours     17
> What that means is that a single computer (of the size of a CI worker)
> couldn't even process the jobs required to run on a single patch for
> Neutron or TripleO in a 24-hour period. Now, we have lots of workers in
> the gate, of course, but there is also other potential overhead involved
> in that parallelism, like waiting for nodes to be available for
> dependent jobs. And of course, we'd like to be able to check more than
> one patch per day. Most projects have smaller gate job sets than check, but
> assuming they are equivalent, a Neutron patch from submission to commit
> would undergo 76 hours of testing, not including revisions and not
> including rechecks. That's an enormous amount of time and resource for a
> single patch!
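
For anyone who wants to reproduce the arithmetic, here is a minimal sketch
of how a "node hours" figure can be derived. It is not the script from
Dan's [1]; the Job type, the function names and the example job list are
all hypothetical, and only the 38-hour Neutron figure comes from the table
above.

```python
from typing import NamedTuple


class Job(NamedTuple):
    """One CI job on a patch (hypothetical structure for illustration)."""
    name: str
    nodes: int      # worker nodes the job occupies
    hours: float    # wall-clock time the job holds those nodes


def node_hours(jobs):
    """Hours one worker-sized machine would need if all jobs ran serially."""
    return sum(job.nodes * job.hours for job in jobs)


def total_nodes(jobs):
    """Number of worker nodes needed to run every job on the patch."""
    return sum(job.nodes for job in jobs)


def check_plus_gate_hours(check_node_hours):
    """Assumes the gate job set equals the check job set, as in the text."""
    return 2 * check_node_hours


def patches_per_day(node_hours_per_patch, workers=1):
    """Upper bound on patches checked per day, ignoring scheduling overhead."""
    return workers * 24 / node_hours_per_patch


# Made-up job list, NOT real data from any project.
example_jobs = [
    Job("unit-tests", nodes=1, hours=0.5),
    Job("tempest-full", nodes=1, hours=2.0),
    Job("multinode-upgrade", nodes=3, hours=1.5),
]
```

With the made-up list, node_hours(example_jobs) is 7.0 and
total_nodes(example_jobs) is 5. Plugging in the Neutron figure,
check_plus_gate_hours(38) gives the 76 hours quoted above, and
patches_per_day(38) is below 1, i.e. a single worker cannot check even one
such patch in a day.
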
> Now, obviously nobody wants to run fewer tests on patches before they
> land, and I'm not really suggesting that we take that approach
> necessarily. However, I think there are probably a lot of places that we
> can cut down the amount of *work* we do. Some ways to do this are:
> 1. Evaluate whether or not you need to run all of tempest on two
>    configurations of a devstack on each patch. Maybe having a
>    stripped-down tempest (like just smoke) to run on unique configs, or
>    even specific tests.
> 2. Revisit your "irrelevant_files" lists to see where you might be able
>    to avoid running heavy jobs on patches that only touch something
>    small.
> 3. Consider moving some jobs to the experimental queue and run them
>    on-demand for patches that touch particular subsystems or affect
>    particular configurations.
> 4. Consider some periodic testing for things that maybe don't need to
>    run on every single patch.
> 5. Re-examine tests that take a long time to run to see if something can
>    be done to make them more efficient.
> 6. Consider performance improvements in the actual server projects,
>    which also benefits the users.
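
To illustrate item 2, here is a rough sketch of the idea behind an
irrelevant-files list: a job is skipped only when every file touched by the
patch matches one of the patterns. This is a simplified model and not
zuul's own code; the function name and the patterns are hypothetical, and
the use of re.search is an assumption, so check the zuul documentation for
the exact matching rules.

```python
import re


def job_should_run(changed_files, irrelevant_patterns):
    """Return False only if every changed file matches some pattern.

    Simplified model of an irrelevant-files check, for illustration only.
    """
    regexes = [re.compile(pattern) for pattern in irrelevant_patterns]
    all_irrelevant = all(
        any(regex.search(path) for regex in regexes)
        for path in changed_files
    )
    return not all_irrelevant


# Hypothetical patterns for a heavy job that docs-only patches can skip.
heavy_job_irrelevant = [
    r"^docs/.*$",
    r"^releasenotes/.*$",
    r"^.*\.rst$",
]
```

With these patterns, a patch touching only docs/index.rst would skip the
heavy job, while a patch touching docs/index.rst plus any code file would
still run it. One relevant file is enough to trigger the job, which is why
overly narrow lists leave heavy jobs running on trivial patches.
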

ACK. We have recently completed some work (as I said, this is an ongoing
issue/process for us) at [1][2] to remove some redundant jobs, which should
start to help. Mohamed (mnaser o/) has reached out about this and joined
our most recent irc meeting [3]. We've already prioritized some more
cleanup work for this sprint, including checking file patterns (e.g. started
at [4]), tempest tests and removing many/all of our non-voting jobs as a
first pass. Hope that at least starts to address your concern.

regards, marios

[1] https://review.opendev.org/q/topic:reduce-content-providers
[2] https://review.opendev.org/q/topic:tripleo-c7-update-upgrade-removal
[4] https://review.opendev.org/c/openstack/tripleo-ci/+/773692

> If you're a project that is not in the top ten then your job
> configuration probably doesn't matter that much, since your usage is
> dwarfed by the heavy projects. If the heavy projects would consider
> making changes to decrease their workload, even small gains have the
> ability to multiply into noticeable improvement. The higher you are on
> the above list, the more impact a small change will have on the overall
> picture.
> Also, thanks to Neutron and TripleO, both of which have already
> addressed this in some respect, and have other changes on the horizon.
> Thanks for listening!
> --Dan
> 1: https://gist.github.com/kk7ds/5edbfacb2a341bb18df8f8f32d01b37c
> 2: http://paste.openstack.org/show/C4pwUpdgwUDrpW6V6vnC/