[openstack-dev] [tripleo] [ci] recheck impact on CI infrastructure

Sven Anderson sven at redhat.com
Mon Jan 2 15:49:35 UTC 2017


Hi Emilien and all,

On 16.12.2016 01:26, Emilien Macchi wrote:
> On Thu, Dec 15, 2016 at 12:22 PM, Sven Anderson <sven at redhat.com> wrote:
>> Hi all,
>>
>> while I was waiting again for the CI to be fixed and didn't want to
>> torture it with additional rechecks, I wanted to find out how much of
>> our CI infrastructure we waste with rechecks. My assumption was that
>> every recheck is a waste of resources based on a false negative, because
>> it renders the previous build useless. So I wrote a small script[1] to
>> calculate how many rechecks are made on average per built patch-set. It
>> calculates the number of patch-sets of merged changes that CI was
>> testing (some patch-sets are not, because they were updated before CI
>> started testing), the number of rechecks issued on these patch-sets, and
>> a value "CI-factor", which is the factor by which the rechecks increased
>> the CI runs; that is, without rechecks it would be 1, and if every
>> tested patch-set had exactly one recheck it would be 2.
> 
> I see 2 different topics here.
> 
> # One is not related to $topic but still worth mentioning:
> "while I was waiting again for the CI to be fixed"
> 
> This week has been tough, and many of us burnt our time to resolve
> different complex problems in TripleO CI, mostly related to external
> dependencies (qemu upgrade, centos 7.3 upgrade, tripleo-ci infra,
> etc).
> Resolving these problems is very challenging and you'll notice that
> only a few of us actually work on this task, while a lot of people
> continue to push their features "hoping" that it will pass CI
> sometimes and if not, well, we'll do 'recheck'.
> That is a way of working I would say. I personally can't continue to
> code if the project I'm working on has broken CI.
> 
> In a previous job, I worked in a team where everyone stopped
> regular work when CI was broken and focused on fixing it.
> I'm not saying everyone should stop their tasks and help, but this
> "wait and see" comment doesn't actually help us to move forward.
> People need to get more involved in CI and be more helpful. I know
> it's difficult, but it's something anyone can learn, like you would
> learn how to write Python code for example.

I think you took my mail the wrong way. I didn't want to say that
anyone is not doing their job right, and I didn't want to complain. I
know how challenging this is. In my previous job I was the person
running the CI (among other things). I just wanted to share the
results, because I think it's interesting what percentage of our CI
infrastructure is "wasted" by rechecks: on the one hand to raise
awareness that we should not just blindly "recheck until verified", and
on the other hand to show how valuable it is to keep CI stable.
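
For reference, the arithmetic behind that percentage can be sketched as
follows. This is only a minimal sketch under the assumption stated in
the original mail (every recheck is a wasted CI run); the function names
and the example numbers are made up for illustration and are not taken
from the script[1].

```python
def ci_factor(tested_patch_sets: int, rechecks: int) -> float:
    """Factor by which rechecks increase the number of CI runs.

    1.0 means no rechecks at all; 2.0 means one recheck per tested
    patch-set on average.
    """
    if tested_patch_sets <= 0:
        raise ValueError("need at least one tested patch-set")
    return (tested_patch_sets + rechecks) / tested_patch_sets


def wasted_share(tested_patch_sets: int, rechecks: int) -> float:
    """Share of all CI runs that were rechecks.

    Assumption: every recheck counts as a wasted run.
    Equivalent to 1 - 1 / ci_factor(...).
    """
    if tested_patch_sets <= 0:
        raise ValueError("need at least one tested patch-set")
    return rechecks / (tested_patch_sets + rechecks)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(ci_factor(100, 100))
    print(wasted_share(100, 100))
```

With these made-up numbers, one recheck per tested patch-set gives a
CI-factor of 2, which means half of all CI runs were rechecks.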

Is it really the case that more CI people would help here? I would have
expected that, as long as we don't do more modularized testing, it
doesn't scale. Would more CI people fix the problems more quickly? Or is
it more like: the burden could be distributed across more shoulders, so
that it is not always the same people who have to interrupt their work?
The second wouldn't improve the situation itself but would just spread
the burden more fairly.

With my post I mainly wanted to provide reliable data and emphasize how
important a stable CI and the work on it are, and that we should all
restrain ourselves from blindly rechecking.


Happy New Year to everyone!

Sven


