[openstack-dev] [third-party-ci][neutron] What is "Success" exactly?
Anita Kuno
anteaya at anteaya.info
Tue Jul 1 13:42:12 UTC 2014
On 06/30/2014 09:13 PM, Jay Pipes wrote:
> On 06/30/2014 07:08 PM, Anita Kuno wrote:
>> On 06/30/2014 04:22 PM, Jay Pipes wrote:
>>> Hi Stackers,
>>>
>>> Some recent ML threads [1] and a hot IRC meeting today [2] brought up
>>> some legitimate questions around how a newly-proposed Stackalytics
>>> report page for Neutron External CI systems [3] represented the results
>>> of an external CI system as "successful" or not.
>>>
>>> First, I want to say that Ilya and all those involved in the
>>> Stackalytics program simply want to provide the most accurate
>>> information to developers in a format that is easily consumed. While
>>> there need to be some changes in how data is shown (and the wording of
>>> things like "Tests Succeeded"), I hope that the community knows there
>>> isn't any ill intent on the part of Mirantis or anyone who works on
>>> Stackalytics. OK, so let's keep the conversation civil -- we're all
>>> working towards the same goals of transparency and accuracy. :)
>>>
>>> Alright, now, Anita and Kurt Taylor were asking a very pointed
>>> question:
>>>
>>> "But what does CI tested really mean? just running tests? or tested to
>>> pass some level of requirements?"
>>>
>>> In this nascent world of external CI systems, we have a set of issues
>>> that we need to resolve:
>>>
>>> 1) All of the CI systems are different.
>>>
>>> Some run Bash scripts. Some run Jenkins slaves and devstack-gate
>>> scripts. Others run custom Python code that spawns VMs and publishes
>>> logs to some public domain.
>>>
>>> As a community, we need to decide whether it is worth putting in the
>>> effort to create a single, unified, installable and runnable CI system,
>>> so that we can legitimately say "all of the external systems are
>>> identical, with the exception of the driver code for vendor X being
>>> substituted in the Neutron codebase."
>>>
>>> If the goal of the external CI systems is to produce reliable,
>>> consistent results, I feel the answer to the above is "yes", but I'm
>>> interested to hear what others think. Frankly, in the world of
>>> benchmarks, it would be unthinkable to say "go ahead and everyone run
>>> your own benchmark suite", because you would get wildly different
>>> results. A similar problem has emerged here.
>>>
>>> 2) There is no mediation or verification that the external CI system is
>>> actually testing anything at all
>>>
>>> As a community, we need to decide whether the current system of
>>> self-policing should continue. If it should, then language on reports
>>> like [3] should be very clear that any numbers derived from such systems
>>> should be taken with a grain of salt. Use of the word "Success" should
>>> be avoided, as it has connotations (in English, at least) that the
>>> result has been verified, which is simply not the case as long as no
>>> verification or mediation occurs for any external CI system.
>>>
>>> 3) There is no clear indication of what tests are being run, and
>>> therefore there is no clear indication of what "success" is
>>>
>>> I think we can all agree that a test has three possible outcomes: pass,
>>> fail, and skip. The result of a test suite run is therefore nothing
>>> more than the aggregation of which tests passed, which failed, and which
>>> were skipped.
>>>
>>> As a community, we must document, for each project, the expected set
>>> of tests that must be run for each patch merged into the project's
>>> source tree. This documentation should be discoverable so that reports
>>> like [3] can be crystal-clear on what the data shown actually means. The
>>> report is simply displaying the data it receives from Gerrit. The
>>> community needs to be proactive in saying "this is what is expected to
>>> be tested." This alone would allow the report to give information such
>>> as "External CI system ABC performed the expected tests. X tests passed.
>>> Y tests failed. Z tests were skipped." Likewise, it would also make it
>>> possible for the report to give information such as "External CI system
>>> DEF did not perform the expected tests.", which is excellent information
>>> in and of itself.
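A minimal sketch of the report wording described in point 3, assuming a project has published an expected set of tests. The test names, the `EXPECTED_TESTS` set and the `summarize_run` function are all hypothetical; nothing like them exists in Stackalytics or Gerrit as far as this thread says.

```python
from collections import Counter

# Hypothetical expected test set for a project; the thread does not define one.
EXPECTED_TESTS = {"test_create_network", "test_delete_network", "test_create_port"}

VALID_OUTCOMES = {"pass", "fail", "skip"}


def summarize_run(system, results, expected=EXPECTED_TESTS):
    """Summarize one CI run.

    `results` maps a test name to "pass", "fail" or "skip".
    """
    unknown = set(results.values()) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")

    # Any expected test that was not reported means the run is incomplete.
    missing = expected - set(results)
    if missing:
        return f"External CI system {system} did not perform the expected tests."

    # Count outcomes for the expected tests only.
    counts = Counter(results[name] for name in expected)
    return (
        f"External CI system {system} performed the expected tests. "
        f"{counts['pass']} tests passed. {counts['fail']} tests failed. "
        f"{counts['skip']} tests were skipped."
    )
```

The point of the sketch is that the "did not perform the expected tests" case only becomes detectable once the expected set is written down somewhere a report can read it.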
>>>
>>> ===
>>>
>>> In thinking about the likely answers to the above questions, I believe
>>> it would be prudent to change the Stackalytics report in question [3] in
>>> the following ways:
>>>
>>> a. Change the "Success %" column header to "% Reported +1 Votes"
>>> b. Change the phrase " Green cell - tests ran successfully, red cell -
>>> tests failed" to "Green cell - System voted +1, red cell - System
>>> voted -1"
>>>
>>> and then, when we have more and better data (for example, # tests
>>> passed, failed, skipped, etc), we can provide more detailed information
>>> than just "reported +1" or not.
>>>
>>> Thoughts?
>>>
>>> Best,
>>> -jay
>>>
>>> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
>>> [2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
>>> [3] http://stackalytics.com/report/ci/neutron/7
>>>
>>> _______________________________________________
>>> OpenStack-dev mailing list
>>> OpenStack-dev at lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>> Hi Jay:
>>
>> Thanks for starting this thread. You raise some interesting questions.
>>
>> The question I had identified as needing definition is "what algorithm
>> do we use to assess fitness of a third party ci system".
>>
>> http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2014-06-30.log
>>
>> timestamp 2014-06-30T19:23:40
>>
>> This is the question that is top of mind for me.
>
> Right, my email above is written to say "unless there is a) uniformity
> of the external CI system, b) agreement on mediation or verification of
> said systems, and c) agreement on what tests shall be expected to pass
> and be skipped for each project, then no such algorithm is really
> possible."
>
> Now, if the community is willing to agree to a), b), and c), then
> certainly there is the ability to determine the fitness of a CI system
> -- at least with regard to its output (test results and the voting on the
> Gerrit system).
>
> Barring agreement on any or all of those three things, I recommended
> changing the language on the report due to the inability to have any
> consistently-applied algorithm to determine fitness.
>
> Best,
> -jay
>
I've been mulling this over and looking at how I assess feedback I get
from different human reviewers, since I don't know the basis of how they
arrive at their decisions unless they tell me and/or I have experience
with their criteria for how they review my patches.

I get different value from different human reviewers based upon my
experience of them reviewing my patches, my experience of them reviewing
other people's patches, my experience reviewing their code and my
discussions with them in channel, on the mailing list and in person, as
well as my experience reading or becoming aware of other decisions they
make.

It would be really valuable for me personally to have a page in Gerrit
for each third party CI account, where I could sign in and leave
comments or vote +/-1 or 0 as a way of giving feedback to the
maintainers of that system. Also others could do the same and I could
read their feedback. For instance, yesterday someone linked me to logs
that forced me to download them to read. I hadn't been made aware this
account had been doing this, but this developer was aware. Currently we
have no system for a developer, in the course of their normal workflow,
to leave a comment and/or vote on a third party CI system to give those
maintainers feedback about how they are doing at providing consumable
artifacts from their system.

It also would remove the perception that I'm just a big meany, since
developers could comment for themselves, directly on the account, how
they feel about having to download tarballs, or sign into other systems
to trigger a recheck. The community of developers would say how fit a
system is or isn't since they are the individuals having to dig through
logs and evaluate "did this build fail because the code needs
adjustment" or not, and can reflect their findings in a comment and vote
on the system.

The other thing I really value about Gerrit is that votes can change,
systems can improve, given motivation and accurate feedback for making
changes.

I have no idea how hard this would be to create, but I think having
direct feedback from developers on systems would help both the
developers and the maintainers of CI systems.
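
To make the idea a bit more concrete, here is a rough sketch of what a per-account feedback record could hold. It is an illustration only: `CIAccountFeedback` and its methods are hypothetical names, this is not an existing Gerrit feature or API, and the choice to keep only each developer's latest vote is my assumption, made so that votes can change as a system improves.

```python
from dataclasses import dataclass, field

ALLOWED_VOTES = (-1, 0, 1)


@dataclass
class CIAccountFeedback:
    """Developer feedback on one third party CI account (hypothetical)."""

    account: str
    # reviewer name -> (vote, comment); a newer entry replaces the older one
    entries: dict = field(default_factory=dict)

    def leave_feedback(self, reviewer, vote, comment=""):
        """Record or update one reviewer's vote and optional comment."""
        if vote not in ALLOWED_VOTES:
            raise ValueError("vote must be -1, 0 or +1")
        self.entries[reviewer] = (vote, comment)

    def tally(self):
        """Count the current votes per value."""
        counts = {value: 0 for value in ALLOWED_VOTES}
        for vote, _comment in self.entries.values():
            counts[vote] += 1
        return counts

    def comments(self):
        """Return (reviewer, comment) pairs for non-empty comments."""
        return [(who, text) for who, (_vote, text) in self.entries.items() if text]
```

A developer who had to download logs could leave a -1 with a comment saying so, and change it to +1 once the maintainers publish browsable logs; the tally then reflects the current state of the system rather than its history.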

There are a number of people working really hard to do a good job in
this area. This sort of structure would also provide support and
encouragement to those people providing leadership in this space, people
asking good questions, helping other system maintainers, starting
discussions, offering patches to infra (and reviewing infra patches) in
accordance with the goals of the third party meeting [0] and making other
hard-to-measure decisions that provide value for the community.
I'd really like a way we all can demonstrate the extent to which we
value these contributions.

So far, those are my thoughts.

Thanks,
Anita.

[0] https://wiki.openstack.org/wiki/Meetings/ThirdParty#Goals_for_Third_Party_meetings