[openstack-dev] [third-party-ci][neutron] What is "Success" exactly?
Jay Pipes
jaypipes at gmail.com
Tue Jul 1 01:13:07 UTC 2014
On 06/30/2014 07:08 PM, Anita Kuno wrote:
> On 06/30/2014 04:22 PM, Jay Pipes wrote:
>> Hi Stackers,
>>
>> Some recent ML threads [1] and a hot IRC meeting today [2] brought up
>> some legitimate questions around how a newly-proposed Stackalytics
>> report page for Neutron External CI systems [3] represented the results
>> of an external CI system as "successful" or not.
>>
>> First, I want to say that Ilya and all those involved in the
>> Stackalytics program simply want to provide the most accurate
>> information to developers in a format that is easily consumed. While
>> there need to be some changes in how data is shown (and the wording of
>> things like "Tests Succeeded"), I hope that the community knows there
>> isn't any ill intent on the part of Mirantis or anyone who works on
>> Stackalytics. OK, so let's keep the conversation civil -- we're all
>> working towards the same goals of transparency and accuracy. :)
>>
>> Alright, now, Anita and Kurt Taylor were asking a very pertinent question:
>>
>> "But what does CI tested really mean? just running tests? or tested to
>> pass some level of requirements?"
>>
>> In this nascent world of external CI systems, we have a set of issues
>> that we need to resolve:
>>
>> 1) All of the CI systems are different.
>>
>> Some run Bash scripts. Some run Jenkins slaves and devstack-gate
>> scripts. Others run custom Python code that spawns VMs and publishes
>> logs to some public domain.
>>
>> As a community, we need to decide whether it is worth putting in the
>> effort to create a single, unified, installable and runnable CI system,
>> so that we can legitimately say "all of the external systems are
>> identical, with the exception of the driver code for vendor X being
>> substituted in the Neutron codebase."
>>
>> If the goal of the external CI systems is to produce reliable,
>> consistent results, I feel the answer to the above is "yes", but I'm
>> interested to hear what others think. Frankly, in the world of
>> benchmarks, it would be unthinkable to say "go ahead and everyone run
>> your own benchmark suite", because you would get wildly different
>> results. A similar problem has emerged here.
>>
>> 2) There is no mediation or verification that the external CI system is
>> actually testing anything at all.
>>
>> As a community, we need to decide whether the current system of
>> self-policing should continue. If it should, then language on reports
>> like [3] should be very clear that any numbers derived from such systems
>> should be taken with a grain of salt. Use of the word "Success" should
>> be avoided, as it has connotations (in English, at least) that the
>> result has been verified, which is simply not the case as long as no
>> verification or mediation occurs for any external CI system.
>>
>> 3) There is no clear indication of what tests are being run, and
>> therefore there is no clear indication of what "success" is.
>>
>> I think we can all agree that a test has three possible outcomes: pass,
>> fail, and skip. The result of a test suite run is therefore nothing
>> more than the aggregation of which tests passed, which failed, and which
>> were skipped.
>>
>> As a community, we must document, for each project, the expected set
>> of tests that must be run for each patch merged into the project's
>> source tree. This documentation should be discoverable so that reports
>> like [3] can be crystal-clear on what the data shown actually means. The
>> report is simply displaying the data it receives from Gerrit. The
>> community needs to be proactive in saying "this is what is expected to
>> be tested." This alone would allow the report to give information such
>> as "External CI system ABC performed the expected tests. X tests passed.
>> Y tests failed. Z tests were skipped." Likewise, it would also make it
>> possible for the report to give information such as "External CI system
>> DEF did not perform the expected tests.", which is excellent information
>> in and of itself.
>>
>> ===
>>
>> In thinking about the likely answers to the above questions, I believe
>> it would be prudent to change the Stackalytics report in question [3] in
>> the following ways:
>>
>> a. Change the "Success %" column header to "% Reported +1 Votes"
>> b. Change the phrase " Green cell - tests ran successfully, red cell -
>> tests failed" to "Green cell - System voted +1, red cell - System voted -1"
>>
>> and then, when we have more and better data (for example, # tests
>> passed, failed, skipped, etc), we can provide more detailed information
>> than just "reported +1" or not.
>>
>> Thoughts?
>>
>> Best,
>> -jay
>>
>> [1]
>> http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
>> [2]
>> http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
>>
>> [3] http://stackalytics.com/report/ci/neutron/7
>>
> Hi Jay:
>
> Thanks for starting this thread. You raise some interesting questions.
>
> The question I had identified as needing definition is "what algorithm
> do we use to assess fitness of a third party ci system".
>
> http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2014-06-30.log
> timestamp 2014-06-30T19:23:40
>
> This is the question that is top of mind for me.

Right, my email above is written to say "unless there is a) uniformity
of the external CI system, b) agreement on mediation or verification of
said systems, and c) agreement on what tests shall be expected to pass
and be skipped for each project, then no such algorithm is really possible."

Now, if the community is willing to agree to a), b), and c), then
certainly there is the ability to determine the fitness of a CI system
-- at least with regard to its output (test results and the voting on the
Gerrit system).

Barring agreement on any or all of those three things, I recommended
changing the language on the report due to the inability to have any
consistently-applied algorithm to determine fitness.

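
To illustrate what agreement on c) would make possible, here is a minimal
Python sketch. It is not something Stackalytics or Gerrit provides today.
The test names, EXPECTED_TESTS, and summarize_run are all hypothetical,
and it assumes each CI system could publish a per-test outcome of pass,
fail, or skip.

```python
from collections import Counter

# Hypothetical agreed-upon test list for one project. The real list would
# come from the per-project documentation proposed in point c).
EXPECTED_TESTS = frozenset(
    {"test_create_network", "test_delete_network", "test_create_port"}
)

VALID_OUTCOMES = ("pass", "fail", "skip")


def summarize_run(system, results, expected=EXPECTED_TESTS):
    """Return a one-line report for a single CI run.

    `results` maps a test name to its outcome: "pass", "fail", or "skip".
    """
    unknown = sorted({o for o in results.values() if o not in VALID_OUTCOMES})
    if unknown:
        raise ValueError("unknown outcome(s): " + ", ".join(unknown))

    # A run only counts if every expected test was reported on.
    missing = sorted(set(expected) - set(results))
    if missing:
        return (
            f"External CI system {system} did not perform the expected "
            f"tests (missing: {', '.join(missing)})."
        )

    # Aggregate outcomes over the expected tests only; extras are ignored.
    counts = Counter(results[name] for name in expected)
    return (
        f"External CI system {system} performed the expected tests. "
        f"{counts['pass']} tests passed. {counts['fail']} tests failed. "
        f"{counts['skip']} tests were skipped."
    )
```

Calling summarize_run("DEF", {"test_create_network": "pass"}) would report
that DEF did not perform the expected tests, which is useful information
that a bare +1 or -1 vote cannot carry.
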
Best,
-jay