[openstack-dev] [third-party-ci][neutron] What is "Success" exactly?

Jay Pipes jaypipes at gmail.com
Mon Jun 30 20:22:36 UTC 2014

Hi Stackers,

Some recent ML threads [1] and a hot IRC meeting today [2] brought up 
some legitimate questions around how a newly-proposed Stackalytics 
report page for Neutron External CI systems [3] represented the results 
of an external CI system as "successful" or not.

First, I want to say that Ilya and all those involved in the 
Stackalytics program simply want to provide the most accurate 
information to developers in a format that is easily consumed. While 
there need to be some changes in how data is shown (and the wording of 
things like "Tests Succeeded"), I hope that the community knows there 
isn't any ill intent on the part of Mirantis or anyone who works on 
Stackalytics. OK, so let's keep the conversation civil -- we're all 
working towards the same goals of transparency and accuracy. :)

Alright, now, Anita and Kurt Taylor asked a very pertinent question:

"But what does CI tested really mean? just running tests? or tested to 
pass some level of requirements?"

In this nascent world of external CI systems, we have a set of issues 
that we need to resolve:

1) All of the CI systems are different.

Some run Bash scripts. Some run Jenkins slaves and devstack-gate 
scripts. Others run custom Python code that spawns VMs and publishes 
logs to some public domain.

As a community, we need to decide whether it is worth putting in the 
effort to create a single, unified, installable and runnable CI system, 
so that we can legitimately say "all of the external systems are 
identical, with the exception of the driver code for vendor X being 
substituted in the Neutron codebase."

If the goal of the external CI systems is to produce reliable, 
consistent results, I feel the answer to the above is "yes", but I'm 
interested to hear what others think. Frankly, in the world of 
benchmarks, it would be unthinkable to say "go ahead and everyone run 
your own benchmark suite", because you would get wildly different 
results. A similar problem has emerged here.

2) There is no mediation or verification that the external CI system is 
actually testing anything at all

As a community, we need to decide whether the current system of 
self-policing should continue. If it should, then language on reports 
like [3] should be very clear that any numbers derived from such systems 
should be taken with a grain of salt. Use of the word "Success" should 
be avoided, as it has connotations (in English, at least) that the 
result has been verified, which is simply not the case as long as no 
verification or mediation occurs for any external CI system.

3) There is no clear indication of what tests are being run, and 
therefore there is no clear indication of what "success" is

I think we can all agree that a test has three possible outcomes: pass, 
fail, and skip. The result of a test suite run is therefore nothing 
more than the aggregation of which tests passed, which failed, and 
which were skipped.
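To make that concrete, here is a minimal sketch (the function and test 
names are hypothetical, not Stackalytics or Tempest code) of what 
aggregating a suite run into those three buckets amounts to:

```python
from collections import Counter

def summarize(results):
    """Aggregate per-test outcomes into pass/fail/skip counts.

    `results` maps test names to one of "pass", "fail", "skip".
    """
    counts = Counter(results.values())
    return {outcome: counts.get(outcome, 0)
            for outcome in ("pass", "fail", "skip")}

summary = summarize({
    "test_create_port": "pass",
    "test_delete_port": "fail",
    "test_dvr_router": "skip",
})
```

Anything a report claims beyond these three counts is interpretation 
layered on top of the raw data.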

As a community, we must document, for each project, the expected set 
of tests that must be run for each patch merged into the project's 
source tree. This documentation should be discoverable so that reports 
like [3] can be crystal-clear on what the data shown actually means. The 
report is simply displaying the data it receives from Gerrit. The 
community needs to be proactive in saying "this is what is expected to 
be tested." This alone would allow the report to give information such 
as "External CI system ABC performed the expected tests. X tests passed. 
Y tests failed. Z tests were skipped." Likewise, it would also make it 
possible for the report to give information such as "External CI system 
DEF did not perform the expected tests.", which is excellent information 
in and of itself.
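Once such an expected set is documented, the check a report would need 
is a simple set comparison. A sketch, assuming a hypothetical helper 
and made-up Neutron test names:

```python
def check_expected(expected, ran):
    """Compare a documented expected test set against the tests a CI
    run actually reported.

    Returns (performed_expected, missing), where `missing` is the set
    of expected tests the run never executed.
    """
    missing = set(expected) - set(ran)
    return (not missing, missing)

expected = {"test_create_port", "test_delete_port", "test_dvr_router"}
ok, missing = check_expected(
    expected, {"test_create_port", "test_delete_port"})
```

Here `ok` would be False and `missing` would name the skipped-over 
test, which is exactly the "did not perform the expected tests" signal 
described above.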


In thinking about the likely answers to the above questions, I believe 
it would be prudent to change the Stackalytics report in question [3] in 
the following ways:

a. Change the "Success %" column header to "% Reported +1 Votes"
b. Change the phrase "Green cell - tests ran successfully, red cell - 
tests failed" to "Green cell - System voted +1, red cell - System voted -1"

and then, when we have more and better data (for example, # tests 
passed, failed, skipped, etc), we can provide more detailed information 
than just "reported +1" or not.
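The renamed "% Reported +1 Votes" column is then just the share of 
reviews on which the system voted +1, with no claim about what was 
tested. A sketch (hypothetical function, not the report's actual 
code):

```python
def pct_plus_one(votes):
    """Percentage of reviews on which a CI system voted +1.

    `votes` is a list of integers, +1 or -1, as reported to Gerrit.
    An empty history yields 0.0 rather than dividing by zero.
    """
    if not votes:
        return 0.0
    return 100.0 * sum(1 for v in votes if v == 1) / len(votes)

share = pct_plus_one([1, 1, -1, 1])
```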



[1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
[3] http://stackalytics.com/report/ci/neutron/7
