[openstack-dev] [third-party-ci][neutron] What is "Success" exactly?
Jay Pipes
jaypipes at gmail.com
Mon Jun 30 20:22:36 UTC 2014
Hi Stackers,
Some recent ML threads [1] and a heated IRC meeting today [2] brought up
some legitimate questions about how a newly proposed Stackalytics
report page for Neutron external CI systems [3] represents the results
of an external CI system as "successful" or not.
First, I want to say that Ilya and all those involved in the
Stackalytics program simply want to provide the most accurate
information to developers in a format that is easily consumed. While
there need to be some changes in how data is shown (and the wording of
things like "Tests Succeeded"), I hope that the community knows there
isn't any ill intent on the part of Mirantis or anyone who works on
Stackalytics. OK, so let's keep the conversation civil -- we're all
working towards the same goals of transparency and accuracy. :)
Alright, now, Anita and Kurt Taylor were asking a very pointed question:
"But what does CI tested really mean? just running tests? or tested to
pass some level of requirements?"
In this nascent world of external CI systems, we have a set of issues
that we need to resolve:
1) All of the CI systems are different.
Some run Bash scripts. Some run Jenkins slaves and devstack-gate
scripts. Others run custom Python code that spawns VMs and publishes
logs to a publicly accessible site.
As a community, we need to decide whether it is worth putting in the
effort to create a single, unified, installable and runnable CI system,
so that we can legitimately say "all of the external systems are
identical, with the exception of the driver code for vendor X being
substituted in the Neutron codebase."
If the goal of the external CI systems is to produce reliable,
consistent results, I feel the answer to the above is "yes", but I'm
interested to hear what others think. Frankly, in the world of
benchmarks, it would be unthinkable to say "go ahead and everyone run
your own benchmark suite", because you would get wildly different
results. A similar problem has emerged here.
2) There is no mediation or verification that the external CI system is
actually testing anything at all
As a community, we need to decide whether the current system of
self-policing should continue. If it should, then language on reports
like [3] should be very clear that any numbers derived from such systems
should be taken with a grain of salt. Use of the word "Success" should
be avoided, as it has connotations (in English, at least) that the
result has been verified, which is simply not the case as long as no
verification or mediation occurs for any external CI system.
3) There is no clear indication of what tests are being run, and
therefore there is no clear indication of what "success" is
I think we can all agree that a test has three possible outcomes: pass,
fail, and skip. The results of a test suite run are therefore nothing
more than the aggregation of which tests passed, which failed, and which
were skipped.
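To make that concrete, here is a minimal Python sketch of what such an
aggregation amounts to. The (test_name, status) pair format is a
hypothetical stand-in, not any particular CI system's actual output:

    # Minimal sketch: tally a test suite run into the three possible
    # outcomes. The input format is assumed: an iterable of
    # (test_name, status) pairs where status is "pass", "fail" or "skip".
    from collections import Counter

    def aggregate(results):
        counts = Counter(status for _, status in results)
        return {outcome: counts.get(outcome, 0)
                for outcome in ("pass", "fail", "skip")}

    # aggregate([("test_create_port", "pass"),
    #            ("test_delete_port", "fail"),
    #            ("test_ipv6_subnet", "skip")])
    # -> {'pass': 1, 'fail': 1, 'skip': 1}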
As a community, we must document, for each project, the expected set of
tests that must be run for each patch merged into the project's source
tree. This documentation should be discoverable so that reports like [3]
can be crystal-clear about what the data shown actually means. The
report simply displays the data it receives from Gerrit. The
community needs to be proactive in saying "this is what is expected to
be tested." This alone would allow the report to give information such
as "External CI system ABC performed the expected tests. X tests passed.
Y tests failed. Z tests were skipped." Likewise, it would also make it
possible for the report to give information such as "External CI system
DEF did not perform the expected tests.", which is excellent information
in and of itself.
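As a purely illustrative sketch (the function name, the per-project list
of expected tests, and the data shapes are all hypothetical here, not an
existing Stackalytics or Gerrit API), the report logic described above
could be as simple as:

    # Hypothetical sketch: compare what an external CI system reported
    # against the documented set of expected tests for the project.
    def summarize_ci_run(system_name, expected_tests, reported_results):
        expected = set(expected_tests)
        reported_names = set(name for name, _ in reported_results)
        if not expected.issubset(reported_names):
            return ("External CI system %s did not perform the expected "
                    "tests." % system_name)
        counts = {"pass": 0, "fail": 0, "skip": 0}
        for name, status in reported_results:
            if name in expected and status in counts:
                counts[status] += 1
        return ("External CI system %s performed the expected tests. "
                "%d tests passed. %d tests failed. %d tests were skipped."
                % (system_name, counts["pass"], counts["fail"],
                   counts["skip"]))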
===
In thinking about the likely answers to the above questions, I believe
it would be prudent to change the Stackalytics report in question [3] in
the following ways:
a. Change the "Success %" column header to "% Reported +1 Votes"
b. Change the phrase " Green cell - tests ran successfully, red cell -
tests failed" to "Green cell - System voted +1, red cell - System voted -1"
and then, when we have more and better data (for example, # tests
passed, failed, skipped, etc), we can provide more detailed information
than just "reported +1" or not.
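For illustration only, the relabeled number would measure nothing more
than the fraction of +1 votes a system reported to Gerrit. A hedged
sketch of that calculation (not how Stackalytics actually computes its
column) follows:

    # Hypothetical sketch: "% Reported +1 Votes" computed purely from
    # the vote values a CI system posted, with no claim about what was
    # actually tested.
    def percent_reported_plus_one(votes):
        votes = list(votes)
        if not votes:
            return 0.0
        return 100.0 * sum(1 for v in votes if v == 1) / len(votes)

    # percent_reported_plus_one([1, 1, -1, 1]) -> 75.0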
Thoughts?
Best,
-jay
[1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
[2]
http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
[3] http://stackalytics.com/report/ci/neutron/7