[openstack-dev] [third-party-ci][neutron] What is "Success" exactly?
Salvatore Orlando
sorlando at nicira.com
Thu Jul 3 11:12:17 UTC 2014
Apologies for quoting the top post of the thread again.
Comments inline (mostly thinking aloud)
Salvatore
On 30 June 2014 22:22, Jay Pipes <jaypipes at gmail.com> wrote:
> Hi Stackers,
>
> Some recent ML threads [1] and a hot IRC meeting today [2] brought up some
> legitimate questions around how a newly-proposed Stackalytics report page
> for Neutron External CI systems [3] represented the results of an external
> CI system as "successful" or not.
>
> First, I want to say that Ilya and all those involved in the Stackalytics
> program simply want to provide the most accurate information to developers
> in a format that is easily consumed. While there need to be some changes in
> how data is shown (and the wording of things like "Tests Succeeded"), I
> hope that the community knows there isn't any ill intent on the part of
> Mirantis or anyone who works on Stackalytics. OK, so let's keep the
> conversation civil -- we're all working towards the same goals of
> transparency and accuracy. :)
>
> Alright, now, Anita and Kurt Taylor were asking a very poignant question:
>
> "But what does CI tested really mean? just running tests? or tested to
> pass some level of requirements?"
>
> In this nascent world of external CI systems, we have a set of issues that
> we need to resolve:
>
> 1) All of the CI systems are different.
>
> Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts.
> Others run custom Python code that spawns VMs and publishes logs to some
> public domain.
>
> As a community, we need to decide whether it is worth putting in the
> effort to create a single, unified, installable and runnable CI system, so
> that we can legitimately say "all of the external systems are identical,
> with the exception of the driver code for vendor X being substituted in the
> Neutron codebase."
>
I think such a system already exists, and it's documented here:
http://ci.openstack.org/
Still, there is quite a learning curve to understanding it, and running it
is not exactly straightforward. But I guess that's pretty much
understandable given the complexity of the system, isn't it?
>
> If the goal of the external CI systems is to produce reliable, consistent
> results, I feel the answer to the above is "yes", but I'm interested to
> hear what others think. Frankly, in the world of benchmarks, it would be
> unthinkable to say "go ahead and everyone run your own benchmark suite",
> because you would get wildly different results. A similar problem has
> emerged here.
>
I don't think the particular infrastructure, which might range from an
openstack-ci clone to a 100-line bash script, has much impact on the
"reliability" of the quality assessment for a particular driver or plugin.
That is determined, in my opinion, by the quantity and nature of the tests
one runs on a specific driver. In Neutron, for instance, there is a wide
range of choices - from a few test cases in tempest.api.network to the full
smoketest job. As long as there is no minimal standard here, it will be
difficult to assess the quality of the evaluation from a CI system, unless
we explicitly take coverage into account in the evaluation.
On the other hand, different CI infrastructures will differ in terms of the
% of patches tested and the % of infrastructure failures. I think it might
not be a terrible idea to use these parameters to evaluate how good a CI is
from an infra standpoint. However, there are still open questions. For
instance, a CI might have a low patch % score simply because it only needs
to test patches affecting a given driver.
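Just to make that concrete, here is a minimal sketch of how those two
infra-level measures could be computed. The record format and the function
name are made up for illustration; they are not anything Stackalytics or
Gerrit actually expose today:

    # Hypothetical sketch only: per-patch records are assumed to say whether
    # the patch was relevant to this CI, whether the CI reported a result,
    # and whether the run died for infrastructure reasons.
    def ci_infra_scores(records):
        """records: list of dicts like
        {"relevant": bool, "commented": bool, "infra_failure": bool}."""
        relevant = [r for r in records if r["relevant"]]
        if not relevant:
            return None
        commented = [r for r in relevant if r["commented"]]
        failures = [r for r in commented if r["infra_failure"]]
        return {
            # share of relevant patches the CI actually reported on
            "patch_coverage": len(commented) / float(len(relevant)),
            # share of reported runs lost to infrastructure problems
            "infra_failure_rate": (len(failures) / float(len(commented))
                                   if commented else 0.0),
        }

Counting coverage only over the patches that are relevant to the CI is
exactly what would avoid penalising a driver-specific CI for its low
overall patch %.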
> 2) There is no mediation or verification that the external CI system is
> actually testing anything at all
>
> As a community, we need to decide whether the current system of
> self-policing should continue. If it should, then language on reports like
> [3] should be very clear that any numbers derived from such systems should
> be taken with a grain of salt. Use of the word "Success" should be avoided,
> as it has connotations (in English, at least) that the result has been
> verified, which is simply not the case as long as no verification or
> mediation occurs for any external CI system.
>
> 3) There is no clear indication of what tests are being run, and therefore
> there is no clear indication of what "success" is
>
> I think we can all agree that a test has three possible outcomes: pass,
> fail, and skip. The results of a test suite run therefore is nothing more
> than the aggregation of which tests passed, which failed, and which were
> skipped.
>
> As a community, we must document, for each project, the expected set of
> tests that must be run for each patch merged into the project's source
> tree. This documentation should be discoverable so that reports like [3]
> can be crystal-clear on what the data shown actually means. The report is
> simply displaying the data it receives from Gerrit. The community needs to
> be proactive in saying "this is what is expected to be tested." This alone
> would allow the report to give information such as "External CI system ABC
> performed the expected tests. X tests passed. Y tests failed. Z tests were
> skipped." Likewise, it would also make it possible for the report to give
> information such as "External CI system DEF did not perform the expected
> tests.", which is excellent information in and of itself.
>
>
Agreed. In Neutron we have made external CIs mandatory, but we have not yet
agreed on the minimum set of tests we expect them to run. I reckon this
will be fixed soon.
I'll try to look at what "SUCCESS" is from a naive standpoint: a CI says
"SUCCESS" if the test suite it ran passed; one should then have the means
to understand whether a CI might blatantly lie or tell "half truths". For
instance, saying it passes tempest.api.network while
tempest.scenario.test_network_basic_ops has not been executed is a half
truth, in my opinion.
Stackalytics can help here, I think. One could create "CI classes"
according to how close they come to the level of the upstream gate, and
then parse the posted results to classify CIs. Now, before cursing me, I
totally understand that this won't be easy at all to implement!
Furthermore, I don't know whether or how this should be reflected in Gerrit.
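To give a rough idea of what I mean by "CI classes", here is a naive sketch
that ranks a CI by the suites it reports, ordered by how close each class
comes to the upstream gate. The class labels and required suites below are
purely illustrative, not an agreed-upon Neutron policy:

    # Naive sketch of "CI classes"; labels and suite sets are illustrative.
    CI_CLASSES = [
        # (label, suites that must all appear in the CI's posted results)
        ("gate-like", {"tempest.api.network",
                       "tempest.scenario.test_network_basic_ops"}),
        ("api-only", {"tempest.api.network"}),
    ]

    def classify_ci(reported_suites):
        """reported_suites: set of suite names parsed from the CI's logs."""
        reported = set(reported_suites)
        for label, required in CI_CLASSES:
            if required <= reported:
                return label
        return "unclassified"

With something like this, classify_ci({"tempest.api.network"}) would come
back as "api-only", which makes the "half truth" case above visible instead
of being hidden behind a plain "SUCCESS".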
> ===
>
> In thinking about the likely answers to the above questions, I believe it
> would be prudent to change the Stackalytics report in question [3] in the
> following ways:
>
> a. Change the "Success %" column header to "% Reported +1 Votes"
> b. Change the phrase " Green cell - tests ran successfully, red cell -
> tests failed" to "Green cell - System voted +1, red cell - System voted -1"
>
That makes sense to me.
>
> and then, when we have more and better data (for example, # tests passed,
> failed, skipped, etc), we can provide more detailed information than just
> "reported +1" or not.
>
I think it should not be too hard to start adding minimal measures such as
"% of voted patches".
>
> Thoughts?
>
> Best,
> -jay
>
> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
> [2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
> [3] http://stackalytics.com/report/ci/neutron/7
>