[openstack-dev] [third-party-ci][neutron] What is "Success" exactly?

Anita Kuno anteaya at anteaya.info
Thu Jul 3 13:49:45 UTC 2014


On 07/03/2014 07:12 AM, Salvatore Orlando wrote:
> Apologies for quoting again the top post of the thread.
> 
> Comments inline (mostly thinking aloud)
> Salvatore
> 
> 
> On 30 June 2014 22:22, Jay Pipes <jaypipes at gmail.com> wrote:
> 
>> Hi Stackers,
>>
>> Some recent ML threads [1] and a hot IRC meeting today [2] brought up some
>> legitimate questions around how a newly-proposed Stackalytics report page
>> for Neutron External CI systems [3] represented the results of an external
>> CI system as "successful" or not.
>>
>> First, I want to say that Ilya and all those involved in the Stackalytics
>> program simply want to provide the most accurate information to developers
>> in a format that is easily consumed. While there need to be some changes in
>> how data is shown (and the wording of things like "Tests Succeeded"), I
>> hope that the community knows there isn't any ill intent on the part of
>> Mirantis or anyone who works on Stackalytics. OK, so let's keep the
>> conversation civil -- we're all working towards the same goals of
>> transparency and accuracy. :)
>>
>> Alright, now, Anita and Kurt Taylor were asking a very pertinent question:
>>
>> "But what does CI tested really mean? just running tests? or tested to
>> pass some level of requirements?"
>>
>> In this nascent world of external CI systems, we have a set of issues that
>> we need to resolve:
>>
>> 1) All of the CI systems are different.
>>
>> Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts.
>> Others run custom Python code that spawns VMs and publishes logs to some
>> public site.
>>
>> As a community, we need to decide whether it is worth putting in the
>> effort to create a single, unified, installable and runnable CI system, so
>> that we can legitimately say "all of the external systems are identical,
>> with the exception of the driver code for vendor X being substituted in the
>> Neutron codebase."
>>
> 
> I think such a system already exists, and it's documented here:
> http://ci.openstack.org/
> Still, understanding it involves quite a learning curve, and running it is
> not exactly straightforward. But I guess that's pretty much understandable
> given the complexity of the system, isn't it?
> 
> 
>>
>> If the goal of the external CI systems is to produce reliable, consistent
>> results, I feel the answer to the above is "yes", but I'm interested to
>> hear what others think. Frankly, in the world of benchmarks, it would be
>> unthinkable to say "go ahead and everyone run your own benchmark suite",
>> because you would get wildly different results. A similar problem has
>> emerged here.
>>
> 
> I don't think the particular infrastructure, which might range from an
> openstack-ci clone to a 100-line bash script, would have an impact on the
> "reliability" of the quality assessment regarding a particular driver or
> plugin. This is determined, in my opinion, by the quantity and nature of
> the tests one runs on a specific driver. In Neutron, for instance, there is
> a wide range of choices - from a few test cases in tempest.api.network to
> the full smoketest job. As long as there is no minimal standard here, it
> will be difficult to assess the quality of the evaluation from a CI system,
> unless we explicitly take coverage into account in the evaluation.
> 
> On the other hand, different CI infrastructures will differ in the % of
> patches tested and the % of infrastructure failures. I think it might not
> be a terrible idea to use these parameters to evaluate how good a CI is
> from an infra standpoint. However, there are still open questions. For
> instance, a CI might have a low patch % score because it only needs to
> test patches affecting a given driver.
> 
> 
>> 2) There is no mediation or verification that the external CI system is
>> actually testing anything at all
>>
>> As a community, we need to decide whether the current system of
>> self-policing should continue. If it should, then language on reports like
>> [3] should be very clear that any numbers derived from such systems should
>> be taken with a grain of salt. Use of the word "Success" should be avoided,
>> as it has connotations (in English, at least) that the result has been
>> verified, which is simply not the case as long as no verification or
>> mediation occurs for any external CI system.
>>
> 
> 
> 
> 
>> 3) There is no clear indication of what tests are being run, and therefore
>> there is no clear indication of what "success" is
>>
>> I think we can all agree that a test has three possible outcomes: pass,
>> fail, and skip. The results of a test suite run are therefore nothing more
>> than the aggregation of which tests passed, which failed, and which were
>> skipped.
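
As a rough sketch of that aggregation (the result list and outcome
strings below are hypothetical, not any particular CI's output format):

    from collections import Counter

    # Each entry is (test_id, outcome); outcome is one of the three
    # possible results named above.
    results = [
        ("tempest.api.network.test_ports", "pass"),
        ("tempest.scenario.test_network_basic_ops", "fail"),
        ("tempest.api.network.test_routers", "skip"),
    ]

    summary = Counter(outcome for _, outcome in results)
    print("%d passed, %d failed, %d skipped"
          % (summary["pass"], summary["fail"], summary["skip"]))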
>>
>> As a community, we must document, for each project, the expected set of
>> tests that must be run for each patch merged into the project's source
>> tree. This documentation should be discoverable so that reports like [3]
>> can be crystal-clear on what the data shown actually means. The report is
>> simply displaying the data it receives from Gerrit. The community needs to
>> be proactive in saying "this is what is expected to be tested." This alone
>> would allow the report to give information such as "External CI system ABC
>> performed the expected tests. X tests passed. Y tests failed. Z tests were
>> skipped." Likewise, it would also make it possible for the report to give
>> information such as "External CI system DEF did not perform the expected
>> tests.", which is excellent information in and of itself.
>>
>>
> Agreed. In Neutron we have made CIs mandatory, but we have not yet agreed
> on the minimum set of tests we expect them to run. I reckon this will be
> fixed soon.
> 
> I'll try to look at what "SUCCESS" is from a naive standpoint: a CI says
> "SUCCESS" if the test suite it ran passed; then one should have means to
> understand whether a CI might blatantly lie or tell "half truths". For
> instance, saying it passes tempest.api.network while
> tempest.scenario.test_network_basic_ops has not been executed is a half
> truth, in my opinion.
> Stackalytics can help here, I think. One could create "CI classes"
> according to how close they are to the level of the upstream gate, and
> then parse the posted results to classify CIs. Now, before cursing me, I
> totally understand that this won't be easy at all to implement!
> Furthermore, I don't know how this should be reflected in gerrit.
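
To make that concrete, a naive sketch of such a classification (the
class names and required suites are made up; real matching would need
to be far more robust):

    # Tiers ordered from closest to the upstream gate downwards.
    CI_CLASSES = [
        ("gate-equivalent", {"tempest.api.network",
                             "tempest.scenario.test_network_basic_ops"}),
        ("api-only", {"tempest.api.network"}),
    ]

    def classify(suites_run):
        # Return the first (best) class whose suites were all run.
        for name, required in CI_CLASSES:
            if required <= set(suites_run):
                return name
        return "unclassified"

    print(classify(["tempest.api.network"]))  # -> api-only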
> 
> 
>> ===
>>
>> In thinking about the likely answers to the above questions, I believe it
>> would be prudent to change the Stackalytics report in question [3] in the
>> following ways:
>>
>> a. Change the "Success %" column header to "% Reported +1 Votes"
>> b. Change the phrase " Green cell - tests ran successfully, red cell -
>> tests failed" to "Green cell - System voted +1, red cell - System voted -1"
>>
> 
> That makes sense to me.
> 
> 
>>
>> and then, when we have more and better data (for example, # tests passed,
>> failed, skipped, etc), we can provide more detailed information than just
>> "reported +1" or not.
>>
> 
> I think it should not be too hard to start adding minimal measures such as
> "% of voted patches".
> 
>>
>> Thoughts?
>>
>> Best,
>> -jay
>>
>> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
>> [2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
>> [3] http://stackalytics.com/report/ci/neutron/7
>>
Thanks for sharing your thoughts, Salvatore.

Some additional things to look at:

Sean Dague has created a tool in stackforge, gerrit-dash-creator:
http://git.openstack.org/cgit/stackforge/gerrit-dash-creator/tree/README.rst
which has the ability to make interesting queries on Gerrit results. One
such example can be found here: http://paste.openstack.org/show/85416/
(Note: when this URL was created there was a bug in the syntax, so the
URL works in Chrome but not Firefox. Sean tells me the Firefox bug has
since been addressed, though the URL hasn't been updated to the new
syntax yet.)

This allows the viewer to see categories of reviews based upon their
divergence from OpenStack's Jenkins results. I think evaluating
divergence from Jenkins might be a metric worth considering; a rough
sketch of one way to compute it follows.
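
For instance (the vote mappings are hypothetical, keyed by Gerrit
change ID and holding the +1/-1 a system left on that change):

    def divergence_from_jenkins(ci_votes, jenkins_votes):
        # Fraction of commonly-voted changes where the votes differ.
        common = set(ci_votes) & set(jenkins_votes)
        if not common:
            return 0.0
        differing = sum(1 for change in common
                        if ci_votes[change] != jenkins_votes[change])
        return float(differing) / len(common)

    print(divergence_from_jenkins({"I123": 1, "I456": -1},
                                  {"I123": 1, "I456": 1}))
    # -> 0.5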

Also, a GUI representation worth looking at is Michael Still's page for
Neutron CI health:
http://www.rcbops.com/gerrit/reports/neutron-cireport.html
and Nova CI health: http://www.rcbops.com/gerrit/reports/nova-cireport.html

I don't know the details of how the graphs on these pages are
calculated, but being able to view passed/failed/missed results and
compare them to Jenkins is an interesting approach that I feel has some
merit.

Thanks, I think we are getting some good information out in this
thread, and I look forward to hearing more thoughts.

Thank you,
Anita.
