[openstack-dev] [third-party-ci][neutron] What is "Success" exactly?
Fawad Khaliq
fawad at plumgrid.com
Thu Jul 3 18:41:22 UTC 2014
On Thu, Jul 3, 2014 at 10:27 AM, Kevin Benton <blak111 at gmail.com> wrote:
> >This allows the viewer to see categories of reviews based upon their
> >divergence from OpenStack's Jenkins results. I think evaluating
> >divergence from Jenkins might be a metric worth consideration.
>
> I think the only thing this really reflects, though, is how much the
> third-party CI system is mirroring Jenkins.
> A system that frequently diverges may be functioning perfectly well; it may
> simply be integration-testing a vastly different code path, and so it is
> legitimately detecting failures that the OpenStack CI cannot.
>
Exactly. +1
>
> --
> Kevin Benton
>
>
> On Thu, Jul 3, 2014 at 6:49 AM, Anita Kuno <anteaya at anteaya.info> wrote:
>
>> On 07/03/2014 07:12 AM, Salvatore Orlando wrote:
>> > Apologies for quoting again the top post of the thread.
>> >
>> > Comments inline (mostly thinking aloud)
>> > Salvatore
>> >
>> >
>> > On 30 June 2014 22:22, Jay Pipes <jaypipes at gmail.com> wrote:
>> >
>> >> Hi Stackers,
>> >>
>> >> Some recent ML threads [1] and a hot IRC meeting today [2] brought up
>> >> some legitimate questions around how a newly-proposed Stackalytics
>> >> report page for Neutron External CI systems [3] represented the results
>> >> of an external CI system as "successful" or not.
>> >>
>> >> First, I want to say that Ilya and all those involved in the
>> Stackalytics
>> >> program simply want to provide the most accurate information to
>> developers
>> >> in a format that is easily consumed. While there need to be some
>> changes in
>> >> how data is shown (and the wording of things like "Tests Succeeded"), I
>> >> hope that the community knows there isn't any ill intent on the part of
>> >> Mirantis or anyone who works on Stackalytics. OK, so let's keep the
>> >> conversation civil -- we're all working towards the same goals of
>> >> transparency and accuracy. :)
>> >>
>> >> Alright, now, Anita and Kurt Taylor were asking a very poignant
>> question:
>> >>
>> >> "But what does CI tested really mean? just running tests? or tested to
>> >> pass some level of requirements?"
>> >>
>> >> In this nascent world of external CI systems, we have a set of issues
>> that
>> >> we need to resolve:
>> >>
>> >> 1) All of the CI systems are different.
>> >>
>> >> Some run Bash scripts. Some run Jenkins slaves and devstack-gate
>> scripts.
>> >> Others run custom Python code that spawns VMs and publishes logs to
>> some
>> >> public domain.
>> >>
>> >> As a community, we need to decide whether it is worth putting in the
>> >> effort to create a single, unified, installable and runnable CI
>> system, so
>> >> that we can legitimately say "all of the external systems are
>> identical,
>> >> with the exception of the driver code for vendor X being substituted
>> in the
>> >> Neutron codebase."
>> >>
>> >
>> > I think such a system already exists, and it's documented here:
>> > http://ci.openstack.org/
>> > Still, understanding it involves quite a learning curve, and running it is
>> > not exactly straightforward. But I guess that's pretty much understandable
>> > given the complexity of the system, isn't it?
>> >
>> >
>> >>
>> >> If the goal of the external CI systems is to produce reliable,
>> consistent
>> >> results, I feel the answer to the above is "yes", but I'm interested to
>> >> hear what others think. Frankly, in the world of benchmarks, it would
>> be
>> >> unthinkable to say "go ahead and everyone run your own benchmark
>> suite",
>> >> because you would get wildly different results. A similar problem has
>> >> emerged here.
>> >>
>> >
>> > I don't think the particular infrastructure, which might range from an
>> > openstack-ci clone to a 100-line bash script, would have an impact on the
>> > "reliability" of the quality assessment regarding a particular driver or
>> > plugin. That is determined, in my opinion, by the quantity and nature of
>> > the tests one runs on a specific driver. In Neutron, for instance, there
>> > is a wide range of choices - from a few test cases in tempest.api.network
>> > to the full smoketest job. As long as there is no minimal standard here,
>> > it will be difficult to assess the quality of the evaluation from a CI
>> > system, unless we explicitly take coverage into account in the evaluation.
>> >
>> > On the other hand, different CI infrastructures will differ in terms of
>> > the % of patches tested and the % of infrastructure failures. I think
>> > it might not be a terrible idea to use these parameters to evaluate how
>> > good a CI is from an infra standpoint. However, there are still open
>> > questions. For instance, a CI might have a low patch % score because it
>> > only needs to test patches affecting a given driver.
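
(Purely to make those two infra-level measures concrete, here is a minimal
Python sketch. The record format and field names are invented for the
example, assuming per-run data had already been extracted from the comments
a CI account leaves in Gerrit; this is illustration only, not a description
of what the Stackalytics report does.)

    # Hypothetical sketch of "% of patches tested" and "% of infrastructure
    # failures" for a single CI, using an invented per-run record format.
    def infra_metrics(runs, expected_patches):
        """runs: dicts like {"patch": "I1", "voted": True, "infra_failure": False}
        expected_patches: changes this CI was supposed to test, which may
        legitimately be only the patches touching its own driver."""
        tested = {r["patch"] for r in runs if r["voted"]}
        infra_failures = sum(1 for r in runs if r["infra_failure"])
        return {
            "patch_pct": 100.0 * len(tested & expected_patches)
                         / max(len(expected_patches), 1),
            "infra_failure_pct": 100.0 * infra_failures / max(len(runs), 1),
        }

    print(infra_metrics([{"patch": "I1", "voted": True, "infra_failure": False},
                         {"patch": "I2", "voted": False, "infra_failure": True}],
                        {"I1", "I2", "I3"}))  # patch_pct ~33%, infra_failure_pct 50%
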
>> >
>> >
>> >> 2) There is no mediation or verification that the external CI system is
>> >> actually testing anything at all
>> >>
>> >> As a community, we need to decide whether the current system of
>> >> self-policing should continue. If it should, then language on reports
>> like
>> >> [3] should be very clear that any numbers derived from such systems
>> should
>> >> be taken with a grain of salt. Use of the word "Success" should be
>> avoided,
>> >> as it has connotations (in English, at least) that the result has been
>> >> verified, which is simply not the case as long as no verification or
>> >> mediation occurs for any external CI system.
>> >>
>> >
>> >
>> >
>> >
>> >> 3) There is no clear indication of what tests are being run, and
>> therefore
>> >> there is no clear indication of what "success" is
>> >>
>> >> I think we can all agree that a test has three possible outcomes: pass,
>> >> fail, and skip. The results of a test suite run are therefore nothing
>> >> more than the aggregation of which tests passed, which failed, and which
>> >> were skipped.
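
(As a toy illustration of that aggregation, with a made-up input format
rather than any particular CI's real output:)

    from collections import Counter

    # Toy aggregation of one test-suite run into the three possible outcomes.
    # Real systems would parse subunit/testr output or Jenkins artifacts instead.
    results = [("test_create_network", "pass"),
               ("test_attach_port", "fail"),
               ("test_ipv6_slaac", "skip")]
    summary = Counter(status for _, status in results)
    print("passed=%d failed=%d skipped=%d"
          % (summary["pass"], summary["fail"], summary["skip"]))
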
>> >>
>> >> As a community, we must document, for each project, the expected set of
>> >> tests that must be run for each patch merged into the project's source
>> >> tree. This documentation should be discoverable so that reports like
>> [3]
>> >> can be crystal-clear on what the data shown actually means. The report
>> is
>> >> simply displaying the data it receives from Gerrit. The community
>> needs to
>> >> be proactive in saying "this is what is expected to be tested." This
>> alone
>> >> would allow the report to give information such as "External CI system
>> ABC
>> >> performed the expected tests. X tests passed. Y tests failed. Z tests
>> were
>> >> skipped." Likewise, it would also make it possible for the report to
>> give
>> >> information such as "External CI system DEF did not perform the
>> expected
>> >> tests.", which is excellent information in and of itself.
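
(A small, purely hypothetical sketch of what such a report check could look
like once a per-project list exists. The expected set below is invented;
agreeing on the real one is exactly the open question in this thread.)

    # Hypothetical: compare the tests a CI reports against a documented
    # "expected tests" list for the project. Both sets are placeholders.
    EXPECTED = {"tempest.api.network", "tempest.scenario.test_network_basic_ops"}

    def classify_run(tests_run):
        missing = EXPECTED - set(tests_run)
        if missing:
            return "did NOT perform the expected tests (missing: %s)" % \
                   ", ".join(sorted(missing))
        return "performed the expected tests"

    print(classify_run(["tempest.api.network"]))
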
>> >>
>> >>
>> > Agreed. In Neutron we have made CIs mandatory, but we have not yet agreed
>> > on the minimum set of tests we expect them to run. I reckon this will be
>> > fixed soon.
>> >
>> > I'll try to look at what "SUCCESS" is from a naive standpoint: a CI says
>> > "SUCCESS" if the test suite it ran passed; then one should have a means to
>> > understand whether a CI might blatantly lie or tell "half truths". For
>> > instance, saying it passes tempest.api.network while
>> > tempest.scenario.test_network_basic_ops has not been executed is a half
>> > truth, in my opinion.
>> > Stackalytics can help here, I think. One could create "CI classes"
>> > according to how close they are to the level of the upstream gate, and
>> > then parse the posted results to classify CIs. Now, before cursing me, I
>> > totally understand that this won't be easy at all to implement!
>> > Furthermore, I don't know how this should be reflected in gerrit.
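
(A very rough sketch of the "CI classes" idea, purely illustrative: the class
names and suite lists below are made up, and a real classification would have
to be driven by parsing the results each CI actually posts.)

    # Invented example: bucket a CI by how close the suites it reports are
    # to the upstream gate. Class names and suite lists are placeholders.
    CI_CLASSES = [
        ("gate-equivalent", {"tempest.api.network", "tempest.scenario", "grenade"}),
        ("scenario", {"tempest.api.network", "tempest.scenario"}),
        ("api-only", {"tempest.api.network"}),
    ]

    def ci_class(suites_reported):
        for name, required in CI_CLASSES:
            if required <= set(suites_reported):
                return name
        return "unclassified"

    print(ci_class({"tempest.api.network", "tempest.scenario"}))  # "scenario"
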
>> >
>> >
>> >> ===
>> >>
>> >> In thinking about the likely answers to the above questions, I believe
>> it
>> >> would be prudent to change the Stackalytics report in question [3] in
>> the
>> >> following ways:
>> >>
>> >> a. Change the "Success %" column header to "% Reported +1 Votes"
>> >> b. Change the phrase " Green cell - tests ran successfully, red cell -
>> >> tests failed" to "Green cell - System voted +1, red cell - System
>> voted -1"
>> >>
>> >
>> > That makes sense to me.
>> >
>> >
>> >>
>> >> and then, when we have more and better data (for example, # tests
>> passed,
>> >> failed, skipped, etc), we can provide more detailed information than
>> just
>> >> "reported +1" or not.
>> >>
>> >
>> > I think it should not be too hard to start adding minimal measures such
>> > as "% of voted patches".
>> >
>> >>
>> >> Thoughts?
>> >>
>> >> Best,
>> >> -jay
>> >>
>> >> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
>> >> [2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
>> >> [3] http://stackalytics.com/report/ci/neutron/7
>> >>
>> >
>> >
>> >
>> Thanks for sharing your thoughts, Salvatore.
>>
>> Some additional things to look at:
>>
>> Sean Dague has created a tool in stackforge, gerrit-dash-creator:
>>
>> http://git.openstack.org/cgit/stackforge/gerrit-dash-creator/tree/README.rst
>> which has the ability to build interesting queries on gerrit results. One
>> such example can be found here: http://paste.openstack.org/show/85416/
>> (Note: when this URL was created there was a bug in the syntax, so it works
>> in Chrome but not Firefox. Sean tells me the Firefox bug has since been
>> addressed, though this URL hasn't been updated to the new syntax yet.)
>>
>> This allows the viewer to see categories of reviews based upon their
>> divergence from OpenStack's Jenkins results. I think evaluating
>> divergence from Jenkins might be a metric worth consideration.
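
(To make the divergence idea concrete, a minimal sketch of one way such a
number could be computed, assuming the Jenkins and third-party votes per
patch set have already been pulled out of Gerrit. As Kevin notes above, a
high number is not by itself evidence that a CI is broken.)

    # Sketch only: per-patch-set comparison of a third-party CI's vote with
    # Jenkins' vote. None means that system did not report on that patch set.
    def divergence_pct(vote_pairs):
        compared = [(j, t) for j, t in vote_pairs
                    if j is not None and t is not None]
        if not compared:
            return 0.0
        disagreements = sum(1 for j, t in compared if j != t)
        return 100.0 * disagreements / len(compared)

    print(divergence_pct([(1, 1), (1, -1), (-1, -1), (1, None)]))  # 33.3...
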
>>
>> Also worth looking at is Mikal Still's GUI representation of Neutron CI
>> health:
>> http://www.rcbops.com/gerrit/reports/neutron-cireport.html
>> and Nova CI health:
>> http://www.rcbops.com/gerrit/reports/nova-cireport.html
>>
>> I don't know the details of how the graphs are calculated in these
>> pages, but being able to view passed/failed/missed and compare them to
>> Jenkins is an interesting approach, and I feel it has some merit.
>>
>> Thanks, I think we are getting some good information out in this thread,
>> and I look forward to hearing more thoughts.
>>
>> Thank you,
>> Anita.
>>
>
>
>
> --
> Kevin Benton
>