[openstack-dev] [third-party-ci][neutron] What is "Success" exactly?

Kevin Benton blak111 at gmail.com
Thu Jul 3 20:34:56 UTC 2014


Yes, I can propose a spec for that. It probably won't be until Monday.
Is that okay?


On Thu, Jul 3, 2014 at 11:42 AM, Anita Kuno <anteaya at anteaya.info> wrote:

> On 07/03/2014 02:33 PM, Kevin Benton wrote:
> > Maybe we can require periodic checks against the head of the master
> > branch (which should always pass) and build statistics based on the
> > results of that.
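
To make the suggestion above a bit more concrete, here is a rough sketch (in
Python, with a made-up record format) of the kind of statistic those periodic
runs could feed; a real implementation would read the run results out of the
CI's own job history rather than an inline list:

    # Rough sketch only: compute a "master health" statistic for a third-party
    # CI from its periodic runs against the tip of the master branch. The run
    # records below are made up for illustration.

    from datetime import datetime, timedelta

    runs = [
        {"finished": datetime(2014, 7, 3, 0, 0), "result": "SUCCESS"},
        {"finished": datetime(2014, 7, 3, 6, 0), "result": "FAILURE"},
        {"finished": datetime(2014, 7, 3, 12, 0), "result": "SUCCESS"},
        {"finished": datetime(2014, 7, 3, 18, 0), "result": "SUCCESS"},
    ]

    def master_pass_rate(runs, now, window=timedelta(days=7)):
        """Fraction of periodic master runs inside the window that passed."""
        recent = [r for r in runs if now - r["finished"] <= window]
        if not recent:
            return None  # the CI has not reported any periodic runs lately
        passed = sum(1 for r in recent if r["result"] == "SUCCESS")
        return float(passed) / len(recent)

    rate = master_pass_rate(runs, now=datetime(2014, 7, 4))
    print("pass rate on master over the last week: %.0f%%" % (rate * 100))

The point being that a CI whose periodic runs against master frequently fail
is probably broken itself, regardless of how it votes on individual patches.
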
> I like this suggestion. I really like this suggestion.
>
> Hmmmm, what to do with a good suggestion? I wonder if we could capture
> it in an infra-spec and work on it from there.
>
> Would you feel comfortable offering a draft as an infra-spec and then
> perhaps we can discuss the design through the spec?
>
> What do you think?
>
> Thanks Kevin,
> Anita.
>
> > Otherwise it seems like we have to take a CI system's word for it
> > that a particular patch indeed broke that system.
> >
> > --
> > Kevin Benton
> >
> >
> > On Thu, Jul 3, 2014 at 11:07 AM, Anita Kuno <anteaya at anteaya.info> wrote:
> >
> >> On 07/03/2014 01:27 PM, Kevin Benton wrote:
> >>>> This allows the viewer to see categories of reviews based upon their
> >>>> divergence from OpenStack's Jenkins results. I think evaluating
> >>>> divergence from Jenkins might be a metric worth consideration.
> >>>
> >>> I think the only thing this really reflects, though, is how much the
> >>> third-party CI system is mirroring Jenkins.
> >>> A system that frequently diverges may be functioning perfectly fine;
> >>> it may simply be integration-testing a vastly different code path, so
> >>> it is legitimately detecting failures that the OpenStack CI cannot.
> >> Great.
> >>
> >> How do we measure the degree to which it is legitimately detecting
> >> failures?
> >>
> >> Thanks Kevin,
> >> Anita.
> >>>
> >>> --
> >>> Kevin Benton
> >>>
> >>>
> >>> On Thu, Jul 3, 2014 at 6:49 AM, Anita Kuno <anteaya at anteaya.info> wrote:
> >>>
> >>>> On 07/03/2014 07:12 AM, Salvatore Orlando wrote:
> >>>>> Apologies for quoting again the top post of the thread.
> >>>>>
> >>>>> Comments inline (mostly thinking aloud)
> >>>>> Salvatore
> >>>>>
> >>>>>
> >>>>> On 30 June 2014 22:22, Jay Pipes <jaypipes at gmail.com> wrote:
> >>>>>
> >>>>>> Hi Stackers,
> >>>>>>
> >>>>>> Some recent ML threads [1] and a hot IRC meeting today [2] brought up
> >>>>>> some legitimate questions around how a newly-proposed Stackalytics
> >>>>>> report page for Neutron External CI systems [3] represented the
> >>>>>> results of an external CI system as "successful" or not.
> >>>>>>
> >>>>>> First, I want to say that Ilya and all those involved in the
> >>>>>> Stackalytics program simply want to provide the most accurate
> >>>>>> information to developers in a format that is easily consumed. While
> >>>>>> there need to be some changes in how data is shown (and the wording
> >>>>>> of things like "Tests Succeeded"), I hope that the community knows
> >>>>>> there isn't any ill intent on the part of Mirantis or anyone who
> >>>>>> works on Stackalytics. OK, so let's keep the conversation civil --
> >>>>>> we're all working towards the same goals of transparency and
> >>>>>> accuracy. :)
> >>>>>>
> >>>>>> Alright, now, Anita and Kurt Taylor were asking a very pertinent
> >>>>>> question:
> >>>>>>
> >>>>>> "But what does CI tested really mean? just running tests? or tested
> >>>>>> to pass some level of requirements?"
> >>>>>>
> >>>>>> In this nascent world of external CI systems, we have a set of
> >>>>>> issues that we need to resolve:
> >>>>>>
> >>>>>> 1) All of the CI systems are different.
> >>>>>>
> >>>>>> Some run Bash scripts. Some run Jenkins slaves and devstack-gate
> >>>>>> scripts. Others run custom Python code that spawns VMs and publishes
> >>>>>> logs to some public domain.
> >>>>>>
> >>>>>> As a community, we need to decide whether it is worth putting in the
> >>>>>> effort to create a single, unified, installable and runnable CI
> >>>>>> system, so that we can legitimately say "all of the external systems
> >>>>>> are identical, with the exception of the driver code for vendor X
> >>>>>> being substituted in the Neutron codebase."
> >>>>>>
> >>>>>
> >>>>> I think such a system already exists, and it's documented here:
> >>>>> http://ci.openstack.org/
> >>>>> Still, understanding it is quite a learning curve, and running it is
> >>>>> not exactly straightforward. But I guess that's pretty much
> >>>>> understandable given the complexity of the system, isn't it?
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> If the goal of the external CI systems is to produce reliable,
> >>>>>> consistent results, I feel the answer to the above is "yes", but I'm
> >>>>>> interested to hear what others think. Frankly, in the world of
> >>>>>> benchmarks, it would be unthinkable to say "go ahead and everyone run
> >>>>>> your own benchmark suite", because you would get wildly different
> >>>>>> results. A similar problem has emerged here.
> >>>>>>
> >>>>>
> >>>>> I don't think the particular infrastructure (which might range from
> >>>>> an openstack-ci clone to a 100-line bash script) would have an impact
> >>>>> on the "reliability" of the quality assessment regarding a particular
> >>>>> driver or plugin. That is determined, in my opinion, by the quantity
> >>>>> and nature of the tests one runs on a specific driver. In Neutron,
> >>>>> for instance, there is a wide range of choices - from a few test
> >>>>> cases in tempest.api.network to the full smoketest job. As long as
> >>>>> there is no minimal standard here, it will be difficult to assess the
> >>>>> quality of a CI system's evaluation, unless we explicitly take test
> >>>>> coverage into account in that evaluation.
> >>>>>
> >>>>> On the other hand, different CI infrastructures will differ in the %
> >>>>> of patches tested and the % of infrastructure failures. I think it
> >>>>> might not be a terrible idea to use these parameters to evaluate how
> >>>>> good a CI is from an infra standpoint. However, there are still open
> >>>>> questions. For instance, a CI might have a low patch % score simply
> >>>>> because it only needs to test patches affecting a given driver.
> >>>>>
> >>>>>
> >>>>>> 2) There is no mediation or verification that the external CI system
> >>>>>> is actually testing anything at all
> >>>>>>
> >>>>>> As a community, we need to decide whether the current system of
> >>>>>> self-policing should continue. If it should, then language on reports
> >>>>>> like [3] should be very clear that any numbers derived from such
> >>>>>> systems should be taken with a grain of salt. Use of the word
> >>>>>> "Success" should be avoided, as it has connotations (in English, at
> >>>>>> least) that the result has been verified, which is simply not the
> >>>>>> case as long as no verification or mediation occurs for any external
> >>>>>> CI system.
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> 3) There is no clear indication of what tests are being run, and
> >>>>>> therefore there is no clear indication of what "success" is
> >>>>>>
> >>>>>> I think we can all agree that a test has three possible outcomes:
> >>>>>> pass, fail, and skip. The results of a test suite run are therefore
> >>>>>> nothing more than the aggregation of which tests passed, which
> >>>>>> failed, and which were skipped.
> >>>>>>
> >>>>>> As a community, we must document, for each project, the expected set
> >>>>>> of tests that must be run for each patch merged into the project's
> >>>>>> source tree. This documentation should be discoverable so that
> >>>>>> reports like [3] can be crystal-clear on what the data shown actually
> >>>>>> means. The report is simply displaying the data it receives from
> >>>>>> Gerrit. The community needs to be proactive in saying "this is what
> >>>>>> is expected to be tested." This alone would allow the report to give
> >>>>>> information such as "External CI system ABC performed the expected
> >>>>>> tests. X tests passed. Y tests failed. Z tests were skipped."
> >>>>>> Likewise, it would also make it possible for the report to give
> >>>>>> information such as "External CI system DEF did not perform the
> >>>>>> expected tests.", which is excellent information in and of itself.
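
To illustrate, once the expected set of tests is documented somewhere
machine-readable, producing report language like the above becomes a small
exercise. A sketch, where the test names, result format and wording are all
invented rather than anything Stackalytics actually does:

    # Sketch only: compare what a CI reported against a documented "expected
    # tests" list and produce the kind of summary described above.

    expected_tests = {
        "tempest.api.network.test_networks",
        "tempest.api.network.test_ports",
        "tempest.scenario.test_network_basic_ops",
    }

    # per-test outcome as reported by some external CI (note the scenario
    # test was never run at all)
    reported = {
        "tempest.api.network.test_networks": "pass",
        "tempest.api.network.test_ports": "fail",
    }

    counts = {"pass": 0, "fail": 0, "skip": 0}
    for outcome in reported.values():
        counts[outcome] += 1
    missing = expected_tests - set(reported)

    if missing:
        print("External CI system DEF did not perform the expected tests.")
        print("missing: %s" % ", ".join(sorted(missing)))
    else:
        print("External CI system ABC performed the expected tests.")
    print("%(pass)d passed, %(fail)d failed, %(skip)d skipped" % counts)

The "missing" set is the interesting part, since that is what separates
"performed the expected tests" from "reported SUCCESS on something else".
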
> >>>>>>
> >>>>>>
> >>>>> Agreed. In Neutron we have enforced CIs, but we have not yet agreed
> >>>>> on the minimum set of tests we expect them to run. I reckon this will
> >>>>> be fixed soon.
> >>>>>
> >>>>> I'll try to look at what "SUCCESS" is from a naive standpoint: a CI
> >>>>> says "SUCCESS" if the test suite it ran passed; then one should have
> >>>>> the means to understand whether a CI might blatantly lie or tell
> >>>>> "half truths". For instance, saying it passes tempest.api.network
> >>>>> while tempest.scenario.test_network_basic_ops has not been executed
> >>>>> is a half truth, in my opinion.
> >>>>> Stackalytics can help here, I think. One could define "CI classes"
> >>>>> according to how close they are to the level of the upstream gate,
> >>>>> and then parse the posted results to classify CIs. Now, before
> >>>>> cursing me, I totally understand that this won't be easy at all to
> >>>>> implement! Furthermore, I don't know how this should be reflected in
> >>>>> gerrit.
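
A first cut at those "CI classes" could be very simple. Here is a rough
sketch, with invented job names and class labels, just to show the shape of
the idea:

    # Very rough sketch: bucket a CI system by how close the job set it
    # reports is to what the upstream gate runs. Job names and class labels
    # are made up for illustration.

    UPSTREAM_GATE_JOBS = {
        "tempest-dsvm-neutron-full",
        "tempest-dsvm-neutron-api",
    }

    def ci_class(reported_jobs):
        """Classify a CI by its overlap with the upstream gate's job set."""
        overlap = UPSTREAM_GATE_JOBS & set(reported_jobs)
        if overlap == UPSTREAM_GATE_JOBS:
            return "class A: runs the full upstream-equivalent job set"
        elif overlap:
            return "class B: runs a subset of the upstream job set"
        return "class C: runs only its own jobs"

    print(ci_class(["tempest-dsvm-neutron-api", "vendor-x-driver-job"]))
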
> >>>>>
> >>>>>
> >>>>>> ===
> >>>>>>
> >>>>>> In thinking about the likely answers to the above questions, I
> >>>>>> believe it would be prudent to change the Stackalytics report in
> >>>>>> question [3] in the following ways:
> >>>>>>
> >>>>>> a. Change the "Success %" column header to "% Reported +1 Votes"
> >>>>>> b. Change the phrase "Green cell - tests ran successfully, red cell -
> >>>>>> tests failed" to "Green cell - System voted +1, red cell - System
> >>>>>> voted -1"
> >>>>>>
> >>>>>
> >>>>> That makes sense to me.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> and then, when we have more and better data (for example, # tests
> >>>>>> passed, failed, skipped, etc.), we can provide more detailed
> >>>>>> information than just "reported +1" or not.
> >>>>>>
> >>>>>
> >>>>> I think it should not be too hard to start adding minimal measures
> >>>>> such as "% of voted patches".
> >>>>>
> >>>>>>
> >>>>>> Thoughts?
> >>>>>>
> >>>>>> Best,
> >>>>>> -jay
> >>>>>>
> >>>>>> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html
> >>>>>> [2] http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html
> >>>>>> [3] http://stackalytics.com/report/ci/neutron/7
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>> Thanks for sharing your thoughts, Salvatore.
> >>>>
> >>>> Some additional things to look at:
> >>>>
> >>>> Sean Dague has created a tool in stackforge, gerrit-dash-creator:
> >>>>
> >>>> http://git.openstack.org/cgit/stackforge/gerrit-dash-creator/tree/README.rst
> >>>>
> >>>> which has the ability to make interesting queries on gerrit results.
> >>>> One such example can be found here:
> >>>> http://paste.openstack.org/show/85416/
> >>>> (Note: when this url was created there was a bug in the syntax, so it
> >>>> works in Chrome but not Firefox. Sean tells me the Firefox bug has
> >>>> been addressed, though this url hasn't been updated to the new syntax
> >>>> yet.)
> >>>>
> >>>> This allows the viewer to see categories of reviews based upon their
> >>>> divergence from OpenStack's Jenkins results. I think evaluating
> >>>> divergence from Jenkins might be a metric worth consideration.
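
As a strawman, a "divergence from Jenkins" number could be computed from the
Verified votes both systems left on the same patch sets. The sketch below
assumes the vote data has already been pulled out of Gerrit into simple
records (the field names and change ids are made up):

    # Sketch of a possible "divergence from Jenkins" measure: the fraction of
    # commonly-voted patchsets where the third-party CI and Jenkins disagree.

    votes = [
        {"change": "I1111", "jenkins": +1, "third_party": +1},
        {"change": "I2222", "jenkins": +1, "third_party": -1},
        {"change": "I3333", "jenkins": -1, "third_party": -1},
        {"change": "I4444", "jenkins": +1, "third_party": +1},
    ]

    def divergence(votes):
        """Fraction of commonly-voted patchsets where the two votes disagree."""
        if not votes:
            return None
        disagree = sum(1 for v in votes if v["jenkins"] != v["third_party"])
        return float(disagree) / len(votes)

    print("divergence from Jenkins: %.0f%%" % (divergence(votes) * 100))

Of course, as discussed earlier in the thread, a high divergence number by
itself doesn't distinguish a broken CI from one that is legitimately catching
failures Jenkins can't; that is exactly the gap the periodic master-branch
checks would help close.
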
> >>>>
> >>>> Also worth looking at is Mikal Still's GUI representation of Neutron
> >>>> CI health:
> >>>> http://www.rcbops.com/gerrit/reports/neutron-cireport.html
> >>>> and Nova CI health:
> >>>> http://www.rcbops.com/gerrit/reports/nova-cireport.html
> >>>>
> >>>> I don't know the details of how the graphs on these pages are
> >>>> calculated, but being able to view passed/failed/missed results and
> >>>> compare them to Jenkins is an interesting approach, and I feel it has
> >>>> some merit.
> >>>>
> >>>> Thanks, I think we are getting some good information out in this
> >>>> thread, and I look forward to hearing more thoughts.
> >>>>
> >>>> Thank you,
> >>>> Anita.
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >
> >
> >
> >
> >
>
>



-- 
Kevin Benton