<div dir="ltr">><span style="font-family:arial,sans-serif;font-size:13.333333969116211px">This allows the viewer to see categories of reviews based upon their</span><br style="font-family:arial,sans-serif;font-size:13.333333969116211px">
<span style="font-family:arial,sans-serif;font-size:13.333333969116211px">>divergence from OpenStack's Jenkins results. I think evaluating</span><br style="font-family:arial,sans-serif;font-size:13.333333969116211px">
<span style="font-family:arial,sans-serif;font-size:13.333333969116211px">>divergence from Jenkins might be a metric worth consideration.</span><div><span style="font-family:arial,sans-serif;font-size:13.333333969116211px"><br>
</span></div><div><span style="font-family:arial,sans-serif;font-size:13.333333969116211px">I think the only thing this really reflects, though, is how closely the third-party CI system mirrors Jenkins.</span></div><div>
<font face="arial, sans-serif">A system that frequently diverges may be functioning perfectly well: it may simply be integration testing a vastly different code path, and so legitimately detecting failures that the OpenStack CI cannot.</font></div>
<div><span style="font-family:arial,sans-serif;font-size:13.333333969116211px"><br></span></div><div><span style="font-family:arial,sans-serif;font-size:13.333333969116211px">--</span></div><div><span style="font-family:arial,sans-serif;font-size:13.333333969116211px">Kevin Benton</span></div>
</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Jul 3, 2014 at 6:49 AM, Anita Kuno <span dir="ltr"><<a href="mailto:anteaya@anteaya.info" target="_blank">anteaya@anteaya.info</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 07/03/2014 07:12 AM, Salvatore Orlando wrote:<br>
> Apologies for quoting the top post of the thread again.<br>
><br>
> Comments inline (mostly thinking aloud)<br>
> Salvatore<br>
><br>
><br>
> On 30 June 2014 22:22, Jay Pipes <<a href="mailto:jaypipes@gmail.com">jaypipes@gmail.com</a>> wrote:<br>
><br>
>> Hi Stackers,<br>
>><br>
>> Some recent ML threads [1] and a hot IRC meeting today [2] brought up some<br>
>> legitimate questions around how a newly-proposed Stackalytics report page<br>
>> for Neutron External CI systems [3] represented the results of an external<br>
>> CI system as "successful" or not.<br>
>><br>
>> First, I want to say that Ilya and all those involved in the Stackalytics<br>
>> program simply want to provide the most accurate information to developers<br>
>> in a format that is easily consumed. While there need to be some changes in<br>
>> how data is shown (and the wording of things like "Tests Succeeded"), I<br>
>> hope that the community knows there isn't any ill intent on the part of<br>
>> Mirantis or anyone who works on Stackalytics. OK, so let's keep the<br>
>> conversation civil -- we're all working towards the same goals of<br>
>> transparency and accuracy. :)<br>
>><br>
>> Alright, now, Anita and Kurt Taylor were asking a very pertinent question:<br>
>><br>
>> "But what does CI tested really mean? just running tests? or tested to<br>
>> pass some level of requirements?"<br>
>><br>
>> In this nascent world of external CI systems, we have a set of issues that<br>
>> we need to resolve:<br>
>><br>
>> 1) All of the CI systems are different.<br>
>><br>
>> Some run Bash scripts. Some run Jenkins slaves and devstack-gate scripts.<br>
>> Others run custom Python code that spawns VMs and publishes logs to some<br>
>> public domain.<br>
>><br>
>> As a community, we need to decide whether it is worth putting in the<br>
>> effort to create a single, unified, installable and runnable CI system, so<br>
>> that we can legitimately say "all of the external systems are identical,<br>
>> with the exception of the driver code for vendor X being substituted in the<br>
>> Neutron codebase."<br>
>><br>
><br>
> I think such a system already exists, and it's documented here:<br>
> <a href="http://ci.openstack.org/" target="_blank">http://ci.openstack.org/</a><br>
> Still, understanding it involves quite a learning curve, and running it is not<br>
> exactly straightforward. But I guess that's pretty much understandable<br>
> given the complexity of the system, isn't it?<br>
><br>
><br>
>><br>
>> If the goal of the external CI systems is to produce reliable, consistent<br>
>> results, I feel the answer to the above is "yes", but I'm interested to<br>
>> hear what others think. Frankly, in the world of benchmarks, it would be<br>
>> unthinkable to say "go ahead and everyone run your own benchmark suite",<br>
>> because you would get wildly different results. A similar problem has<br>
>> emerged here.<br>
>><br>
><br>
> I don't think the particular infrastructure, which might range from an<br>
> openstack-ci clone to a 100-line bash script, would have an impact on the<br>
> "reliability" of the quality assessment regarding a particular driver or<br>
> plugin. This is determined, in my opinion, by the quantity and nature of<br>
> tests one runs on a specific driver. In Neutron for instance, there is a<br>
> wide range of choices - from a few test cases in tempest.api.network to the<br>
> full smoketest job. As long as there is no minimal standard here, it<br>
> will be difficult to assess the quality of the evaluation from a CI<br>
> system, unless we explicitly take coverage into account in the evaluation.<br>
><br>
> On the other hand, different CI infrastructures will have different levels<br>
> in terms of % of patches tested and % of infrastructure failures. I think<br>
> it might not be a terrible idea to use these parameters to evaluate how<br>
> good a CI is from an infra standpoint. However, there are still open<br>
> questions. For instance, a CI might have a low patch % score because it<br>
> only needs to test patches affecting a given driver.<br>
><br>
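A minimal sketch of how these two infrastructure parameters could be computed, assuming a hypothetical per-patch record. The PatchRecord layout, its outcome strings, and the "relevant" flag are illustrative assumptions, not Gerrit or Stackalytics data structures. The "relevant" flag is one assumed way to avoid penalizing a CI that only needs to test patches affecting a given driver:<br>
<pre>
```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class PatchRecord:
    """One patch as seen by a single external CI (hypothetical layout)."""
    patch_id: str
    relevant: bool           # assumed flag: the patch touches code this CI must test
    outcome: Optional[str]   # "pass", "fail", "infra_error", or None if never reported


def infra_metrics(records: Iterable[PatchRecord]) -> Dict[str, float]:
    """Return % of relevant patches tested and % of reported runs lost to infra errors."""
    relevant = [r for r in records if r.relevant]
    reported = [r for r in relevant if r.outcome is not None]
    infra_errors = [r for r in reported if r.outcome == "infra_error"]
    pct_tested = 100.0 * len(reported) / len(relevant) if relevant else 0.0
    pct_infra = 100.0 * len(infra_errors) / len(reported) if reported else 0.0
    return {"pct_patches_tested": pct_tested, "pct_infra_failures": pct_infra}
```
</pre>
Patches marked as not relevant are left out of both percentages, so a driver-specific CI is only measured against the patches it was expected to test.<br>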
><br>
>> 2) There is no mediation or verification that the external CI system is<br>
>> actually testing anything at all<br>
>><br>
>> As a community, we need to decide whether the current system of<br>
>> self-policing should continue. If it should, then language on reports like<br>
>> [3] should be very clear that any numbers derived from such systems should<br>
>> be taken with a grain of salt. Use of the word "Success" should be avoided,<br>
>> as it has connotations (in English, at least) that the result has been<br>
>> verified, which is simply not the case as long as no verification or<br>
>> mediation occurs for any external CI system.<br>
>><br>
><br>
><br>
><br>
><br>
>> 3) There is no clear indication of what tests are being run, and therefore<br>
>> there is no clear indication of what "success" is<br>
>><br>
>> I think we can all agree that a test has three possible outcomes: pass,<br>
>> fail, and skip. The results of a test suite run therefore is nothing more<br>
>> than the aggregation of which tests passed, which failed, and which were<br>
>> skipped.<br>
>><br>
>> As a community, we must document, for each project, the expected set<br>
>> of tests that must be run for each patch merged into the project's source<br>
>> tree. This documentation should be discoverable so that reports like [3]<br>
>> can be crystal-clear on what the data shown actually means. The report is<br>
>> simply displaying the data it receives from Gerrit. The community needs to<br>
>> be proactive in saying "this is what is expected to be tested." This alone<br>
>> would allow the report to give information such as "External CI system ABC<br>
>> performed the expected tests. X tests passed. Y tests failed. Z tests were<br>
>> skipped." Likewise, it would also make it possible for the report to give<br>
>> information such as "External CI system DEF did not perform the expected<br>
>> tests.", which is excellent information in and of itself.<br>
>><br>
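The report wording proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dict-of-outcomes input, and the idea of passing the expected test set in as a plain set are all hypothetical, since no such documented test list exists yet. A test reported as skipped is treated here as "performed", because it still appears in the results:<br>
<pre>
```python
from collections import Counter
from typing import Dict, Set


def summarize_run(system: str, results: Dict[str, str], expected: Set[str]) -> str:
    """Summarize one CI run; results maps a test name to "pass", "fail", or "skip"."""
    # Any expected test that is absent from the results means the run is incomplete.
    missing = expected.difference(results)
    if missing:
        return f"External CI system {system} did not perform the expected tests."
    # Count outcomes over the expected tests only; extra tests are ignored.
    counts = Counter(results[name] for name in expected)
    return (
        f"External CI system {system} performed the expected tests. "
        f"{counts['pass']} tests passed. {counts['fail']} tests failed. "
        f"{counts['skip']} tests were skipped."
    )
```
</pre>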
>><br>
> Agreed. In Neutron we have enforced CIs but not yet agreed on the<br>
> minimum set of tests we expect them to run. I reckon this will be fixed<br>
> soon.<br>
><br>
> I'll try to look at what "SUCCESS" is from a naive standpoint: a CI says<br>
> "SUCCESS" if the test suite it ran passed; then one should have means to<br>
> understand whether a CI might blatantly lie or tell "half truths". For<br>
> instance, saying it passes tempest.api.network while<br>
> tempest.scenario.test_network_basic_ops has not been executed is a half<br>
> truth, in my opinion.<br>
> Stackalytics can help here, I think. One could create "CI classes"<br>
> according to how close they are to the level of the upstream gate, and<br>
> then parse the results posted to classify CIs. Now, before cursing me, I<br>
> totally understand that this won't be easy at all to implement!<br>
> Furthermore, I don't know how this should be reflected in gerrit.<br>
><br>
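One possible reading of the "CI classes" idea, as a rough Python sketch: classify a CI by the fraction of the upstream gate's tests that it actually executed. The class names and the 50% threshold are purely hypothetical, and the hard part (parsing posted results to learn which tests ran) is assumed to have been done already:<br>
<pre>
```python
from typing import Set


def classify_ci(executed: Set[str], gate_tests: Set[str]) -> str:
    """Classify a CI by how much of the upstream gate's test set it executed."""
    if not gate_tests:
        raise ValueError("gate_tests must not be empty")
    coverage = len(executed.intersection(gate_tests)) / len(gate_tests)
    if coverage == 1.0:
        return "gate-equivalent"   # hypothetical class name
    if coverage >= 0.5:            # hypothetical threshold
        return "partial"
    return "minimal"
```
</pre>
With a gate made up of tempest.api.network and tempest.scenario.test_network_basic_ops, a CI that ran only the former would land in the "partial" class under these assumed thresholds.<br>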
><br>
>> ===<br>
>><br>
>> In thinking about the likely answers to the above questions, I believe it<br>
>> would be prudent to change the Stackalytics report in question [3] in the<br>
>> following ways:<br>
>><br>
>> a. Change the "Success %" column header to "% Reported +1 Votes"<br>
>> b. Change the phrase " Green cell - tests ran successfully, red cell -<br>
>> tests failed" to "Green cell - System voted +1, red cell - System voted -1"<br>
>><br>
><br>
> That makes sense to me.<br>
><br>
><br>
>><br>
>> and then, when we have more and better data (for example, # tests passed,<br>
>> failed, skipped, etc), we can provide more detailed information than just<br>
>> "reported +1" or not.<br>
>><br>
><br>
> I think it should not be too hard to start adding minimal measures such as<br>
> "% of voted patches"<br>
><br>
>><br>
>> Thoughts?<br>
>><br>
>> Best,<br>
>> -jay<br>
>><br>
>> [1] <a href="http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html" target="_blank">http://lists.openstack.org/pipermail/openstack-dev/2014-June/038933.html</a><br>
>> [2] <a href="http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html" target="_blank">http://eavesdrop.openstack.org/meetings/third_party/2014/third_party.2014-06-30-18.01.log.html</a><br>
>> [3] <a href="http://stackalytics.com/report/ci/neutron/7" target="_blank">http://stackalytics.com/report/ci/neutron/7</a><br>
>><br>
>> _______________________________________________<br>
>> OpenStack-dev mailing list<br>
>> <a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>
>> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
>><br>
><br>
><br>
><br>
><br>
</div></div>Thanks for sharing your thoughts, Salvatore.<br>
<br>
Some additional things to look at:<br>
<br>
Sean Dague has created a tool in stackforge, gerrit-dash-creator:<br>
<a href="http://git.openstack.org/cgit/stackforge/gerrit-dash-creator/tree/README.rst" target="_blank">http://git.openstack.org/cgit/stackforge/gerrit-dash-creator/tree/README.rst</a><br>
which can make interesting queries on Gerrit results. One<br>
such example can be found here: <a href="http://paste.openstack.org/show/85416/" target="_blank">http://paste.openstack.org/show/85416/</a><br>
(Note: when this URL was created there was a bug in the syntax, so it<br>
works in Chrome but not Firefox. Sean tells me the Firefox bug has<br>
been addressed, though this URL hasn't been updated to the new syntax<br>
yet.)<br>
<br>
This allows the viewer to see categories of reviews based upon their<br>
divergence from OpenStack's Jenkins results. I think evaluating<br>
divergence from Jenkins might be a metric worth consideration.<br>
<br>
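To make the divergence idea concrete, here is a minimal Python sketch. It assumes the votes have already been fetched into plain dicts mapping a patch identifier to +1 or -1; the function names and bucket labels are hypothetical and are not part of gerrit-dash-creator. A high divergence rate is not by itself evidence of a broken CI, since a system testing a different code path may legitimately disagree with Jenkins:<br>
<pre>
```python
from collections import Counter
from typing import Dict


def divergence_report(jenkins: Dict[str, int], external: Dict[str, int]) -> Counter:
    """Bucket each Jenkins-voted patch by how the external CI's vote compares."""
    buckets: Counter = Counter()
    for patch_id, jenkins_vote in jenkins.items():
        ci_vote = external.get(patch_id)
        if ci_vote is None:
            buckets["missed"] += 1     # the external CI never voted on this patch
        elif ci_vote == jenkins_vote:
            buckets["agree"] += 1
        else:
            buckets["diverge"] += 1
    return buckets


def divergence_rate(buckets: Counter) -> float:
    """Fraction of patches voted on by both systems where the votes differ."""
    voted = buckets["agree"] + buckets["diverge"]
    return buckets["diverge"] / voted if voted else 0.0
```
</pre>
Missed patches are kept in their own bucket rather than counted as divergence, so a CI that votes rarely is not confused with one that votes differently.<br>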
Also worth looking at is Mikal Still's GUI for<br>
Neutron CI health:<br>
<a href="http://www.rcbops.com/gerrit/reports/neutron-cireport.html" target="_blank">http://www.rcbops.com/gerrit/reports/neutron-cireport.html</a><br>
and Nova CI health: <a href="http://www.rcbops.com/gerrit/reports/nova-cireport.html" target="_blank">http://www.rcbops.com/gerrit/reports/nova-cireport.html</a><br>
<br>
I don't know the details of how the graphs are calculated in these<br>
pages, but being able to view passed/failed/missed and compare them to<br>
Jenkins is an interesting approach and I feel it has some merit.<br>
<br>
Thanks, I think we are getting some good information out in this thread,<br>
and I look forward to hearing more thoughts.<br>
<br>
Thank you,<br>
Anita.<br>
<div class="HOEnZb"><div class="h5"><br>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div>Kevin Benton</div>
</div>