On Thu, Feb 28, 2019 at 9:27 AM Clark Boylan <cboylan@sapwetik.org> wrote:
On Thu, Feb 28, 2019, at 6:16 AM, Bogdan Dobrelya wrote:
On Mon, 2019-02-25 at 19:42 -0500, Clark Boylan wrote:
On Mon, Feb 25, 2019, at 12:51 PM, Ben Nemec wrote:
snip
That said, I wouldn't push too hard in either direction until someone crunches the numbers and figures out how much time it would have saved to not run long tests on patch sets with failing unit tests. I feel like it's probably possible to figure that out, and if so then we should do it before making any big decisions on this.
For numbers, the elastic-recheck tool [0] gives us fairly accurate tracking of which issues in the system cause tests to fail. You can use this as a starting point to potentially figure out how expensive indentation errors caught by the pep8 jobs end up being, or how often unittests fail. You probably need to tweak the queries there to get that specific though.
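For anyone who wants to reproduce these counts, here is a rough sketch of the kind of query involved, assuming a logstash-style index with build_status/build_name fields. The host, index name, field names, and job name patterns below are placeholders, not the real schema, and would need to be matched against whatever elastic-recheck actually indexes:

    # Rough sketch only: count failed pep8 and unit test job runs over the
    # last week with a query_string search against a logstash-style index.
    # Host, index name, and field names are assumptions.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://logstash.example.org:9200"])  # hypothetical host

    def count_failures(build_name_pattern):
        """Count FAILURE results for jobs matching build_name_pattern."""
        query = {
            "query": {
                "query_string": {
                    "query": (
                        'build_status:"FAILURE" '
                        'AND build_name:%s '
                        'AND @timestamp:[now-7d TO now]' % build_name_pattern
                    )
                }
            }
        }
        result = es.search(index="logstash-*", body=query, size=0)
        # "total" is an int on older Elasticsearch and a dict on 7.x+
        return result["hits"]["total"]

    print("pep8 failures:", count_failures("*pep8*"))
    print("unit test failures:", count_failures("*py3*"))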
Periodically I also dump node resource utilization by project, repo, and job [1]. I haven't automated this because Tobiash has written a much better tool that has Zuul inject this data into graphite, and we should be able to set up a grafana dashboard for that in the future instead.
These numbers won't tell the whole story, but they should paint a fairly accurate high-level picture of the types of things we should look at to be more node efficient and "time in gate" efficient. Looking at these two really quickly myself, it seems that job timeouts are a big cost (anyone looking into why our jobs time out?).
[0] http://status.openstack.org/elastic-recheck/index.html
[1] http://paste.openstack.org/show/746083/
Hope this helps, Clark
Here are some numbers [0] extracted via elastic-recheck console queries. They show that 6% of failures are wasted on tox issues in general, and 3% for TripleO projects in particular.
Are these wasted failures? The queries appear to be tracking valid failures of those jobs. These valid failures are then actionable feedback for developers to fix their changes.
If we expect these failures to go away we'll need to be much more forceful about getting developers to run tox locally before they push.
We need to compare (and this is a rough example) the resource usage and developer time of the two flows: complete batch results, where a pep8 issue and an integration job issue both show up in patchset one, both get fixed, and all tests pass in patchset two; versus staged results, where the pep8 failure shows up in patchset one, the integration failure in patchset two, and all tests pass in patchset three.
Today:
    patchset one:   pep8 FAILURE   unittest SUCCESS   integration FAILURE
    patchset two:   pep8 SUCCESS   unittest SUCCESS   integration SUCCESS

Proposed Future:
    patchset one:   pep8 FAILURE   unittest SUCCESS
    patchset two:   pep8 SUCCESS   unittest SUCCESS   integration FAILURE
    patchset three: pep8 SUCCESS   unittest SUCCESS   integration SUCCESS
There are strictly more patchsets (developer roundtrips) and test runs in my contrived example. Reality will depend on how many iterations we actually see in the real world (e.g. whether we fix bugs reliably based on test feedback, and how often unit and integration tests fail for different reasons).
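To make the tradeoff concrete, here is a back-of-the-envelope sketch of the two flows above. The job durations are invented placeholders; the real numbers would come from the node utilization reports:

    # Back-of-the-envelope comparison of the two flows sketched above.
    # Job durations are invented placeholders, not measured values.
    JOB_MINUTES = {"pep8": 5, "unittest": 10, "integration": 90}  # hypothetical

    today = [
        # patchset one: everything runs in parallel, pep8 and integration fail
        ["pep8", "unittest", "integration"],
        # patchset two: everything runs again and passes
        ["pep8", "unittest", "integration"],
    ]

    proposed = [
        # patchset one: integration skipped because pep8 failed
        ["pep8", "unittest"],
        # patchset two: pep8 fixed, integration now runs and fails
        ["pep8", "unittest", "integration"],
        # patchset three: all green
        ["pep8", "unittest", "integration"],
    ]

    def cost(patchsets):
        node_minutes = sum(JOB_MINUTES[j] for ps in patchsets for j in ps)
        return len(patchsets), node_minutes

    for name, flow in (("today", today), ("proposed", proposed)):
        roundtrips, node_minutes = cost(flow)
        print("%-8s %d developer roundtrips, %d node-minutes"
              % (name, roundtrips, node_minutes))

With these placeholder durations the staged flow costs one extra roundtrip and a bit more node time for this particular failure pattern; the balance only shifts when an early failure would otherwise have doomed the expensive integration run anyway.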
So I think the thought behind splitting this up is really project specific. For core OpenStack python projects it might make sense to not split them apart and run them all together. For things like TripleO/Kolla/Puppet/etc, where we have layers of interactions that can be affected by the results from the linter/unit jobs, it might make sense to split them out as Bogdan proposes. For example, in TripleO, since we use packages, if the unit tests fail the integration test may also fail, because when we go to build the package with the new source, the unit tests in the package build fail. Thus we know that will be a wasted execution and you won't actually get any results. An alternative:

Today:
    patchset one:   pep8 SUCCESS   unittest FAILURE   integration FAILURE
    patchset two:   pep8 SUCCESS   unittest SUCCESS   integration FAILURE
    patchset three: pep8 SUCCESS   unittest SUCCESS   integration SUCCESS

Future:
    patchset one:   pep8 SUCCESS   unittest FAILURE   integration SKIPPED
    patchset two:   pep8 SUCCESS   unittest SUCCESS   integration FAILURE
    patchset three: pep8 SUCCESS   unittest SUCCESS   integration SUCCESS

This may not be true for devstack, but IMHO I would argue that if the unit tests are failing, then the code is likely bad (backwards compatibility/wrong assumptions about the change/etc.) and we shouldn't be running an actual deployment.
Someone with a better background in statistics will probably tell me this approach is wrong, but using the elasticsearch tooling one approach may be to pick, say, 1k changes, then for each change identify which tests failed on subsequent patchsets. Then we'd be able to infer with some confidence the behaviors we have in the test suites around catching failures: whether failures are independent across jobs, and whether or not we fix them in batches.
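A minimal sketch of what that inference step could look like, assuming we can already extract (change, patchset, job, result) tuples from elasticsearch or Zuul's build records; the sample data below is fabricated purely for illustration:

    # Minimal sketch of the inference step. The records are hypothetical
    # placeholders standing in for data pulled from elasticsearch or Zuul.
    from collections import defaultdict

    records = [
        # (change, patchset, job, result) -- fabricated example data
        (1001, 1, "pep8", "FAILURE"),
        (1001, 1, "unittest", "SUCCESS"),
        (1001, 1, "integration", "FAILURE"),
        (1001, 2, "pep8", "SUCCESS"),
        (1001, 2, "unittest", "SUCCESS"),
        (1001, 2, "integration", "SUCCESS"),
    ]

    # Group results by (change, patchset) so each patchset is one observation.
    by_patchset = defaultdict(dict)
    for change, patchset, job, result in records:
        by_patchset[(change, patchset)][job] = result

    # How often does an integration failure coincide with a pep8/unit failure?
    # If rarely, the jobs fail independently and skipping integration runs
    # buys little; if often, skipping saves real node time.
    both = short_only = integration_only = 0
    for jobs in by_patchset.values():
        short_failed = any(jobs.get(j) == "FAILURE" for j in ("pep8", "unittest"))
        integration_failed = jobs.get("integration") == "FAILURE"
        if short_failed and integration_failed:
            both += 1
        elif short_failed:
            short_only += 1
        elif integration_failed:
            integration_only += 1

    print("both failed:", both, "short jobs only:", short_only,
          "integration only:", integration_only)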
Finally, I think we might also want to consider what the ideal is here. If we find we can optimize the current system for current behaviors, we also need to consider if that is worthwhile given an ideal. Should developers be expected to run unittests and linters locally before pushing? If so then optimizing for when they don't might be effort better spent on making it easier to run the unittests and linters locally and educating developers on how to do so. I think we'd also ideally expect our tests to pass once they run in the gate pipeline. Unfortunately I think our elastic-recheck data shows they often don't and more effort in fixing those failures would provide a dramatic increase in throughput (due to the compounding costs of subsequent gate resets).
Yes, I would hope that developers are running tox -e pep8 / tox -e py36 prior to pushing code. Should they be expected to run integration tests? Probably not. <rant> I feel that allowing integration tests without pushing for better unit tests actually reduces the coverage of the code in unit tests. If I make a change and don't bother covering it with a decent set of unit tests but it passes the integration tests, is it tested well enough? From what I've seen the integration tests only exercise a fraction of the actual functionality, so I feel we really should be pushing people to improve the unit tests, and I wonder if splitting this up would force the issue. </rant>

Thanks,
-Alex
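One low-effort way to nudge that habit is a git pre-push hook that runs the tox environments Alex mentions before allowing a push. This is only an illustrative sketch: the environment names are project-specific assumptions, and the script would be saved as .git/hooks/pre-push and marked executable:

    #!/usr/bin/env python3
    # Sketch of a git pre-push hook that runs the linter and unit test tox
    # environments locally before a push is allowed. The environment names
    # are project-specific assumptions.
    import subprocess
    import sys

    TOX_ENVS = ["pep8", "py36"]

    for env in TOX_ENVS:
        print("pre-push: running tox -e %s" % env)
        result = subprocess.run(["tox", "-e", env])
        if result.returncode != 0:
            print("pre-push: tox -e %s failed, aborting push" % env)
            sys.exit(result.returncode)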
My final take is that, given some middle-ground solution like the one I illustrated earlier in this sub-thread, it might be worth it; spending the effort to boost the total throughput of the OpenStack CI system by 6% is not such a bad idea.
[0] http://paste.openstack.org/show/746503/
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando