On Thu, Feb 28, 2019 at 9:27 AM Clark Boylan <cboylan@sapwetik.org> wrote:
On Thu, Feb 28, 2019, at 6:16 AM, Bogdan Dobrelya wrote:
On Mon, 2019-02-25 at 19:42 -0500, Clark Boylan wrote:
On Mon, Feb 25, 2019, at 12:51 PM, Ben Nemec wrote:
snip
That said, I wouldn't push too hard in either direction until someone crunches the numbers and figures out how much time it would have saved to not run long tests on patch sets with failing unit tests. I feel like it's probably possible to figure that out, and if so then we should do it before making any big decisions on this.
For numbers, the elastic-recheck tool [0] gives us fairly accurate tracking of which issues in the system cause tests to fail. You can use this as a starting point to potentially figure out how expensive indentation errors caught by the pep8 jobs end up being, or how often unittests fail. You probably need to tweak the queries there to get that specific though.
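For anyone who wants to reproduce these counts, here is a rough sketch of the kind of query involved, assuming a logstash-style index with build_status/build_name fields. The host, index name, field names, and job name patterns below are placeholders, not the real schema, and would need to be matched against whatever elastic-recheck actually indexes:

    # Rough sketch only: count failed pep8 and unit test job runs over the
    # last week with a query_string search against a logstash-style index.
    # Host, index name, and field names are assumptions.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://logstash.example.org:9200"])  # hypothetical host

    def count_failures(build_name_pattern):
        """Count FAILURE results for jobs matching build_name_pattern."""
        query = {
            "query": {
                "query_string": {
                    "query": (
                        'build_status:"FAILURE" '
                        'AND build_name:%s '
                        'AND @timestamp:[now-7d TO now]' % build_name_pattern
                    )
                }
            }
        }
        result = es.search(index="logstash-*", body=query, size=0)
        # "total" is an int on older Elasticsearch and a dict on 7.x+
        return result["hits"]["total"]

    print("pep8 failures:", count_failures("*pep8*"))
    print("unit test failures:", count_failures("*py3*"))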
Periodically I also dump node resource utilization by project, repo, and job [1]. I haven't automated this because Tobiash has written a much better tool that has Zuul inject this data into graphite, and we should be able to set up a grafana dashboard for that in the future instead.
These numbers won't tell the whole story, but they should paint a fairly accurate high-level picture of the types of things we should look at to be more node efficient and "time in gate" efficient. Looking at these two really quickly myself, it seems that job timeouts are a big cost (anyone looking into why our jobs time out?).
[0] http://status.openstack.org/elastic-recheck/index.html
[1] http://paste.openstack.org/show/746083/
Hope this helps, Clark
Here are some numbers [0] extracted via elastic-recheck console queries. They show that 6% of failures are wasted on tox issues in general, and 3% for TripleO projects in particular.
Are these wasted failures? The queries appear to be tracking valid failures of those jobs. These valid failures are then actionable feedback for developers to fix their changes.
If we expect these failures to go away we'll need to be much more forceful about getting developers to run tox locally before they push.
We need to compare (and this is a rough example) the resource usage and developer time of the two flows: complete batch results, where a pep8 issue and an integration job issue both show up in patchset one, both get fixed, and all tests pass in patchset two; versus staged results, where the pep8 failure shows up in patchset one, the integration failure in patchset two, and all tests pass in patchset three.
Today:
    patchset one:   pep8 FAILURE   unittest SUCCESS   integration FAILURE
    patchset two:   pep8 SUCCESS   unittest SUCCESS   integration SUCCESS

Proposed Future:
    patchset one:   pep8 FAILURE   unittest SUCCESS
    patchset two:   pep8 SUCCESS   unittest SUCCESS   integration FAILURE
    patchset three: pep8 SUCCESS   unittest SUCCESS   integration SUCCESS
There are strictly more patchsets (developer roundtrips) and test runs in my contrived example. Reality will depend on how many iterations we actually see in the real world (e.g. whether we fix bugs reliably based on test feedback, and how often unit and integration tests fail for different reasons).
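To make the tradeoff concrete, here is a back-of-the-envelope sketch of the two flows above. The job durations are invented placeholders; the real numbers would come from the node utilization reports:

    # Back-of-the-envelope comparison of the two flows sketched above.
    # Job durations are invented placeholders, not measured values.
    JOB_MINUTES = {"pep8": 5, "unittest": 10, "integration": 90}  # hypothetical

    today = [
        # patchset one: everything runs in parallel, pep8 and integration fail
        ["pep8", "unittest", "integration"],
        # patchset two: everything runs again and passes
        ["pep8", "unittest", "integration"],
    ]

    proposed = [
        # patchset one: integration skipped because pep8 failed
        ["pep8", "unittest"],
        # patchset two: pep8 fixed, integration now runs and fails
        ["pep8", "unittest", "integration"],
        # patchset three: all green
        ["pep8", "unittest", "integration"],
    ]

    def cost(patchsets):
        node_minutes = sum(JOB_MINUTES[j] for ps in patchsets for j in ps)
        return len(patchsets), node_minutes

    for name, flow in (("today", today), ("proposed", proposed)):
        roundtrips, node_minutes = cost(flow)
        print("%-8s %d developer roundtrips, %d node-minutes"
              % (name, roundtrips, node_minutes))

With these placeholder durations the staged flow costs one extra roundtrip and a bit more node time for this particular failure pattern; the balance only shifts when an early failure would otherwise have doomed the expensive integration run anyway.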
So I think the thought behind splitting this up is really project specific. For core OpenStack python projects it might make sense to not split them apart and run them all together. For things like TripleO/Kolla/Puppet/etc, where we have layers of interactions that can be affected by the results from the linter/unit jobs, it might make sense to split them out as Bogdan proposes. For example, in TripleO, since we use packages, if the unit tests fail the integration test may also fail, because when we go to build the package with the new source, the unit tests in the package build fail. Thus we know that will be a wasted execution and you won't actually get any results. An alternative:

Today:
    patchset one:   pep8 SUCCESS   unittest FAILURE   integration FAILURE
    patchset two:   pep8 SUCCESS   unittest SUCCESS   integration FAILURE
    patchset three: pep8 SUCCESS   unittest SUCCESS   integration SUCCESS

Future:
    patchset one:   pep8 SUCCESS   unittest FAILURE   integration SKIPPED
    patchset two:   pep8 SUCCESS   unittest SUCCESS   integration FAILURE
    patchset three: pep8 SUCCESS   unittest SUCCESS   integration SUCCESS

This may not be true for devstack, but IMHO I would argue that if the unit tests are failing, then the code is likely bad (backwards compatibility/wrong assumptions about the change/etc.) and we shouldn't be running an actual deployment.
Someone with a better background in statistics will probably tell me this approach is wrong, but using the elasticsearch tooling one approach may be to pick, say, 1k changes, then for each change identify which tests failed on subsequent patchsets. Then we'd be able to infer with some confidence the behaviors we have in the test suites around catching failures: whether failures are independent across jobs, and whether or not we fix them in batches.
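A minimal sketch of what that inference step could look like, assuming we can already extract (change, patchset, job, result) tuples from elasticsearch or Zuul's build records; the sample data below is fabricated purely for illustration:

    # Minimal sketch of the inference step. The records are hypothetical
    # placeholders standing in for data pulled from elasticsearch or Zuul.
    from collections import defaultdict

    records = [
        # (change, patchset, job, result) -- fabricated example data
        (1001, 1, "pep8", "FAILURE"),
        (1001, 1, "unittest", "SUCCESS"),
        (1001, 1, "integration", "FAILURE"),
        (1001, 2, "pep8", "SUCCESS"),
        (1001, 2, "unittest", "SUCCESS"),
        (1001, 2, "integration", "SUCCESS"),
    ]

    # Group results by (change, patchset) so each patchset is one observation.
    by_patchset = defaultdict(dict)
    for change, patchset, job, result in records:
        by_patchset[(change, patchset)][job] = result

    # How often does an integration failure coincide with a pep8/unit failure?
    # If rarely, the jobs fail independently and skipping integration runs
    # buys little; if often, skipping saves real node time.
    both = short_only = integration_only = 0
    for jobs in by_patchset.values():
        short_failed = any(jobs.get(j) == "FAILURE" for j in ("pep8", "unittest"))
        integration_failed = jobs.get("integration") == "FAILURE"
        if short_failed and integration_failed:
            both += 1
        elif short_failed:
            short_only += 1
        elif integration_failed:
            integration_only += 1

    print("both failed:", both, "short jobs only:", short_only,
          "integration only:", integration_only)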
Finally, I think we might also want to consider what the ideal is here. If we find we can optimize the current system for current behaviors, we also need to consider if that is worthwhile given an ideal. Should developers be expected to run unittests and linters locally before pushing? If so then optimizing for when they don't might be effort better spent on making it easier to run the unittests and linters locally and educating developers on how to do so. I think we'd also ideally expect our tests to pass once they run in the gate pipeline. Unfortunately I think our elastic-recheck data shows they often don't and more effort in fixing those failures would provide a dramatic increase in throughput (due to the compounding costs of subsequent gate resets).
Yes, I would hope that developers are running tox -e pep8 / tox -e py36 prior to pushing code. Should they be expected to run integration tests? Probably not. <rant> I feel that allowing integration tests without pushing for better unit tests actually reduces the coverage of the code in unit tests. If I make a change and don't bother covering it with a decent set of unit tests but it passes the integration tests, is it tested well enough? From what I've seen the integration tests only exercise a fraction of the actual functionality, so I feel we really should be pushing people to improve the unit tests, and I wonder if splitting this up would force the issue. </rant>

Thanks,
-Alex
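One low-effort way to nudge that habit is a git pre-push hook that runs the tox environments Alex mentions before allowing a push. This is only an illustrative sketch: the environment names are project-specific assumptions, and the script would be saved as .git/hooks/pre-push and marked executable:

    #!/usr/bin/env python3
    # Sketch of a git pre-push hook that runs the linter and unit test tox
    # environments locally before a push is allowed. The environment names
    # are project-specific assumptions.
    import subprocess
    import sys

    TOX_ENVS = ["pep8", "py36"]

    for env in TOX_ENVS:
        print("pre-push: running tox -e %s" % env)
        result = subprocess.run(["tox", "-e", env])
        if result.returncode != 0:
            print("pre-push: tox -e %s failed, aborting push" % env)
            sys.exit(result.returncode)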
My final take is that, given some middle-ground solution like the one I illustrated earlier in this sub-thread, it might be worth it; spending the effort to boost the total throughput of the OpenStack CI system by 6% is not such a bad idea.
[0] http://paste.openstack.org/show/746503/
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando