On Thu, Nov 18, 2021, at 7:19 AM, Sean Mooney wrote:
On Thu, 2021-11-18 at 15:39 +0100, Balazs Gibizer wrote:
On Wed, Nov 17 2021 at 07:51:57 AM -0800, Clark Boylan <cboylan@sapwetik.org> wrote:
On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
Snip. I want to respond to a specific suggestion:
3) there was informal discussion before about the possibility of re-running only some jobs with a recheck instead of re-running the whole set. I don't know if this is feasible with Zuul, and I think this only treats the symptom, not the root cause. But still, this could be a direction if all else fails.
OpenStack has configured its check and gate queues with something we've called "clean check". This refers to the requirement that before an OpenStack project can be gated it must pass check tests first. This policy was instituted because a number of these infrequent but problematic issues were traced back to recheck spamming. Basically, broken changes would show up and fail some percentage of the time. They got rechecked until they finally merged, and now their failure rate is added to the whole. This rule was introduced to make it more difficult to get this flakiness into the gate.
Locking in test results is in direct opposition to the existing policy and goals. Locking results would make it far more trivial to land such flakiness, as you wouldn't need entire sets of jobs to pass before you could land. Instead you could rerun individual jobs until each one passed and then land the result, potentially introducing significant flakiness with a single merge.
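To put rough numbers on that difference, here is a minimal sketch. It assumes every job passes independently with a fixed probability, which real CI does not guarantee, and the function names and pass rates are illustrative rather than measurements from the OpenStack gate.

```python
def whole_set_pass_probability(job_pass_rates):
    """Chance that every job passes in a single run of the whole set."""
    probability = 1.0
    for rate in job_pass_rates:
        probability *= rate
    return probability


def expected_full_runs(job_pass_rates):
    """Expected number of whole-set runs until one run is entirely green."""
    return 1.0 / whole_set_pass_probability(job_pass_rates)


def expected_builds_with_locking(job_pass_rates):
    """Expected job builds if each job is rerun alone until it passes once
    and its result is then locked in (the hypothetical per-job recheck)."""
    return sum(1.0 / rate for rate in job_pass_rates)


if __name__ == "__main__":
    # Hypothetical change that makes each of 10 jobs pass only 80% of the time.
    rates = [0.8] * 10
    print("chance of one green whole-set run:", whole_set_pass_probability(rates))
    print("expected whole-set runs:", expected_full_runs(rates))
    print("expected builds with locking:", expected_builds_with_locking(rates))
```

Under these assumptions the flaky change needs about nine full runs on average to get one green result today, but only about 1.25 attempts per job if results could be locked, so the barrier that currently discourages landing it mostly disappears.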
Locking results is also not really something that fits well with the speculative gate queues that Zuul runs. Remember that Zuul constructs a future git state and tests that in parallel. Currently the state for OpenStack looks like:
A - Nova
^
B - Glance
^
C - Neutron
^
D - Neutron
^
F - Neutron
The B Glance change is tested as if the A Nova change has already merged, and so on down the queue. If we want to keep these speculative states we can't really have humans manually verify a failure can be ignored and retry it, because we'd be enqueuing job builds at different stages of speculative state. Each job build would be testing a different version of the software.
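A simplified model of that queue makes the problem concrete. This is only a sketch of the idea, not Zuul's implementation; the function names are hypothetical and changes are reduced to single letters.

```python
def speculative_states(queue):
    """For each change, list the changes its jobs are tested on top of,
    ending with the change itself."""
    return [(change, list(queue[:index + 1]))
            for index, change in enumerate(queue)]


def states_after_failure(queue, failed_change):
    """Remove a failed change; everything behind it gets a new state."""
    remaining = [change for change in queue if change != failed_change]
    return speculative_states(remaining)


if __name__ == "__main__":
    queue = ["A", "B", "C", "D", "F"]
    for change, state in speculative_states(queue):
        print(change, "is tested against", state)
    # If B fails and leaves the queue, C is now tested against A and C only.
    for change, state in states_after_failure(queue, "B"):
        print(change, "is now tested against", state)
```

In this model a build for C that ran before B dropped out and a build for C that was retried afterwards cover different combinations of changes, so their results cannot be mixed into a single verdict.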
What we could do is implement a retry limit for failing jobs. Zuul could rerun failing jobs X times before giving up and reporting failure (this would require updates to Zuul). The problem with this approach is that, without some oversight, it becomes very easy to land changes that make things worse. As a side note, Zuul does do retries, but only for detected network errors or when a pre-run playbook fails. The assumption is that network failures are due to the dangers of the Internet, and that pre-run playbooks are small, self-contained, and unlikely to fail, and that when they do fail the failure should be independent of what is being tested.
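As a sketch of why an unsupervised retry limit hides problems, the snippet below models the hypothetical "rerun up to X times" behaviour. It is not an existing Zuul feature; `run_with_retries` and `pass_probability_with_retries` are made-up names, and the probability formula assumes independent attempts.

```python
def run_with_retries(run_job, retry_limit):
    """Run a job until it passes or the retry limit is exhausted.

    run_job is any callable returning True on success.
    Returns (passed, attempts_used).
    """
    attempts = 0
    while attempts <= retry_limit:
        attempts += 1
        if run_job():
            return True, attempts
    return False, attempts


def pass_probability_with_retries(pass_rate, retry_limit):
    """Chance that at least one of 1 + retry_limit attempts passes."""
    return 1.0 - (1.0 - pass_rate) ** (retry_limit + 1)


if __name__ == "__main__":
    # A job that a change has broken half the time, retried up to 3 times.
    print(pass_probability_with_retries(0.5, 3))
```

With a limit of three retries, a job that fails half the time would still report success in roughly 94% of cases under this model, which is how changes that make things worse could land unnoticed.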
Where does that leave us?
I think it is worth considering the original goals of "clean check". We know that rechecking/rerunning only makes these problems worse in the long term. They represent technical debt. One of the reasons we run these tests is to show us when our software is broken. In the case of flaky results we are exposing this technical debt where it impacts the functionality of our software. The longer we avoid fixing these issues the worse it gets, and this is true even with "clean check".
Do we as developers find value in knowing the software needs attention before it gets released to users? Do the users find value in running reliable software? In the past we have asserted that "yes, there is value in this", and have invested in tracking, investigating, and fixing these problems even if they happen infrequently. But that does require investment, and active maintenance.
Thank you Clark! I agree with your view that the current setup provides us with very valuable information about the health of the software we are developing. I also agree that our primary goal should be to fix the flaky tests instead of hiding the results under any kind of rechecks.
Still, I'm wondering what we will do if it turns out that the existing developer bandwidth has shrunk to the point where we simply do not have the capacity to fix this technical debt. What the stable team does on stable branches in Extended Maintenance mode in a similar situation is to simply turn off problematic test jobs. So I guess that is also a valid last resort move.
One option is to "trust" the core team more and grant them the explicit right to workflow +2 and force merge a patch.
Trust is in quotes because it's not really about trusting that the core teams can restrain themselves from blindly merging broken code, but rather that right now we entrust Zuul to be the final gatekeeper of our repo.
When there is a known gate failure and we are trying to land a specific patch to, say, Nova to fix or unblock the Neutron gate, and we can see that the Neutron DNM patch that depends on this Nova fix passed, then we could entrust the core team to override Zuul in this specific case.
We do already give you this option via the removal of tests that are invalid/flaky/not useful. I do worry that if we give a complete end around the CI system it will be quickly abused. We stopped requiring a bug on rechecks because we quickly realized that no one was actually debugging the failure and identifying the underlying issue. Instead they would just recheck with an arbitrary or completely wrong bug identified. I expect similar would happen here. And the end result would be that CI would simply get more flaky and unreliable for the next change. If instead we fix or remove the flaky tests/jobs we'll end up with a system that is more reliable for the next change.
I would expect this capability to be used very sparingly, but we do have some intermittent failures that we can tell are unrelated to the patch, like the current issue with volume attach/detach that results in kernel panics in the guest. If that is the only failure and all other tests passed in gate, I think it would be reasonable for the Neutron team to approve a Neutron patch that modifies security groups, for example. It's very clearly an unrelated failure.
As noted above, it would also be reasonable to stop running tests that cannot function. We do need to be careful that we don't remove tests and never fix the underlying issues though. We should also remember that if we have these problems in CI there is a high chance that our users will have these problems in production later (we've helped more than one of the infra donor clouds identify bugs straight out of elastic-recheck information in the past so this does happen).
That might be an alternative to the recheck we have now, and reserving it for the core team limits the scope for abusing it.
I do think that the original goals of clean check are good, so really I would be suggesting this as an option for when check passed and we get an intermittent failure in gate that we would override.
This would not address the issue in check, but it would make intermittent failures in gate much less painful.
I tried to make this point in my previous email, but I think we are still fumbling around it. If we provide mechanisms to end around flaky CI instead of fixing flaky CI the end result will be flakier CI. I'm not convinced that we'll be happier with any mechanism that doesn't remove the -1 from happening in the first place. Instead the problems will accelerate and eventually we'll be unable to rely on CI for anything useful.