[neutron][CI] How to reduce number of rechecks - brainstorming
Hi,

Recently I spent some time checking how many rechecks we need in Neutron to get a patch merged, and I compared it to some other OpenStack projects (see [1] for details). TL;DR - the results aren't good for us and I think we really need to do something about it.

Of course the "easiest" thing to say is that we should fix the issues which we are hitting in the CI to make the jobs more stable. But it's not that easy. We have been struggling with those jobs for a very long time. We have a CI-related meeting every week and we are fixing what we can there. Unfortunately there is still a bunch of issues which we can't fix so far because they are intermittent and hard to reproduce locally, or in some cases the issues aren't really related to Neutron, or there are new bugs which we need to investigate and fix :) So this is a never-ending battle for us. The problem is that we have to test various backends, drivers, etc., so as a result we have many jobs running on each patch - excluding UT, pep8 and docs jobs we have around 19 jobs in the check and 14 jobs in the gate queue.

In the past we made a lot of improvements, e.g. we improved the irrelevant-files lists for jobs to run fewer jobs on some of the patches, together with the QA team we did the "integrated-networking" template to run only Neutron- and Nova-related scenario tests in the Neutron queues, and we removed and consolidated some of the jobs (there is still one patch in progress for that, but it should remove around 2 more jobs from the check queue). All of those are good improvements, but still not enough to make our CI really stable :/

Because of all of that, I would like to ask the community for any other ideas on how we can improve this. If you have any ideas, please send them in this email thread or reach out to me directly on IRC. We want to discuss them in the next video CI meeting, which will be on November 30th. If you have any idea and would like to join that discussion, you are more than welcome in that meeting of course :)

[1] http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html

--
Slawek Kaplonski
Principal Software Engineer
Red Hat
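For readers who want to reproduce this kind of measurement, the heart of it is just counting "recheck" comments on recently merged changes. Below is a minimal sketch against the standard Gerrit REST API (this is not the script Slawek actually used, which is linked later in the thread; the project name and limit are only examples):

# Rough sketch: average number of "recheck" comments per merged change,
# using the public Gerrit REST API on review.opendev.org.
import json
import requests

GERRIT = "https://review.opendev.org"

def recheck_stats(project="openstack/neutron", limit=100):
    resp = requests.get(
        f"{GERRIT}/changes/",
        params={"q": f"project:{project} status:merged",
                "o": "MESSAGES", "n": limit},
        timeout=60,
    )
    # Gerrit prefixes JSON responses with ")]}'" to prevent XSSI.
    changes = json.loads(resp.text[4:])
    counts = []
    for change in changes:
        rechecks = sum(
            1 for msg in change.get("messages", [])
            if "recheck" in msg.get("message", "").lower()
        )
        counts.append(rechecks)
    return sum(counts) / len(counts) if counts else 0.0

if __name__ == "__main__":
    print("avg rechecks per merged patch:", recheck_stats())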
On Wed, Nov 17 2021 at 09:13:34 AM +0100, Slawek Kaplonski <skaplons@redhat.com> wrote:
Hi,
Recently I spent some time to check how many rechecks we need in Neutron to get patch merged and I compared it to some other OpenStack projects (see [1] for details). TL;DR - results aren't good for us and I think we really need to do something with that.
I really like the idea of collecting such stats. Thank you for doing it. I can even imagine making a public dashboard somewhere with this information, as it is a good indication of the health of our projects / testing.
Of course "easiest" thing to say is that we should fix issues which we are hitting in the CI to make jobs more stable. But it's not that easy. We are struggling with those jobs for very long time. We have CI related meeting every week and we are fixing what we can there. Unfortunately there is still bunch of issues which we can't fix so far because they are intermittent and hard to reproduce locally or in some cases the issues aren't realy related to the Neutron or there are new bugs which we need to investigate and fix :)
I have a couple of suggestions based on my experience working with CI in nova.

1) We try to open bug reports for intermittent gate failures too and keep them tagged in a list [1], so when a job fails it is easy to check if the bug is known.

2) I offer my help here now: if you see something in neutron runs that feels non-neutron-specific, then ping me with it. Maybe we are struggling with the same problem too.

3) There was informal discussion before about a possibility to re-run only some jobs with a recheck instead of re-running the whole set. I don't know if this is feasible with Zuul, and I think this only treats the symptom, not the root cause. But still this could be a direction if all else fails.

Cheers,
gibi
So this is never ending battle for us. The problem is that we have to test various backends, drivers, etc. so as a result we have many jobs running on each patch - excluding UT, pep8 and docs jobs we have around 19 jobs in check and 14 jobs in gate queue.
In the past we made a lot of improvements, like e.g. we improved irrelevant files lists for jobs to run less jobs on some of the patches, together with QA team we did "integrated-networking" template to run only Neutron and Nova related scenario tests in the Neutron queues, we removed and consolidated some of the jobs (there is still one patch in progress for that but it should just remove around 2 jobs from the check queue). All of that are good improvements but still not enough to make our CI really stable :/
Because of all of that, I would like to ask community about any other ideas how we can improve that. If You have any ideas, please send it in this email thread or reach out to me directly on irc. We want to discuss about them in the next video CI meeting which will be on November 30th. If You would have any idea and would like to join that discussion, You are more than welcome in that meeting of course :)
[1] http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html
[1] https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure&orderby=-date_last_updated&start=0
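As a side note, that tagged list can also be pulled programmatically for reports or dashboards. A rough sketch, assuming launchpadlib with anonymous read-only access (attribute names follow the public Launchpad API; adjust if your launchpadlib version differs):

# Hedged sketch: list nova bugs tagged "gate-failure" via launchpadlib.
from launchpadlib.launchpad import Launchpad

lp = Launchpad.login_anonymously("gate-failure-report", "production",
                                 version="devel")
nova = lp.projects["nova"]
for task in nova.searchTasks(tags=["gate-failure"]):
    # task.title looks like: Bug #NNNNNN in nova: "..."
    print(task.status, "-", task.title)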
-- Slawek Kaplonski Principal Software Engineer Red Hat
On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
Snip. I want to respond to a specific suggestion:
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
OpenStack has configured its check and gate queues with something we've called "clean check". This refers to the requirement that before an OpenStack project can be gated it must pass check tests first. This policy was instituted because a number of these infrequent but problematic issues were traced back to recheck spamming. Basically changes would show up and were broken. They would fail some percentage of the time. They got rechecked until they finally merged and now their failure rate is added to the whole. This rule was introduced to make it more difficult to get this flakiness into the gate.

Locking in test results is in direct opposition to the existing policy and goals. Locking results would make it far more trivial to land such flakiness as you wouldn't need entire sets of jobs to pass before you could land. Instead you could rerun individual jobs until each one passed and then land the result. Potentially introducing significant flakiness with a single merge.

Locking results is also not really something that fits well with the speculative gate queues that Zuul runs. Remember that Zuul constructs a future git state and tests that in parallel. Currently the state for OpenStack looks like:

A - Nova
^
B - Glance
^
C - Neutron
^
D - Neutron
^
F - Neutron

The B glance change is tested as if the A Nova change has already merged and so on down the queue. If we want to keep these speculative states we can't really have humans manually verify a failure can be ignored and retry it. Because we'd be enqueuing job builds at different stages of speculative state. Each job build would be testing a different version of the software.

What we could do is implement a retry limit for failing jobs. Zuul could rerun failing jobs X times before giving up and reporting failure (this would require updates to Zuul). The problem with this approach is that without some oversight it becomes very easy to land changes that make things worse. As a side note, Zuul does do retries, but only for detected network errors or when a pre-run playbook fails. The assumption is that network failures are due to the dangers of the Internet, and that pre-run playbooks are small, self contained, unlikely to fail, and when they do fail the failure should be independent of what is being tested.

Where does that leave us?

I think it is worth considering the original goals of "clean check". We know that rechecking/rerunning only makes these problems worse in the long term. They represent technical debt. One of the reasons we run these tests is to show us when our software is broken. In the case of flaky results we are exposing this technical debt where it impacts the functionality of our software. The longer we avoid fixing these issues the worse it gets, and this is true even with "clean check".

Do we as developers find value in knowing the software needs attention before it gets released to users? Do the users find value in running reliable software? In the past we have asserted that "yes, there is value in this", and have invested in tracking, investigating, and fixing these problems even if they happen infrequently. But that does require investment, and active maintenance.

Clark
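To put a rough number on the retry-limit concern raised above, here is a quick back-of-the-envelope illustration (not from the thread, purely arithmetic): with automatic per-job retries, a change that introduces a sizeable per-run failure rate in a single job would still merge almost every time, which is exactly how flakiness would accumulate unnoticed.

# Probability that a flaky job eventually reports success when it is
# retried up to `attempts` times before the change is allowed to land.
def p_lands(per_run_failure_rate: float, attempts: int) -> float:
    return 1 - per_run_failure_rate ** attempts

for rate in (0.05, 0.10, 0.20):
    print(f"per-run failure rate {rate:.0%}: "
          f"1 attempt -> {p_lands(rate, 1):.1%} merge chance, "
          f"3 attempts -> {p_lands(rate, 3):.2%}")
# e.g. a 20% flaky job passes 80% of the time with one attempt,
# but 99.2% of the time with three attempts.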
On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
Snip. I want to respond to a specific suggestion:
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
OpenStack has configured its check and gate queues with something we've called "clean check". This refers to the requirement that before an OpenStack project can be gated it must pass check tests first. This policy was instituted because a number of these infrequent but problematic issues were traced back to recheck spamming. Basically changes would show up and were broken. They would fail some percentage of the time. They got rechecked until they finally merged and now their failure rate is added to the whole. This rule was introduced to make it more difficult to get this flakyness into the gate.
Locking in test results is in direct opposition to the existing policy and goals. Locking results would make it far more trivial to land such flakyness as you wouldn't need entire sets of jobs to pass before you could land. Instead you could rerun individual jobs until each one passed and then land the result. Potentially introducing significant flakyness with a single merge.
Locking results is also not really something that fits well with the speculative gate queues that Zuul runs. Remember that Zuul constructs a future git state and tests that in parallel. Currently the state for OpenStack looks like:
A - Nova
^
B - Glance
^
C - Neutron
^
D - Neutron
^
F - Neutron
The B glance change is tested as if the A Nova change has already merged and so on down the queue. If we want to keep these speculative states we can't really have humans manually verify a failure can be ignored and retry it. Because we'd be enqueuing job builds at different stages of speculative state. Each job build would be testing a different version of the software.
What we could do is implement a retry limit for failing jobs. Zuul could rerun failing jobs X times before giving up and reporting failure (this would require updates to Zuul). The problem with this approach is without some oversight it becomes very easy to land changes that make things worse. As a side note Zuul does do retries, but only for detected network errors or when a pre-run playbook fails. The assumption is that network failures are due to the dangers of the Internet, and that pre-run playbooks are small, self contained, unlikely to fail, and when they do fail the failure should be independent of what is being tested.
Where does that leave us?
I think it is worth considering the original goals of "clean check". We know that rechecking/rerunning only makes these problems worse in the long term. They represent technical debt. One of the reasons we run these tests is to show us when our software is broken. In the case of flaky results we are exposing this technical debt where it impacts the functionality of our software. The longer we avoid fixing these issues the worse it gets, and this is true even with "clean check".

Hi,

Thx Clark for the detailed explanation about that :)
I agree with you on that, and I would really like to find a better/other solution for the Neutron problem than rechecking only broken jobs, as I'm pretty sure that this would quickly make things much worse.
Do we as developers find value in knowing the software needs attention before it gets released to users? Do the users find value in running reliable software? In the past we have asserted that "yes, there is value in this", and have invested in tracking, investigating, and fixing these problems even if they happen infrequently. But that does require investment, and active maintenance.
Clark
-- Slawek Kaplonski Principal Software Engineer Red Hat
On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
Snip. I want to respond to a specific suggestion:
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
OpenStack has configured its check and gate queues with something we've called "clean check". This refers to the requirement that before an OpenStack project can be gated it must pass check tests first. This policy was instituted because a number of these infrequent but problematic issues were traced back to recheck spamming. Basically changes would show up and were broken. They would fail some percentage of the time. They got rechecked until they finally merged and now their failure rate is added to the whole. This rule was introduced to make it more difficult to get this flakyness into the gate.
Locking in test results is in direct opposition to the existing policy and goals. Locking results would make it far more trivial to land such flakyness as you wouldn't need entire sets of jobs to pass before you could land. Instead you could rerun individual jobs until each one passed and then land the result. Potentially introducing significant flakyness with a single merge.
Locking results is also not really something that fits well with the speculative gate queues that Zuul runs. Remember that Zuul constructs a future git state and tests that in parallel. Currently the state for OpenStack looks like:
A - Nova
^
B - Glance
^
C - Neutron
^
D - Neutron
^
F - Neutron
The B glance change is tested as if the A Nova change has already merged and so on down the queue. If we want to keep these speculative states we can't really have humans manually verify a failure can be ignored and retry it. Because we'd be enqueuing job builds at different stages of speculative state. Each job build would be testing a different version of the software.
What we could do is implement a retry limit for failing jobs. Zuul could rerun failing jobs X times before giving up and reporting failure (this would require updates to Zuul). The problem with this approach is without some oversight it becomes very easy to land changes that make things worse. As a side note Zuul does do retries, but only for detected network errors or when a pre-run playbook fails. The assumption is that network failures are due to the dangers of the Internet, and that pre-run playbooks are small, self contained, unlikely to fail, and when they do fail the failure should be independent of what is being tested.
Where does that leave us?
I think it is worth considering the original goals of "clean check". We know that rechecking/rerunning only makes these problems worse in the long term. They represent technical debt. One of the reasons we run these tests is to show us when our software is broken. In the case of flaky results we are exposing this technical debt where it impacts the functionality of our software. The longer we avoid fixing these issues the worse it gets, and this is true even with "clean check".

Hello,

A few thoughts from my side in scope of the brainstorm:

1) Recheck actual bugs ("recheck bug 123456")
- not a new idea to better keep track of all failures
- force a developer to investigate the reason for each CI failure and increase the corresponding bug rating, or file a new bug (or go and fix this bug finally!)
- I think we should have some gate failure bugs dashboard with the hottest bugs on top (maybe there is one that I'm not aware of) so everyone could go and check if their CI failure is known or new
- simple "recheck" could be forbidden, at least during a "crisis management" window

2) Allow rechecks of TIMEOUT/POST_FAILURE jobs
- while I agree that re-running particular jobs is evil, TIMEOUT/POST_FAILURE are not related to the patch in the majority of cases
- performance issues are usually caught by Rally jobs
- of course the core team should monitor if timeouts become a rule for some jobs

3) Ability to block rechecks in some cases, like a known gate blocker
- not everyone is always aware that the gates are blocked by some issue
- the PTL (or any core team member) could turn off rechecks during that time (with a message from Zuul)
- happens not often but still can save some CI resources

Thanks,
Oleg
---
Advanced Software Technology Lab
Huawei

-----Original Message-----
From: Slawek Kaplonski [mailto:skaplons@redhat.com]
Sent: Thursday, November 18, 2021 10:46 AM
To: Clark Boylan <cboylan@sapwetik.org>
Cc: openstack-discuss@lists.openstack.org
Subject: Re: [neutron][CI] How to reduce number of rechecks - brainstorming

Hi,

Thx Clark for the detailed explanation about that :)
I agree with You on that and I would really like to find better/other solution for the Neutron problem than rechecking only broken jobs as I'm pretty sure that this would make things much worst quickly.
Do we as developers find value in knowing the software needs attention before it gets released to users? Do the users find value in running reliable software? In the past we have asserted that "yes, there is value in this", and have invested in tracking, investigating, and fixing these problems even if they happen infrequently. But that does require investment, and active maintenance.
Clark
-- Slawek Kaplonski Principal Software Engineer Red Hat
Hi,

I am not sure what the current status of elastic is, but we should use elastic-recheck again, keep the bug definitions up-to-date and dedicate time to keep it alive.
From the zuul status page at least it seems it has fresh data: http://status.openstack.org/elastic-recheck/data/integrated_gate.html
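For context, elastic-recheck works by matching failure signatures (Lucene query strings kept as YAML files in the elastic-recheck repo) against CI logs indexed in Elasticsearch. Stripped to its essence, a signature check is roughly the sketch below; the endpoint URL and the example signature are placeholders, not the project's actual configuration:

# Illustrative only: count indexed log lines matching a failure signature.
# The exact host/path and the shape of hits.total depend on the deployment
# and Elasticsearch version.
import requests

ES_URL = "http://logstash.openstack.org/elasticsearch"  # assumed endpoint

signature = 'message:"Kernel panic - not syncing" AND tags:"console"'
resp = requests.post(
    f"{ES_URL}/logstash-*/_search",
    json={"query": {"query_string": {"query": signature}}, "size": 0},
    timeout=60,
)
print("matching failures indexed:", resp.json()["hits"]["total"])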
It could help reviewers to see feedback from elastic-recheck on whether the issue in a given patch is an already known bug.
https://docs.openstack.org/infra/elastic-recheck/readme.html

regards
Lajos Katona (lajoskatona)

Oleg Bondarev <oleg.bondarev@huawei.com> wrote on Mon, 29 Nov 2021 at 8:35:
Hello,
A few thoughts from my side in scope of brainstorm:
1) Recheck actual bugs (“recheck bug 123456”) - not a new idea to better keep track of all failures - force a developer to investigate the reason of each CI failure and increase corresponding bug rating, or file a new bug (or go and fix this bug finally!) - I think we should have some gate failure bugs dashboard with hottest bugs on top (maybe there is one that I’m not aware of) so everyone could go and check if his CI failure is known or new - simple “recheck” could be forbidden, at least during “crisis management” window
2) Allow recheck TIMEOUT/POST_FAILURE jobs - while I agree that re-run particular jobs is evil, TIMEOUT/POST_FAILURE are not related to the patch in majority of cases - performance issues are usually caught by Rally jobs - of course core team should monitor if timeouts become a rule for some jobs
3) Ability to block rechecks in some cases, like known gate blocker - not everyone is always aware that gates are blocked with some issue - PTL (or any core team member) can turn off rechecks during that time (with a message from Zuul) - happens not often but still can save some CI resources
Thanks, Oleg --- Advanced Software Technology Lab Huawei
-----Original Message----- From: Slawek Kaplonski [mailto:skaplons@redhat.com] Sent: Thursday, November 18, 2021 10:46 AM To: Clark Boylan <cboylan@sapwetik.org> Cc: openstack-discuss@lists.openstack.org Subject: Re: [neutron][CI] How to reduce number of rechecks - brainstorming
Hi,
Thx Clark for detailed explanation about that :)
On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
Snip. I want to respond to a specific suggestion:
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
OpenStack has configured its check and gate queues with something we've called "clean check". This refers to the requirement that before an OpenStack project can be gated it must pass check tests first. This policy was instituted because a number of these infrequent but problematic issues were traced back to recheck spamming. Basically changes would show up and were broken. They would fail some percentage of the time. They got rechecked until they finally merged and now their failure rate is added to the whole. This rule was introduced to make it more difficult to get this flakyness into the gate.
Locking in test results is in direct opposition to the existing policy and goals. Locking results would make it far more trivial to land such flakyness as you wouldn't need entire sets of jobs to pass before you could land. Instead you could rerun individual jobs until each one passed and then land the result. Potentially introducing significant flakyness with a single merge.
Locking results is also not really something that fits well with the speculative gate queues that Zuul runs. Remember that Zuul constructs a future git state and tests that in parallel. Currently the state for OpenStack looks like:
A - Nova
^
B - Glance
^
C - Neutron
^
D - Neutron
^
F - Neutron
The B glance change is tested as if the A Nova change has already merged and so on down the queue. If we want to keep these speculative states we can't really have humans manually verify a failure can be ignored and retry it. Because we'd be enqueuing job builds at different stages of speculative state. Each job build would be testing a different version of the software.
What we could do is implement a retry limit for failing jobs. Zuul could rerun failing jobs X times before giving up and reporting failure (this would require updates to Zuul). The problem with this approach is without some oversight it becomes very easy to land changes that make things worse. As a side note Zuul does do retries, but only for detected network errors or when a pre-run playbook fails. The assumption is that network failures are due to the dangers of the Internet, and that pre-run playbooks are small, self contained, unlikely to fail, and when they do fail the failure should be independent of what is being tested.
Where does that leave us?
I think it is worth considering the original goals of "clean check". We know that rechecking/rerunning only makes these problems worse in the long term. They represent technical debt. One of the reasons we run these tests is to show us when our software is broken. In the case of flaky results we are exposing this technical debt where it impacts the functionality of our software. The longer we avoid fixing these issues the worse it gets, and this is true even with "clean check".
I agree with You on that and I would really like to find better/other solution for the Neutron problem than rechecking only broken jobs as I'm pretty sure that this would make things much worst quickly.
Do we as developers find value in knowing the software needs attention before it gets released to users? Do the users find value in running reliable software? In the past we have asserted that "yes, there is value in this", and have invested in tracking, investigating, and fixing these problems even if they happen infrequently. But that does require investment, and active maintenance.
Clark
-- Slawek Kaplonski Principal Software Engineer Red Hat
On Wed, Nov 17 2021 at 07:51:57 AM -0800, Clark Boylan <cboylan@sapwetik.org> wrote:
On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
Snip. I want to respond to a specific suggestion:
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
OpenStack has configured its check and gate queues with something we've called "clean check". This refers to the requirement that before an OpenStack project can be gated it must pass check tests first. This policy was instituted because a number of these infrequent but problematic issues were traced back to recheck spamming. Basically changes would show up and were broken. They would fail some percentage of the time. They got rechecked until they finally merged and now their failure rate is added to the whole. This rule was introduced to make it more difficult to get this flakyness into the gate.
Locking in test results is in direct opposition to the existing policy and goals. Locking results would make it far more trivial to land such flakyness as you wouldn't need entire sets of jobs to pass before you could land. Instead you could rerun individual jobs until each one passed and then land the result. Potentially introducing significant flakyness with a single merge.
Locking results is also not really something that fits well with the speculative gate queues that Zuul runs. Remember that Zuul constructs a future git state and tests that in parallel. Currently the state for OpenStack looks like:
A - Nova
^
B - Glance
^
C - Neutron
^
D - Neutron
^
F - Neutron
The B glance change is tested as if the A Nova change has already merged and so on down the queue. If we want to keep these speculative states we can't really have humans manually verify a failure can be ignored and retry it. Because we'd be enqueuing job builds at different stages of speculative state. Each job build would be testing a different version of the software.
What we could do is implement a retry limit for failing jobs. Zuul could rerun failing jobs X times before giving up and reporting failure (this would require updates to Zuul). The problem with this approach is without some oversight it becomes very easy to land changes that make things worse. As a side note Zuul does do retries, but only for detected network errors or when a pre-run playbook fails. The assumption is that network failures are due to the dangers of the Internet, and that pre-run playbooks are small, self contained, unlikely to fail, and when they do fail the failure should be independent of what is being tested.
Where does that leave us?
I think it is worth considering the original goals of "clean check". We know that rechecking/rerunning only makes these problems worse in the long term. They represent technical debt. One of the reasons we run these tests is to show us when our software is broken. In the case of flaky results we are exposing this technical debt where it impacts the functionality of our software. The longer we avoid fixing these issues the worse it gets, and this is true even with "clean check".
Do we as developers find value in knowing the software needs attention before it gets released to users? Do the users find value in running reliable software? In the past we have asserted that "yes, there is value in this", and have invested in tracking, investigating, and fixing these problems even if they happen infrequently. But that does require investment, and active maintenance.
Thank you Clark! I agree with your view that the current setup provides us with very valuable information about the health of the software we are developing. I also agree that our primary goal should be to fix the flaky tests instead of hiding the results under any kind of rechecks.

Still, I'm wondering what we will do if it turns out that the existing developer bandwidth has shrunk to the point where we simply do not have the capacity to fix this technical debt. What the stable team does on stable branches in Extended Maintenance mode in a similar situation is to simply turn off the problematic test jobs. So I guess that is also a valid last-resort move.

Cheers,
gibi
Clark
On Thu, 2021-11-18 at 15:39 +0100, Balazs Gibizer wrote:
On Wed, Nov 17 2021 at 07:51:57 AM -0800, Clark Boylan <cboylan@sapwetik.org> wrote:
On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
Snip. I want to respond to a specific suggestion:
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
OpenStack has configured its check and gate queues with something we've called "clean check". This refers to the requirement that before an OpenStack project can be gated it must pass check tests first. This policy was instituted because a number of these infrequent but problematic issues were traced back to recheck spamming. Basically changes would show up and were broken. They would fail some percentage of the time. They got rechecked until they finally merged and now their failure rate is added to the whole. This rule was introduced to make it more difficult to get this flakyness into the gate.
Locking in test results is in direct opposition to the existing policy and goals. Locking results would make it far more trivial to land such flakyness as you wouldn't need entire sets of jobs to pass before you could land. Instead you could rerun individual jobs until each one passed and then land the result. Potentially introducing significant flakyness with a single merge.
Locking results is also not really something that fits well with the speculative gate queues that Zuul runs. Remember that Zuul constructs a future git state and tests that in parallel. Currently the state for OpenStack looks like:
A - Nova
^
B - Glance
^
C - Neutron
^
D - Neutron
^
F - Neutron
The B glance change is tested as if the A Nova change has already merged and so on down the queue. If we want to keep these speculative states we can't really have humans manually verify a failure can be ignored and retry it. Because we'd be enqueuing job builds at different stages of speculative state. Each job build would be testing a different version of the software.
What we could do is implement a retry limit for failing jobs. Zuul could rerun failing jobs X times before giving up and reporting failure (this would require updates to Zuul). The problem with this approach is without some oversight it becomes very easy to land changes that make things worse. As a side note Zuul does do retries, but only for detected network errors or when a pre-run playbook fails. The assumption is that network failures are due to the dangers of the Internet, and that pre-run playbooks are small, self contained, unlikely to fail, and when they do fail the failure should be independent of what is being tested.
Where does that leave us?
I think it is worth considering the original goals of "clean check". We know that rechecking/rerunning only makes these problems worse in the long term. They represent technical debt. One of the reasons we run these tests is to show us when our software is broken. In the case of flaky results we are exposing this technical debt where it impacts the functionality of our software. The longer we avoid fixing these issues the worse it gets, and this is true even with "clean check".
Do we as developers find value in knowing the software needs attention before it gets released to users? Do the users find value in running reliable software? In the past we have asserted that "yes, there is value in this", and have invested in tracking, investigating, and fixing these problems even if they happen infrequently. But that does require investment, and active maintenance.
Thank you Clark! I agree with your view that the current setup provides us with very valuable information about the health of the software we are developing. I also agree that our primary goal should be to fix the flaky tests instead of hiding the results under any kind of rechecks.
Still I'm wondering what we will do if it turns out that the existing developer bandwidth shrunk to the point where we simply not have the capacity for fix these technical debts. What the stable team does on stable branches in Extended Maintenance mode in a similar situation is to simply turn off problematic test jobs. So I guess that is also a valid last resort move.
One option is to "trust" the core team more and grant them the explicit right to workflow +2 and force-merge a patch. Trust is in quotes because it's not really about trusting that the core teams can restrain themselves from blindly merging broken code, but more a case of: right now we entrust Zuul to be the final gatekeeper of our repos.

When there is a known broken gate failure, and we are trying to land a specific patch to, say, nova to fix or unblock the neutron gate, and we can see that the neutron DNM patch that depends on this nova fix passed, then we could entrust the core team in this specific case to override Zuul.

I would expect this capability to be used very sparingly, but we do have some intermittent failures that we can tell are unrelated to the patch, like the current issue with volume attach/detach that results in kernel panics in the guest. If that is the only failure and all other tests passed in the gate, I think it would be reasonable for the neutron team to approve a neutron patch that modifies security groups, for example. It's very clearly an unrelated failure.

That might be an alternative to the recheck we have now, and by reserving it for the core team it limits the scope for abusing this. I do think that the original goals of green check are good, so really I would be suggesting this as an option for when check passed and we get an intermittent failure in gate that we would override. This would not address the issue in check, but it would make intermittent failures in gate much less painful.
Cheers, gibi
Clark
On Thu, Nov 18, 2021, at 7:19 AM, Sean Mooney wrote:
On Thu, 2021-11-18 at 15:39 +0100, Balazs Gibizer wrote:
On Wed, Nov 17 2021 at 07:51:57 AM -0800, Clark Boylan <cboylan@sapwetik.org> wrote:
On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
Snip. I want to respond to a specific suggestion:
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
OpenStack has configured its check and gate queues with something we've called "clean check". This refers to the requirement that before an OpenStack project can be gated it must pass check tests first. This policy was instituted because a number of these infrequent but problematic issues were traced back to recheck spamming. Basically changes would show up and were broken. They would fail some percentage of the time. They got rechecked until they finally merged and now their failure rate is added to the whole. This rule was introduced to make it more difficult to get this flakyness into the gate.
Locking in test results is in direct opposition to the existing policy and goals. Locking results would make it far more trivial to land such flakyness as you wouldn't need entire sets of jobs to pass before you could land. Instead you could rerun individual jobs until each one passed and then land the result. Potentially introducing significant flakyness with a single merge.
Locking results is also not really something that fits well with the speculative gate queues that Zuul runs. Remember that Zuul constructs a future git state and tests that in parallel. Currently the state for OpenStack looks like:
A - Nova
^
B - Glance
^
C - Neutron
^
D - Neutron
^
F - Neutron
The B glance change is tested as if the A Nova change has already merged and so on down the queue. If we want to keep these speculative states we can't really have humans manually verify a failure can be ignored and retry it. Because we'd be enqueuing job builds at different stages of speculative state. Each job build would be testing a different version of the software.
What we could do is implement a retry limit for failing jobs. Zuul could rerun failing jobs X times before giving up and reporting failure (this would require updates to Zuul). The problem with this approach is without some oversight it becomes very easy to land changes that make things worse. As a side note Zuul does do retries, but only for detected network errors or when a pre-run playbook fails. The assumption is that network failures are due to the dangers of the Internet, and that pre-run playbooks are small, self contained, unlikely to fail, and when they do fail the failure should be independent of what is being tested.
Where does that leave us?
I think it is worth considering the original goals of "clean check". We know that rechecking/rerunning only makes these problems worse in the long term. They represent technical debt. One of the reasons we run these tests is to show us when our software is broken. In the case of flaky results we are exposing this technical debt where it impacts the functionality of our software. The longer we avoid fixing these issues the worse it gets, and this is true even with "clean check".
Do we as developers find value in knowing the software needs attention before it gets released to users? Do the users find value in running reliable software? In the past we have asserted that "yes, there is value in this", and have invested in tracking, investigating, and fixing these problems even if they happen infrequently. But that does require investment, and active maintenance.
Thank you Clark! I agree with your view that the current setup provides us with very valuable information about the health of the software we are developing. I also agree that our primary goal should be to fix the flaky tests instead of hiding the results under any kind of rechecks.
Still I'm wondering what we will do if it turns out that the existing developer bandwidth shrunk to the point where we simply not have the capacity for fix these technical debts. What the stable team does on stable branches in Extended Maintenance mode in a similar situation is to simply turn off problematic test jobs. So I guess that is also a valid last resort move.
one option is to "trust" the core team more and grant them explict rigth to workflow +2 and force merge a patch.
trust is in quotes because its not really about trusting that the core teams can restrain themselve form blindly merging broken code but more a case of right now we entrust zuul to be the final gate keeper of our repo.
When there are known broken gate failure and we are trying to land specific patch to say nova to fix or unblock the nuetron gate and we can see the neutron DNM patch that depens on this nova fix passsed then we could entrust the core team in this specific case to override zuul.
We do already give you this option via the removal of tests that are invalid/flaky/not useful. I do worry that if we give a complete end around the CI system it will be quickly abused. We stopped requiring a bug on rechecks because we quickly realized that no one was actually debugging the failure and identifying the underlying issue. Instead they would just recheck with an arbitrary or completely wrong bug identified. I expect similar would happen here. And the end result would be that CI would simply get more flaky and unreliable for the next change. If instead we fix or remove the flaky tests/jobs we'll end up with a system that is more reliable for the next change.
i would expect this capablity to be used very spareinly but we do have some intermitent failures that happen that we can tell are unrelated to the patch like the curernt issue with volumne attach/detach that result in kernel panics in the guest. if that is the only failure and all other test passed in gate i think it woudl be reasonable for a the neutron team to approve a neutron patch that modifies security groups for example. its very clearly an unrealted failure.
As noted above, it would also be reasonable to stop running tests that cannot function. We do need to be careful that we don't remove tests and never fix the underlying issues though. We should also remember that if we have these problems in CI there is a high chance that our users will have these problems in production later (we've helped more than one of the infra donor clouds identify bugs straight out of elastic-recheck information in the past so this does happen).
that might be an alternivie to the recheck we have now and by resreving that for the core team it limits the scope for abusing this.
i do think that the orginal goes of green check are good so really i would be suggesting this as an option for when check passed and we get an intermient failure in gate that we woudl override.
this would not adress the issue in check but it would make itermitent failure in gate much less painful.
I tried to make this point in my previous email, but I think we are still fumbling around it. If we provide mechanisms to end around flaky CI instead of fixing flaky CI the end result will be flakier CI. I'm not convinced that we'll be happier with any mechanism that doesn't remove the -1 from happening in the first place. Instead the problems will accelerate and eventually we'll be unable to rely on CI for anything useful.
We do already give you this option via the removal of tests that are invalid/flaky/not useful. I do worry that if we give a complete end around the CI system it will be quickly abused.
Absolutely agree, humans are not good at making these decisions. Despite "trust" in the core team, and even using a less-loaded word than "abuse," I really don't think that even allowing the option to override flaky tests by force merge is the right solution (at all).
I tried to make this point in my previous email, but I think we are still fumbling around it. If we provide mechanisms to end around flaky CI instead of fixing flaky CI the end result will be flakier CI. I'm not convinced that we'll be happier with any mechanism that doesn't remove the -1 from happening in the first place. Instead the problems will accelerate and eventually we'll be unable to rely on CI for anything useful.
Agreed. Either the tests are useful or they aren't. Even if they're not very reliable, they might be useful in causing pain because they continue to highlight flaky behavior until it gets fixed. --Dan
On 2021-11-18 08:15:06 -0800 (-0800), Dan Smith wrote: [...]
Absolutely agree, humans are not good at making these decisions. Despite "trust" in the core team, and even using a less-loaded word than "abuse," I really don't think that even allowing the option to override flaky tests by force merge is the right solution (at all). [...]
Just about any time we Gerrit admins have decided to bypass testing to merge some change (and to be clear, we really don't like to if we can avoid it), we introduce a new test-breaking bug we then need to troubleshoot and fix. It's a humbling reminder that even though you may feel absolutely sure something's safe to merge without passing tests, you're probably wrong. -- Jeremy Stanley
---- On Thu, 18 Nov 2021 10:24:31 -0600 Jeremy Stanley <fungi@yuggoth.org> wrote ----
On 2021-11-18 08:15:06 -0800 (-0800), Dan Smith wrote: [...]
Absolutely agree, humans are not good at making these decisions. Despite "trust" in the core team, and even using a less-loaded word than "abuse," I really don't think that even allowing the option to override flaky tests by force merge is the right solution (at all). [...]
Just about any time we Gerrit admins have decided to bypass testing to merge some change (and to be clear, we really don't like to if we can avoid it), we introduce a new test-breaking bug we then need to troubleshoot and fix. It's a humbling reminder that even though you may feel absolutely sure something's safe to merge without passing tests, you're probably wrong.
Indeed. I agree here too; it can lead to the situation of "hey, my patch was all good, can you just +W this", which can end up in more unstable tests/code.

-gmann
-- Jeremy Stanley
On Thu, 2021-11-18 at 16:24 +0000, Jeremy Stanley wrote:
On 2021-11-18 08:15:06 -0800 (-0800), Dan Smith wrote: [...]
Absolutely agree, humans are not good at making these decisions. Despite "trust" in the core team, and even using a less-loaded word than "abuse," I really don't think that even allowing the option to override flaky tests by force merge is the right solution (at all). [...]
Just about any time we Gerrit admins have decided to bypass testing to merge some change (and to be clear, we really don't like to if we can avoid it), we introduce a new test-breaking bug we then need to troubleshoot and fix. It's a humbling reminder that even though you may feel absolutely sure something's safe to merge without passing tests, you're probably wrong.

Well, the example I gave is a failure in the interaction between nova and cinder failing in the neutron gate. There is no way the neutron patch under review could cause that failure to happen, and I chose a specific example of the intermittent compute volume detach failures where it looks like the bug is actually in the tempest test. https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/z...

It appears that for some reason attaching a cinder volume and live migrating the VM while the kernel/OS in the VM is still booting up can result in a kernel panic. This has been an ongoing battle to solve for many weeks. There is no way that a neutron or glance or keystone patch could have caused the guest kernel to crash. https://bugs.launchpad.net/nova/+bug/1950310 and https://bugs.launchpad.net/nova/+bug/1939108 are two of the related bugs. If they are running tempest.api.compute.admin.test_live_migration.LiveMigrationTest* in any of their jobs, however, they could have been impacted by this.

Lee Yarwood has started implementing a very old tempest spec https://specs.openstack.org/openstack/qa-specs/specs/tempest/implemented/ssh... for this and we think that will fix the test failure: https://review.opendev.org/c/openstack/tempest/+/817772/2

I suspect we have many other cases in tempest today where we have intermittent failures caused by the guest OS not being ready before we do operations on the guest, beyond the current volume attach/detach issues.

I did not suggest allowing the CI to be overridden because I think that is generally a good idea - it's not - but sometimes there are failures that we are actively trying to fix but have not found a solution for for months. I'm pretty sure this live migration test prevented patches to the ironic virt driver from landing not so long ago, requiring several retries. The ironic virt driver obviously does not support live migration and the change was not touching any other part of nova, so the failure was unrelated. https://review.opendev.org/c/openstack/nova/+/799327 is the change I was thinking of; the master version needed 3 rechecks and the backport needed 6 more: https://review.opendev.org/c/openstack/nova/+/799772

That may have actually been caused by https://bugs.launchpad.net/nova/+bug/1931702, which is another bug for a similar kernel panic, but I would not be surprised if it was actually the same root cause.

I think that point was lost in my original message. The point I was trying to make is that sometimes the failure is not about the code under review, it's because the test is wrong. We should fix the test, but it can be very frustrating if you recheck something 3-4 times where it passes in check and fails in gate for something you know is unrelated, and you don't want to disable the test because you don't want to lose coverage for something that typically fails a low percentage of the time.

regards
sean
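The ssh-ability validation Sean refers to boils down to "don't touch the guest until it is actually up". A generic sketch of that idea (not the tempest implementation) is simply polling the instance's SSH port before attach/migrate operations:

# Sketch: wait until the guest answers on its SSH port (and sends an
# SSH banner) before proceeding with volume attach or live migration.
import socket
import time

def wait_for_ssh(host: str, port: int = 22, timeout: float = 300.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5) as sock:
                banner = sock.recv(64)      # SSH servers greet first
                if banner.startswith(b"SSH-"):
                    return True
        except OSError:
            pass                            # not up yet, keep polling
        time.sleep(5)
    return False

# e.g. only live-migrate once wait_for_ssh(server_floating_ip) returns True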
On Thu, Nov 18, 2021, at 6:39 AM, Balazs Gibizer wrote:
On Wed, Nov 17 2021 at 07:51:57 AM -0800, Clark Boylan <cboylan@sapwetik.org> wrote:
On Wed, Nov 17, 2021, at 2:18 AM, Balazs Gibizer wrote:
Snip. I want to respond to a specific suggestion:
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
OpenStack has configured its check and gate queues with something we've called "clean check". This refers to the requirement that before an OpenStack project can be gated it must pass check tests first. This policy was instituted because a number of these infrequent but problematic issues were traced back to recheck spamming. Basically changes would show up and were broken. They would fail some percentage of the time. They got rechecked until they finally merged and now their failure rate is added to the whole. This rule was introduced to make it more difficult to get this flakyness into the gate.
Locking in test results is in direct opposition to the existing policy and goals. Locking results would make it far more trivial to land such flakyness as you wouldn't need entire sets of jobs to pass before you could land. Instead you could rerun individual jobs until each one passed and then land the result. Potentially introducing significant flakyness with a single merge.
Locking results is also not really something that fits well with the speculative gate queues that Zuul runs. Remember that Zuul constructs a future git state and tests that in parallel. Currently the state for OpenStack looks like:
A - Nova
^
B - Glance
^
C - Neutron
^
D - Neutron
^
F - Neutron
The B glance change is tested as if the A Nova change has already merged and so on down the queue. If we want to keep these speculative states we can't really have humans manually verify a failure can be ignored and retry it. Because we'd be enqueuing job builds at different stages of speculative state. Each job build would be testing a different version of the software.
What we could do is implement a retry limit for failing jobs. Zuul could rerun failing jobs X times before giving up and reporting failure (this would require updates to Zuul). The problem with this approach is without some oversight it becomes very easy to land changes that make things worse. As a side note Zuul does do retries, but only for detected network errors or when a pre-run playbook fails. The assumption is that network failures are due to the dangers of the Internet, and that pre-run playbooks are small, self contained, unlikely to fail, and when they do fail the failure should be independent of what is being tested.
Where does that leave us?
I think it is worth considering the original goals of "clean check". We know that rechecking/rerunning only makes these problems worse in the long term. They represent technical debt. One of the reasons we run these tests is to show us when our software is broken. In the case of flaky results we are exposing this technical debt where it impacts the functionality of our software. The longer we avoid fixing these issues the worse it gets, and this is true even with "clean check".
Do we as developers find value in knowing the software needs attention before it gets released to users? Do the users find value in running reliable software? In the past we have asserted that "yes, there is value in this", and have invested in tracking, investigating, and fixing these problems even if they happen infrequently. But that does require investment, and active maintenance.
Thank you Clark! I agree with your view that the current setup provides us with very valuable information about the health of the software we are developing. I also agree that our primary goal should be to fix the flaky tests instead of hiding the results under any kind of rechecks.
Still I'm wondering what we will do if it turns out that the existing developer bandwidth shrunk to the point where we simply not have the capacity for fix these technical debts. What the stable team does on stable branches in Extended Maintenance mode in a similar situation is to simply turn off problematic test jobs. So I guess that is also a valid last resort move.
Absolutely reduce scope if necessary. We run a huge assortment of jobs because we've added support for the kitchen sink to OpenStack. If we can't continue to reliably test those features then it should be completely valid to remove testing and probably deprecate and remove the features as well. Historically we've done this for things like postgresql support so this isn't a new problem.
Cheers, gibi
Hi,

On Wednesday, 17 November 2021 11:18:03 CET Balazs Gibizer wrote:
On Wed, Nov 17 2021 at 09:13:34 AM +0100, Slawek Kaplonski
<skaplons@redhat.com> wrote:
Hi,
Recently I spent some time to check how many rechecks we need in Neutron to get patch merged and I compared it to some other OpenStack projects (see [1] for details). TL;DR - results aren't good for us and I think we really need to do something with that.
I really like the idea of collecting such stats. Thank you for doing it. I can even imagine to make a public dashboard somewhere with this information as it is a good indication about the health of our projects / testing.
Thx. So far it's just a simple script which I run from my terminal to get that data. Nothing else. If you want to use it, it's here: https://github.com/slawqo/tools/tree/master/rechecks
Of course "easiest" thing to say is that we should fix issues which we are hitting in the CI to make jobs more stable. But it's not that easy. We are struggling with those jobs for very long time. We have CI related meeting every week and we are fixing what we can there. Unfortunately there is still bunch of issues which we can't fix so far because they are intermittent and hard to reproduce locally or in some cases the issues aren't realy related to the Neutron or there are new bugs which we need to investigate and fix :)
I have couple of suggestion based on my experience working with CI in nova.
1) we try to open bug reports for intermittent gate failures too and keep them tagged in a list [1] so when a job fail it is easy to check if the bug is known.
Thx. We are more or less trying to do that, but TBH I think that in many cases we haven't opened LPs for such issues. I added it to the list of ideas :)
2) I offer my help here now that if you see something in neutron runs that feels non neutron specific then ping me with it. Maybe we are struggling with the same problem too.
Thanks a lot. I will for sure ping you when I see something like that.
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
Yes, I remember that discussion and I totally understand the pros and cons of such a solution, but I added it to the list as well.
Cheers, gibi
So this is never ending battle for us. The problem is that we have to test various backends, drivers, etc. so as a result we have many jobs running on each patch - excluding UT, pep8 and docs jobs we have around 19 jobs in check and 14 jobs in gate queue.
In the past we made a lot of improvements, like e.g. we improved irrelevant files lists for jobs to run less jobs on some of the patches, together with QA team we did "integrated-networking" template to run only Neutron and Nova related scenario tests in the Neutron queues, we removed and consolidated some of the jobs (there is still one patch in progress for that but it should just remove around 2 jobs from the check queue). All of that are good improvements but still not enough to make our CI really stable :/
Because of all of that, I would like to ask community about any other ideas how we can improve that. If You have any ideas, please send it in this email thread or reach out to me directly on irc. We want to discuss about them in the next video CI meeting which will be on November 30th. If You would have any idea and would like to join that discussion, You are more than welcome in that meeting of course :)
[1] http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html
[1] https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure&orderby=-date_last_updated&start=0
-- Slawek Kaplonski Principal Software Engineer Red Hat
-- Slawek Kaplonski Principal Software Engineer Red Hat
---- On Thu, 18 Nov 2021 01:42:22 -0600 Slawek Kaplonski <skaplons@redhat.com> wrote ----
Hi,
On Wednesday, 17 November 2021 11:18:03 CET Balazs Gibizer wrote:
On Wed, Nov 17 2021 at 09:13:34 AM +0100, Slawek Kaplonski
<skaplons@redhat.com> wrote:
Hi,
Recently I spent some time to check how many rechecks we need in Neutron to get patch merged and I compared it to some other OpenStack projects (see [1] for details). TL;DR - results aren't good for us and I think we really need to do something with that.
I really like the idea of collecting such stats. Thank you for doing it. I can even imagine to make a public dashboard somewhere with this information as it is a good indication about the health of our projects / testing.
Thx. So far it's just a simple script which I run from my terminal to get that data. Nothing else. If You want to use it, it's here: https://github.com/slawqo/tools/tree/master/rechecks
Of course "easiest" thing to say is that we should fix issues which we are hitting in the CI to make jobs more stable. But it's not that easy. We are struggling with those jobs for very long time. We have CI related meeting every week and we are fixing what we can there. Unfortunately there is still bunch of issues which we can't fix so far because they are intermittent and hard to reproduce locally or in some cases the issues aren't realy related to the Neutron or there are new bugs which we need to investigate and fix :)
I have couple of suggestion based on my experience working with CI in nova.
1) we try to open bug reports for intermittent gate failures too and keep them tagged in a list [1] so when a job fail it is easy to check if the bug is known.
Thx. We are trying more or less to do that, but TBH I think that in many cases we didn't open LPs for such issues. I added it to the list of ideas :)
+1, I think opening bugs is the best way to get the project notified and also to track the issue. I like Slawek's script for collecting the rechecks per project; that is something we can use in the TC to track gate health in the weekly meeting and see which projects are having more rechecks. A recheck does not necessarily mean that the project itself has the issue, but at least we will encourage members to open bugs on the corresponding projects. -gmann
2) I offer my help here now that if you see something in neutron runs that feels non neutron specific then ping me with it. Maybe we are struggling with the same problem too.
Thanks a lot. I will for sure ping You when I see something like that.
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
yes, I remember that discussion and I totally understand pros and cons of such solution, but I added it to the list as well.
Cheers, gibi
So this is never ending battle for us. The problem is that we have to test various backends, drivers, etc. so as a result we have many jobs running on each patch - excluding UT, pep8 and docs jobs we have around 19 jobs in check and 14 jobs in gate queue.
In the past we made a lot of improvements, like e.g. we improved irrelevant files lists for jobs to run less jobs on some of the patches, together with QA team we did "integrated-networking" template to run only Neutron and Nova related scenario tests in the Neutron queues, we removed and consolidated some of the jobs (there is still one patch in progress for that but it should just remove around 2 jobs from the check queue). All of that are good improvements but still not enough to make our CI really stable :/
Because of all of that, I would like to ask community about any other ideas how we can improve that. If You have any ideas, please send it in this email thread or reach out to me directly on irc. We want to discuss about them in the next video CI meeting which will be on November 30th. If You would have any idea and would like to join that discussion, You are more than welcome in that meeting of course :)
[1] http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html
[1] https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure&orderby=-date_last_updated&start=0
-- Slawek Kaplonski Principal Software Engineer Red Hat
-- Slawek Kaplonski Principal Software Engineer Red Hat
On Wed, Nov 17, 2021 at 5:22 AM Balazs Gibizer <balazs.gibizer@est.tech> wrote:
On Wed, Nov 17 2021 at 09:13:34 AM +0100, Slawek Kaplonski <skaplons@redhat.com> wrote:
Hi,
Recently I spent some time to check how many rechecks we need in Neutron to get patch merged and I compared it to some other OpenStack projects (see [1] for details). TL;DR - results aren't good for us and I think we really need to do something with that.
I really like the idea of collecting such stats. Thank you for doing it. I can even imagine to make a public dashboard somewhere with this information as it is a good indication about the health of our projects / testing.
Of course "easiest" thing to say is that we should fix issues which we are hitting in the CI to make jobs more stable. But it's not that easy. We are struggling with those jobs for very long time. We have CI related meeting every week and we are fixing what we can there. Unfortunately there is still bunch of issues which we can't fix so far because they are intermittent and hard to reproduce locally or in some cases the issues aren't realy related to the Neutron or there are new bugs which we need to investigate and fix :)
I have couple of suggestion based on my experience working with CI in nova.
We've struggled with unstable tests in TripleO as well. Here are some things we tried and implemented:
1. Created job dependencies so we only ran check tests once we knew we had the resources we needed (example we had pulled containers successfully)
2. Moved some testing to third party where we have easier control of the environment (note that third party cannot stop a change merging)
3. Used dependency pipelines to pre-qualify some dependencies ahead of letting them run wild on our check jobs
4. Requested testproject runs of changes in a less busy environment before running a full set of tests in a public zuul
5. Used a skiplist to keep track of tech debt and skip known failures that we could temporarily ignore to keep CI moving along if we're waiting on an external fix (a minimal sketch of consuming such a list follows after the quoted message below).
1) we try to open bug reports for intermittent gate failures too and keep them tagged in a list [1] so when a job fail it is easy to check if the bug is known.
2) I offer my help here now that if you see something in neutron runs that feels non neutron specific then ping me with it. Maybe we are struggling with the same problem too.
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
Cheers, gibi
So this is never ending battle for us. The problem is that we have to test various backends, drivers, etc. so as a result we have many jobs running on each patch - excluding UT, pep8 and docs jobs we have around 19 jobs in check and 14 jobs in gate queue.
In the past we made a lot of improvements, like e.g. we improved irrelevant files lists for jobs to run less jobs on some of the patches, together with QA team we did "integrated-networking" template to run only Neutron and Nova related scenario tests in the Neutron queues, we removed and consolidated some of the jobs (there is still one patch in progress for that but it should just remove around 2 jobs from the check queue). All of that are good improvements but still not enough to make our CI really stable :/
Because of all of that, I would like to ask community about any other ideas how we can improve that. If You have any ideas, please send it in this email thread or reach out to me directly on irc. We want to discuss about them in the next video CI meeting which will be on November 30th. If You would have any idea and would like to join that discussion, You are more than welcome in that meeting of course :)
[1] http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html
[1]
https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure&orderby=-date_last_updated&start=0
-- Slawek Kaplonski Principal Software Engineer Red Hat
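Regarding the skiplist idea (item 5 in the list above), one lightweight shape for it is a plain text file of known-flaky test IDs, each annotated with its tracking bug, rendered into an exclusion regex for the test runner. The following is a hedged sketch, not the TripleO implementation; the file format and helper names are assumptions:

# Hypothetical skip-list handling: read known flaky tests (each annotated with
# the bug that tracks it) and emit a single exclusion regex for the test runner
# (stestr/tempest expose --exclude-regex or --black-regex depending on version).
import re
import sys


def load_skiplist(path):
    """Parse lines like 'test_id  # https://bugs.launchpad.net/...' (assumed format)."""
    tests = []
    with open(path) as fh:
        for line in fh:
            test_id = line.split("#", 1)[0].strip()
            if test_id:
                tests.append(test_id)
    return tests


def build_exclude_regex(tests):
    # Escape dots etc. so test IDs are matched literally, then OR them together.
    return "|".join(re.escape(test_id) for test_id in tests)


if __name__ == "__main__":
    print(build_exclude_regex(load_skiplist(sys.argv[1])))

Keeping the bug link next to each entry means skipped tests stay visible and accountable instead of silently dropping out of coverage.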
Hello:

I think the last idea Ronelle presented (a skiplist) could be feasible in Neutron. Of course, this list could grow indefinitely, but we can always keep an eye on it.

There could be another issue with the Neutron tempest tests when using the "advanced" image. Despite the recent improvements, we are frequently having problems with the RAM size of the testing VMs. We would like to have 20% more RAM, if possible. I wish we had the ability to pre-run some checks on specific HW (tempest plugin or grenade tests).

Slawek commented on the number of different backends we need to support and test. However, I think we can remove the Linux Bridge tempest plugin from the "gate" list (it is already tested in the "check" list). Tempest plugin tests are expensive in time and prone to errors. This task falls on the shoulders of the Neutron team.

We can also identify those long running tests that usually fail (those that take more than 1000 seconds). A test that takes around 15 mins to run will probably fail. We need to find those tests, investigate the slowest parts of those tests and try to improve/optimize/remove them (a rough sketch of flagging such tests follows after the quoted message below).

Thank you all for your comments and proposals. That will help a lot to improve the Neutron CI stability.

Regards.

On Fri, Nov 19, 2021 at 12:53 AM Ronelle Landy <rlandy@redhat.com> wrote:
On Wed, Nov 17, 2021 at 5:22 AM Balazs Gibizer <balazs.gibizer@est.tech> wrote:
On Wed, Nov 17 2021 at 09:13:34 AM +0100, Slawek Kaplonski <skaplons@redhat.com> wrote:
Hi,
Recently I spent some time to check how many rechecks we need in Neutron to get patch merged and I compared it to some other OpenStack projects (see [1] for details). TL;DR - results aren't good for us and I think we really need to do something with that.
I really like the idea of collecting such stats. Thank you for doing it. I can even imagine to make a public dashboard somewhere with this information as it is a good indication about the health of our projects / testing.
Of course "easiest" thing to say is that we should fix issues which we are hitting in the CI to make jobs more stable. But it's not that easy. We are struggling with those jobs for very long time. We have CI related meeting every week and we are fixing what we can there. Unfortunately there is still bunch of issues which we can't fix so far because they are intermittent and hard to reproduce locally or in some cases the issues aren't realy related to the Neutron or there are new bugs which we need to investigate and fix :)
I have couple of suggestion based on my experience working with CI in nova.
We've struggled with unstable tests in TripleO as well. Here are some things we tried and implemented:
1. Created job dependencies so we only ran check tests once we knew we had the resources we needed (example we had pulled containers successfully)
2. Moved some testing to third party where we have easier control of the environment (note that third party cannot stop a change merging)
3. Used dependency pipelines to pre-qualify some dependencies ahead of letting them run wild on our check jobs
4. Requested testproject runs of changes in a less busy environment before running a full set of tests in a public zuul
5. Used a skiplist to keep track of tech debt and skip known failures that we could temporarily ignore to keep CI moving along if we're waiting on an external fix.
1) we try to open bug reports for intermittent gate failures too and keep them tagged in a list [1] so when a job fail it is easy to check if the bug is known.
2) I offer my help here now that if you see something in neutron runs that feels non neutron specific then ping me with it. Maybe we are struggling with the same problem too.
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.
Cheers, gibi
So this is never ending battle for us. The problem is that we have to test various backends, drivers, etc. so as a result we have many jobs running on each patch - excluding UT, pep8 and docs jobs we have around 19 jobs in check and 14 jobs in gate queue.
In the past we made a lot of improvements, like e.g. we improved irrelevant files lists for jobs to run less jobs on some of the patches, together with QA team we did "integrated-networking" template to run only Neutron and Nova related scenario tests in the Neutron queues, we removed and consolidated some of the jobs (there is still one patch in progress for that but it should just remove around 2 jobs from the check queue). All of that are good improvements but still not enough to make our CI really stable :/
Because of all of that, I would like to ask community about any other ideas how we can improve that. If You have any ideas, please send it in this email thread or reach out to me directly on irc. We want to discuss about them in the next video CI meeting which will be on November 30th. If You would have any idea and would like to join that discussion, You are more than welcome in that meeting of course :)
[1] http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html
[1]
https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure&orderby=-date_last_updated&start=0
-- Slawek Kaplonski Principal Software Engineer Red Hat
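On Rodolfo's point above about long running tests (more than 1000 seconds), one quick way to surface them is to scan per-test durations from a job's results. Here is a sketch assuming a JUnit-style XML export is available (for example converted from the job's subunit stream); the threshold and file path are placeholders:

# Sketch: flag tests whose runtime exceeds a threshold, reading per-test
# durations from a JUnit-style XML results file (assumed to be available).
# The threshold matches the 1000 second figure mentioned above.
import sys
import xml.etree.ElementTree as ET

THRESHOLD_SECONDS = 1000.0


def slow_tests(results_xml):
    root = ET.parse(results_xml).getroot()
    for case in root.iter("testcase"):
        duration = float(case.get("time", 0) or 0)
        if duration > THRESHOLD_SECONDS:
            yield "%s.%s" % (case.get("classname", ""), case.get("name", "")), duration


if __name__ == "__main__":
    for test_id, duration in sorted(slow_tests(sys.argv[1]), key=lambda item: -item[1]):
        print("%8.1fs  %s" % (duration, test_id))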
Hello,

A few thoughts from my side in scope of the brainstorm:

1) Recheck actual bugs ("recheck bug 123456")
- not a new idea to better keep track of all failures
- force a developer to investigate the reason of each CI failure and increase the corresponding bug rating, or file a new bug (or go and fix this bug finally!)
- this implies having some gate failure bugs dashboard with the hottest bugs on top
- simple "recheck" could be forbidden, at least during a "crisis management" window
- (a small parsing sketch for such comments follows after the quoted message below)

2) Allow recheck of TIMEOUT/POST_FAILURE jobs
- while I agree that re-running particular jobs is evil, TIMEOUT/POST_FAILURE are not related to the patch in the majority of cases
- performance issues are usually caught by Rally jobs
- of course the core team should monitor if timeouts become a rule for some jobs

3) Ability to block rechecks in some cases, like a known gate blocker
- not everyone is always aware that gates are blocked with some issue
- PTL (or any core team member) can turn off rechecks during that time (with a message from Zuul)
- happens not often but still can save some CI resources

Thanks,
Oleg
---
Advanced Software Technology Lab
Huawei

From: Rodolfo Alonso Hernandez [mailto:ralonsoh@redhat.com]
Sent: Monday, November 22, 2021 11:54 AM
To: Ronelle Landy <rlandy@redhat.com>
Cc: Balazs Gibizer <balazs.gibizer@est.tech>; Slawek Kaplonski <skaplons@redhat.com>; openstack-discuss <openstack-discuss@lists.openstack.org>; Oleg Bondarev <oleg.bondarev@huawei.com>; lajos.katona@ericsson.com; Bernard Cafarelli <bcafarel@redhat.com>; Miguel Lavalle <miguel@mlavalle.com>
Subject: Re: [neutron][CI] How to reduce number of rechecks - brainstorming

Hello:

I think the last idea Ronelle presented (a skiplist) could be feasible in Neutron. Of course, this list could grow indefinitely, but we can always keep an eye on it.

There could be another issue with the Neutron tempest tests when using the "advanced" image. Despite the recent improvements, we are frequently having problems with the RAM size of the testing VMs. We would like to have 20% more RAM, if possible. I wish we had the ability to pre-run some checks on specific HW (tempest plugin or grenade tests).

Slawek commented on the number of different backends we need to support and test. However, I think we can remove the Linux Bridge tempest plugin from the "gate" list (it is already tested in the "check" list). Tempest plugin tests are expensive in time and prone to errors. This task falls on the shoulders of the Neutron team.

We can also identify those long running tests that usually fail (those that take more than 1000 seconds). A test that takes around 15 mins to run will probably fail. We need to find those tests, investigate the slowest parts of those tests and try to improve/optimize/remove them.

Thank you all for your comments and proposals. That will help a lot to improve the Neutron CI stability.

Regards.

On Fri, Nov 19, 2021 at 12:53 AM Ronelle Landy <rlandy@redhat.com> wrote:
On Wed, Nov 17, 2021 at 5:22 AM Balazs Gibizer <balazs.gibizer@est.tech> wrote:
On Wed, Nov 17 2021 at 09:13:34 AM +0100, Slawek Kaplonski <skaplons@redhat.com> wrote:
Hi,
Recently I spent some time to check how many rechecks we need in Neutron to get patch merged and I compared it to some other OpenStack projects (see [1] for details). TL;DR - results aren't good for us and I think we really need to do something with that.
I really like the idea of collecting such stats. Thank you for doing it. I can even imagine to make a public dashboard somewhere with this information as it is a good indication about the health of our projects / testing.
Of course "easiest" thing to say is that we should fix issues which we are hitting in the CI to make jobs more stable. But it's not that easy. We are struggling with those jobs for very long time. We have CI related meeting every week and we are fixing what we can there. Unfortunately there is still bunch of issues which we can't fix so far because they are intermittent and hard to reproduce locally or in some cases the issues aren't realy related to the Neutron or there are new bugs which we need to investigate and fix :)
I have couple of suggestion based on my experience working with CI in nova.

We've struggled with unstable tests in TripleO as well. Here are some things we tried and implemented:
1. Created job dependencies so we only ran check tests once we knew we had the resources we needed (example we had pulled containers successfully)
2. Moved some testing to third party where we have easier control of the environment (note that third party cannot stop a change merging)
3. Used dependency pipelines to pre-qualify some dependencies ahead of letting them run wild on our check jobs
4. Requested testproject runs of changes in a less busy environment before running a full set of tests in a public zuul
5. Used a skiplist to keep track of tech debt and skip known failures that we could temporarily ignore to keep CI moving along if we're waiting on an external fix.

1) we try to open bug reports for intermittent gate failures too and keep them tagged in a list [1] so when a job fail it is easy to check if the bug is known.
2) I offer my help here now that if you see something in neutron runs that feels non neutron specific then ping me with it. Maybe we are struggling with the same problem too.
3) there was informal discussion before about a possibility to re-run only some jobs with a recheck instead for re-running the whole set. I don't know if this is feasible with Zuul and I think this only treat the symptom not the root case. But still this could be a direction if all else fails.

Cheers, gibi
So this is never ending battle for us. The problem is that we have to test various backends, drivers, etc. so as a result we have many jobs running on each patch - excluding UT, pep8 and docs jobs we have around 19 jobs in check and 14 jobs in gate queue.
In the past we made a lot of improvements, like e.g. we improved irrelevant files lists for jobs to run less jobs on some of the patches, together with QA team we did "integrated-networking" template to run only Neutron and Nova related scenario tests in the Neutron queues, we removed and consolidated some of the jobs (there is still one patch in progress for that but it should just remove around 2 jobs from the check queue). All of that are good improvements but still not enough to make our CI really stable :/
Because of all of that, I would like to ask community about any other ideas how we can improve that. If You have any ideas, please send it in this email thread or reach out to me directly on irc. We want to discuss about them in the next video CI meeting which will be on November 30th. If You would have any idea and would like to join that discussion, You are more than welcome in that meeting of course :)
[1] http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html
[1] https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure&orderby=-date_last_updated&start=0
-- Slawek Kaplonski Principal Software Engineer Red Hat
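On Oleg's first idea ("recheck bug 123456"), the comment-parsing side is simple; below is a minimal sketch of extracting the referenced bug number so a tool could bump that bug's failure counter (the comment convention and the in-memory counter are assumptions):

# Minimal sketch: parse "recheck bug <number>" comments and tally how often
# each bug is blamed. The comment convention and the in-memory counter are
# assumptions; a real tool would feed this from Gerrit events and persist it.
import re
from collections import Counter

RECHECK_RE = re.compile(r"^\s*recheck\s+bug\s+#?(\d+)\s*$", re.IGNORECASE)


def bug_from_recheck(comment):
    """Return the bug number referenced by a recheck comment, or None."""
    match = RECHECK_RE.match(comment.strip())
    return int(match.group(1)) if match else None


def tally(comments):
    counts = Counter()
    for comment in comments:
        bug = bug_from_recheck(comment)
        if bug is not None:
            counts[bug] += 1
    return counts


if __name__ == "__main__":
    sample = ["recheck bug 123456", "recheck", "recheck bug 654321"]  # sample data only
    for bug, hits in tally(sample).most_common():
        print(f"bug {bug}: blamed {hits} time(s)")

Bugs that keep climbing such a counter are natural candidates for the "hottest bugs on top" dashboard mentioned above.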
Hi,

Thank You all for all the ideas and discussion in this thread. With Yatin we prepared a summary at https://etherpad.opendev.org/p/neutron-ci-improvements and we want to go over and discuss those ideas at today's CI meeting. It will be both a video and an IRC meeting. We will talk on https://meetpad.opendev.org/neutron-ci-meetings

Agenda for the meeting is at https://etherpad.opendev.org/p/neutron-ci-meetings

Everyone is welcome to join the discussion :)

On Wednesday, 17 November 2021 09:13:34 CET Slawek Kaplonski wrote:
Hi,
Recently I spent some time to check how many rechecks we need in Neutron to get patch merged and I compared it to some other OpenStack projects (see [1] for details). TL;DR - results aren't good for us and I think we really need to do something with that.
Of course "easiest" thing to say is that we should fix issues which we are hitting in the CI to make jobs more stable. But it's not that easy. We are struggling with those jobs for very long time. We have CI related meeting every week and we are fixing what we can there. Unfortunately there is still bunch of issues which we can't fix so far because they are intermittent and hard to reproduce locally or in some cases the issues aren't realy related to the Neutron or there are new bugs which we need to investigate and fix :) So this is never ending battle for us. The problem is that we have to test various backends, drivers, etc. so as a result we have many jobs running on each patch - excluding UT, pep8 and docs jobs we have around 19 jobs in check and 14 jobs in gate queue.
In the past we made a lot of improvements, like e.g. we improved irrelevant files lists for jobs to run less jobs on some of the patches, together with QA team we did "integrated-networking" template to run only Neutron and Nova related scenario tests in the Neutron queues, we removed and consolidated some of the jobs (there is still one patch in progress for that but it should just remove around 2 jobs from the check queue). All of that are good improvements but still not enough to make our CI really stable :/
Because of all of that, I would like to ask community about any other ideas how we can improve that. If You have any ideas, please send it in this email thread or reach out to me directly on irc. We want to discuss about them in the next video CI meeting which will be on November 30th. If You would have any idea and would like to join that discussion, You are more than welcome in that meeting of course :)
[1] http://lists.openstack.org/pipermail/openstack-discuss/2021-November/025759.html
-- Slawek Kaplonski Principal Software Engineer Red Hat
participants (11)
- Balazs Gibizer
- Clark Boylan
- Dan Smith
- Ghanshyam Mann
- Jeremy Stanley
- Lajos Katona
- Oleg Bondarev
- Rodolfo Alonso Hernandez
- Ronelle Landy
- Sean Mooney
- Slawek Kaplonski