[ironic] [qa] ironic-tempest-plugin CI bloat
Hi all and happy new year :)
As you know, tempest plugins are branchless, so the CI of ironic-tempest-plugin has to run tests on all supported branches. Currently it amounts to 16 (!) voting devstack jobs. With each of them having some small probability of a random failure, it is impossible to land anything without at least one recheck, usually more.
The bad news is, we only run the master API tests job, and these tests are changed more often than the others. We already had a minor stable branch breakage because of it [1]. We need to run 3 more jobs: for Pike, Queens and Rocky. And I've just spotted a missing master multinode job, which is defined but does not run for some reason :(
Here is my proposal to deal with gate bloat on ironic-tempest-plugin:
1. Do not run CI jobs at all for unsupported branches and branches in extended maintenance. For Ocata this has already been done in [2].
2. Make jobs running with N-3 (currently Pike) and older non-voting (and thus remove them from the gate queue). I have a gut feeling that a change that breaks N-3 is very likely to break N-2 (currently Queens) as well, so it's enough to have N-2 voting.
3. Make the discovery and the multinode jobs from all stable branches non-voting. These jobs cover the tests that get changed very infrequently (if ever). These are also the jobs with the highest random failure rate.
4. Add the API tests, voting for Queens to master, non-voting for Pike (as proposed above).
This should leave us with 20 jobs, but with only 11 of them voting. Which is still a lot, but probably manageable.
The corresponding change is [3], please comment here or there.
Dmitry
[1] https://review.openstack.org/622177 [2] https://review.openstack.org/621537 [3] https://review.openstack.org/627955
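One way to read the four rules of the proposal is as a small decision function. The sketch below is only an illustration of that reading, not the actual Zuul configuration (that lives in the change referenced as [3]). The function name `job_policy`, the `branch_age` numbering (0 for master, 1 for N-1, and so on), the job kind strings and the `supported` flag are all hypothetical names introduced here.

```python
from typing import Optional

# Job kinds that rule 3 makes non-voting on every stable branch.
# The strings are illustrative labels, not real Zuul job names.
STABLE_NON_VOTING_KINDS = {"discovery", "multinode"}


def job_policy(branch_age: int, job_kind: str, supported: bool = True) -> Optional[str]:
    """Return "voting", "non-voting", or None when the job should not run.

    branch_age is 0 for master, 1 for N-1, 2 for N-2, and so on.
    supported is False for unsupported and extended-maintenance branches.
    """
    if not supported:
        return None  # rule 1: no CI for unsupported / EM branches
    if branch_age >= 3:
        return "non-voting"  # rule 2: N-3 and older are non-voting
    if branch_age >= 1 and job_kind in STABLE_NON_VOTING_KINDS:
        return "non-voting"  # rule 3: discovery and multinode on stable
    return "voting"  # rule 4 and everything else: N-2 up to master


if __name__ == "__main__":
    branches = [("master", 0, True), ("rocky", 1, True), ("queens", 2, True),
                ("pike", 3, True), ("ocata", 4, False)]
    for name, age, supported in branches:
        for kind in ("api", "multinode"):
            print(name, kind, job_policy(age, kind, supported))
```

Under this reading the API job votes on master, Rocky and Queens, is non-voting on Pike, and does not run on Ocata.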
On 1/2/19 12:18 PM, Dmitry Tantsur wrote:
Hi all and happy new year :)
As you know, tempest plugins are branchless, so the CI of ironic-tempest-plugin has to run tests on all supported branches. Currently it amounts to 16 (!) voting devstack jobs. With each of them having some small probability of a random failure, it is impossible to land anything without at least one recheck, usually more.
The bad news is, we only run the master API tests job, and these tests are changed more often than the others. We already had a minor stable branch breakage because of it [1]. We need to run 3 more jobs: for Pike, Queens and Rocky. And I've just spotted a missing master multinode job, which is defined but does not run for some reason :(
Better news: the API tests did not have a separate job before Rocky, so we only need to add Rocky. However, we'll get to 4 jobs in the future. The multinode job is missing because it was renamed on master, and apparently Zuul does not report it Oo
Here is my proposal to deal with gate bloat on ironic-tempest-plugin:
1. Do not run CI jobs at all for unsupported branches and branches in extended maintenance. For Ocata this has already been done in [2].
2. Make jobs running with N-3 (currently Pike) and older non-voting (and thus remove them from the gate queue). I have a gut feeling that a change that breaks N-3 is very likely to break N-2 (currently Queens) as well, so it's enough to have N-2 voting.
3. Make the discovery and the multinode jobs from all stable branches non-voting. These jobs cover the tests that get changed very infrequently (if ever). These are also the jobs with the highest random failure rate.
4. Add the API tests, voting for Queens to master, non-voting for Pike (as proposed above).
Only Rocky here for now.
This should leave us with 20 jobs, but with only 11 of them voting. Which is still a lot, but probably manageable.
The corresponding change is [3], please comment here or there.
Dmitry
[1] https://review.openstack.org/622177 [2] https://review.openstack.org/621537 [3] https://review.openstack.org/627955
On Wed, Jan 2, 2019, at 3:18 AM, Dmitry Tantsur wrote:
Hi all and happy new year :)
As you know, tempest plugins are branchless, so the CI of ironic-tempest-plugin has to run tests on all supported branches. Currently it amounts to 16 (!) voting devstack jobs. With each of them having some small probability of a random failure, it is impossible to land anything without at least one recheck, usually more.
The bad news is, we only run the master API tests job, and these tests are changed more often than the others. We already had a minor stable branch breakage because of it [1]. We need to run 3 more jobs: for Pike, Queens and Rocky. And I've just spotted a missing master multinode job, which is defined but does not run for some reason :(
Here is my proposal to deal with gate bloat on ironic-tempest-plugin:
1. Do not run CI jobs at all for unsupported branches and branches in extended maintenance. For Ocata this has already been done in [2].
2. Make jobs running with N-3 (currently Pike) and older non-voting (and thus remove them from the gate queue). I have a gut feeling that a change that breaks N-3 is very likely to break N-2 (currently Queens) as well, so it's enough to have N-2 voting.
3. Make the discovery and the multinode jobs from all stable branches non-voting. These jobs cover the tests that get changed very infrequently (if ever). These are also the jobs with the highest random failure rate.
Has any work been done to investigate why these jobs fail? And if not maybe we should stop running the jobs entirely. Non voting jobs that aren't reliable will just get ignored.
4. Add the API tests, voting for Queens to master, non-voting for Pike (as proposed above).
This should leave us with 20 jobs, but with only 11 of them voting. Which is still a lot, but probably manageable.
The corresponding change is [3], please comment here or there.
Dmitry
[1] https://review.openstack.org/622177 [2] https://review.openstack.org/621537 [3] https://review.openstack.org/627955
On 1/2/19 7:24 PM, Clark Boylan wrote:
On Wed, Jan 2, 2019, at 3:18 AM, Dmitry Tantsur wrote:
Hi all and happy new year :)
As you know, tempest plugins are branchless, so the CI of ironic-tempest-plugin has to run tests on all supported branches. Currently it amounts to 16 (!) voting devstack jobs. With each of them having some small probability of a random failure, it is impossible to land anything without at least one recheck, usually more.
The bad news is, we only run the master API tests job, and these tests are changed more often than the others. We already had a minor stable branch breakage because of it [1]. We need to run 3 more jobs: for Pike, Queens and Rocky. And I've just spotted a missing master multinode job, which is defined but does not run for some reason :(
Here is my proposal to deal with gate bloat on ironic-tempest-plugin:
1. Do not run CI jobs at all for unsupported branches and branches in extended maintenance. For Ocata this has already been done in [2].
2. Make jobs running with N-3 (currently Pike) and older non-voting (and thus remove them from the gate queue). I have a gut feeling that a change that breaks N-3 is very likely to break N-2 (currently Queens) as well, so it's enough to have N-2 voting.
3. Make the discovery and the multinode jobs from all stable branches non-voting. These jobs cover the tests that get changed very infrequently (if ever). These are also the jobs with the highest random failure rate.
Has any work been done to investigate why these jobs fail? And if not maybe we should stop running the jobs entirely. Non voting jobs that aren't reliable will just get ignored.
From my experience it's PXE failing or just generic timeout on slow nodes. Note that they still don't fail too often, it's their total number that makes it problematic. When you have 20 jobs each failing with, say, 5% rate it's just 35% chance of passing (unless I cannot do math). But to answer your question, yes, we do put work in that. We just never got to 0% of random failures.
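The arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes every job fails independently and with the same rate; the 5% figure is the illustrative number from the message, not a measured failure rate, and the helper name `chance_all_pass` is made up for this example.

```python
def chance_all_pass(num_jobs: int, failure_rate: float) -> float:
    """Probability that every job passes, assuming independent failures."""
    return (1.0 - failure_rate) ** num_jobs


if __name__ == "__main__":
    # 11 = voting jobs after the proposal, 16 = voting jobs today,
    # 20 = the example used in the message above.
    for jobs in (11, 16, 20):
        chance = chance_all_pass(jobs, 0.05)
        print(f"{jobs} voting jobs: {chance:.1%} chance of a clean run")
```

With these assumptions, 20 jobs give roughly a 36% chance of a clean run, which is close to the 35% quoted above, while 11 voting jobs give roughly 57%.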
4. Add the API tests, voting for Queens to master, non-voting for Pike (as proposed above).
This should leave us with 20 jobs, but with only 11 of them voting. Which is still a lot, but probably manageable.
The corresponding change is [3], please comment here or there.
Dmitry
[1] https://review.openstack.org/622177 [2] https://review.openstack.org/621537 [3] https://review.openstack.org/627955
---- On Thu, 03 Jan 2019 03:39:00 +0900 Dmitry Tantsur <dtantsur@redhat.com> wrote ----
On 1/2/19 7:24 PM, Clark Boylan wrote:
On Wed, Jan 2, 2019, at 3:18 AM, Dmitry Tantsur wrote:
Hi all and happy new year :)
As you know, tempest plugins are branchless, so the CI of ironic-tempest-plugin has to run tests on all supported branches. Currently it amounts to 16 (!) voting devstack jobs. With each of them having some small probability of a random failure, it is impossible to land anything without at least one recheck, usually more.
The bad news is, we only run the master API tests job, and these tests are changed more often than the others. We already had a minor stable branch breakage because of it [1]. We need to run 3 more jobs: for Pike, Queens and Rocky. And I've just spotted a missing master multinode job, which is defined but does not run for some reason :(
Yeah, that is because the parent of the ironic multinode job, "tempest-multinode-full", is restricted to run only on master. That restriction was kept until all the multinode Zuul v3 work had been backported down to Pike, which is now complete. I am enabling this job from Pike onwards [1] so that the multinode job can also run on stable branches.
Here is my proposal to deal with gate bloat on ironic-tempest-plugin:
1. Do not run CI jobs at all for unsupported branches and branches in extended maintenance. For Ocata this has already been done in [2].
2. Make jobs running with N-3 (currently Pike) and older non-voting (and thus remove them from the gate queue). I have a gut feeling that a change that breaks N-3 is very likely to break N-2 (currently Queens) as well, so it's enough to have N-2 voting.
3. Make the discovery and the multinode jobs from all stable branches non-voting. These jobs cover the tests that get changed very infrequently (if ever). These are also the jobs with the highest random failure rate.
Has any work been done to investigate why these jobs fail? And if not maybe we should stop running the jobs entirely. Non voting jobs that aren't reliable will just get ignored.
From my experience it's PXE failing or just generic timeout on slow nodes. Note that they still don't fail too often, it's their total number that makes it problematic. When you have 20 jobs each failing with, say, 5% rate it's just 35% chance of passing (unless I cannot do math).
But to answer your question, yes, we do put work in that. We just never got to 0% of random failures.
While enabling the multinode job on stable branches, I hit a consistent failure of the multinode job on Pike and Queens, while it runs fine on Rocky. The failures are in the migration tests and are due to a hostname mismatch. I have not debugged the failure yet, but we will make the multinode job runnable on stable branches as well.
[1] https://review.openstack.org/#/c/610938/ [2] https://review.openstack.org/#/q/topic:tempest-multinode-slow-stable+(status...)
-gmann
4. Add the API tests, voting for Queens to master, non-voting for Pike (as proposed above).
This should leave us with 20 jobs, but with only 11 of them voting. Which is still a lot, but probably manageable.
The corresponding change is [3], please comment here or there.
Dmitry
[1] https://review.openstack.org/622177 [2] https://review.openstack.org/621537 [3] https://review.openstack.org/627955
---- On Wed, 02 Jan 2019 20:18:40 +0900 Dmitry Tantsur <dtantsur@redhat.com> wrote ----
Hi all and happy new year :)
As you know, tempest plugins are branchless, so the CI of ironic-tempest-plugin has to run tests on all supported branches. Currently it amounts to 16 (!) voting devstack jobs. With each of them having some small probability of a random failure, it is impossible to land anything without at least one recheck, usually more.
The bad news is, we only run the master API tests job, and these tests are changed more often than the others. We already had a minor stable branch breakage because of it [1]. We need to run 3 more jobs: for Pike, Queens and Rocky. And I've just spotted a missing master multinode job, which is defined but does not run for some reason :(
Here is my proposal to deal with gate bloat on ironic-tempest-plugin:
1. Do not run CI jobs at all for unsupported branches and branches in extended maintenance. For Ocata this has already been done in [2].
+1. We have the same policy in Tempest as well [1]. You mean not running jobs for unsupported/EM branches as part of the master testing, right? CI on the unsupported/EM branches themselves can keep running as long as the jobs are passing or the EM maintainers want to run them.
2. Make jobs running with N-3 (currently Pike) and older non-voting (and thus remove them from the gate queue). I have a gut feeling that a change that breaks N-3 is very likely to break N-2 (currently Queens) as well, so it's enough to have N-2 voting.
IMO, running all supported stable branches as voting makes more sense than running the oldest one (N-3, as you mentioned) as non-voting. That way the tempest plugins will stay successfully maintained on N-3; otherwise they are likely to break for that branch, especially in the case of feature-discovery-based tests.
3. Make the discovery and the multinode jobs from all stable branches non-voting. These jobs cover the tests that get changed very infrequently (if ever). These are also the jobs with the highest random failure rate.
4. Add the API tests, voting for Queens to master, non-voting for Pike (as proposed above).
This should leave us with 20 jobs, but with only 11 of them voting. Which is still a lot, but probably manageable.
The corresponding change is [3], please comment here or there.
Dmitry
[1] https://review.openstack.org/622177 [2] https://review.openstack.org/621537 [3] https://review.openstack.org/627955
[1] https://docs.openstack.org/tempest/latest/stable_branch_support_policy.html -gmann
participants (3)
- Clark Boylan
- Dmitry Tantsur
- Ghanshyam Mann