[all][TC] Stats about rechecking patches without reason given
Hi,

During the last PTG and after it, we in the TC were discussing CI resource usage and "rechecks" of the CI jobs (I know, it's the same topic again).

One of the things we would like to limit, or even avoid, is "no reason rechecks", which means writing a quick "recheck" comment without checking what really went wrong in the previous run.

We know that a hard rule that only "recheck" comments with a given reason trigger a new CI run will not work well, as people may simply start writing random things there. But we want to encourage all teams to at least investigate failures and to do as many rechecks with an explanation as possible.

For now I have prepared a simple script [1] which counts how many of all rechecks are "bare rechecks". It can be run per project (like openstack/neutron) or give a summary for all projects or teams (like Quality Assurance, for example). I prepared some stats from the last 30 days for all teams listed in https://opendev.org/openstack/governance/src/branch/master/reference/projects.yaml:

+-------------------+---------------+--------------+-------------------+
| Team              | Bare rechecks | All Rechecks | Bare rechecks [%] |
+-------------------+---------------+--------------+-------------------+
| skyline           | 20            | 20           | 100.0             |
| magnum            | 2             | 2            | 100.0             |
| zun               | 1             | 1            | 100.0             |
| mistral           | 9             | 9            | 100.0             |
| ec2-api           | 1             | 1            | 100.0             |
| barbican          | 15            | 15           | 100.0             |
| venus             | 2             | 2            | 100.0             |
| solum             | 1             | 1            | 100.0             |
| tacker            | 30            | 30           | 100.0             |
| trove             | 4             | 4            | 100.0             |
| rally             | 2             | 2            | 100.0             |
| storlets          | 5             | 5            | 100.0             |
| winstackers       | 3             | 3            | 100.0             |
| OpenStack Charms  | 32            | 33           | 96.97             |
| sahara            | 27            | 28           | 96.43             |
| keystone          | 24            | 25           | 96.0              |
| kuryr             | 120           | 126          | 95.24             |
| kolla             | 134           | 142          | 94.37             |
| Puppet OpenStack  | 94            | 103          | 91.26             |
| cloudkitty        | 10            | 11           | 90.91             |
| OpenStack-Helm    | 29            | 32           | 90.62             |
| blazar            | 8             | 9            | 88.89             |
| tripleo           | 563           | 646          | 87.15             |
| requirements      | 20            | 23           | 86.96             |
| Telemetry         | 30            | 35           | 85.71             |
| horizon           | 55            | 67           | 82.09             |
| ironic            | 131           | 164          | 79.88             |
| oslo              | 11            | 14           | 78.57             |
| heat              | 25            | 33           | 75.76             |
| cinder            | 221           | 294          | 75.17             |
| cyborg            | 6             | 8            | 75.0              |
| murano            | 3             | 4            | 75.0              |
| glance            | 20            | 27           | 74.07             |
| OpenStackSDK      | 47            | 64           | 73.44             |
| manila            | 108           | 160          | 67.5              |
| neutron           | 149           | 221          | 67.42             |
| senlin            | 2             | 3            | 66.67             |
| swift             | 16            | 25           | 64.0              |
| Quality Assurance | 106           | 167          | 63.47             |
| nova              | 41            | 71           | 57.75             |
| octavia           | 32            | 60           | 53.33             |
| designate         | 19            | 39           | 48.72             |
| OpenStackAnsible  | 41            | 226          | 18.14             |
+-------------------+---------------+--------------+-------------------+

As you can see from the list above, there is much to improve. I hope that if teams look more into the reasons for CI failures and report the bugs they find, we can make our CI more stable and as a result have fewer rechecks, which will save our infra resources :)

[1] https://github.com/slawqo/rechecks-stats/blob/main/rechecks_stats/bare_rechecks.py

--
Slawek Kaplonski
Principal Software Engineer
Red Hat
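The "Bare rechecks [%]" column in the table is the number of bare rechecks divided by the number of all rechecks. Below is a minimal sketch of that arithmetic only; it is not taken from the script at [1], and the function name `bare_recheck_percentage` is made up for illustration.

```python
def bare_recheck_percentage(bare: int, total: int) -> float:
    """Return the share of bare rechecks in percent, rounded to two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= bare <= total:
        raise ValueError("bare must be between 0 and total")
    return round(100.0 * bare / total, 2)


# A few rows copied from the table: team -> (bare rechecks, all rechecks).
rows = {
    "neutron": (149, 221),
    "nova": (41, 71),
    "OpenStackAnsible": (41, 226),
}

for team, (bare, total) in rows.items():
    print(f"{team}: {bare_recheck_percentage(bare, total)}%")
```

With the three rows above this reproduces the percentages listed in the table for those teams.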
Is it possible to adjust the script a bit in the future to add the amount of changes pushed/merged or some ratio of the amount of rechecks per merged patch? I think it would also be an interesting stat to see in addition to the amount of rechecks to understand how CI is stable or not.

On Thu, 30 Jun 2022 at 11:12, Slawek Kaplonski <skaplons@redhat.com> wrote:
[...]
On 2022-06-30 14:57:44 +0200 (+0200), Dmitriy Rabotyagov wrote:
Is it possible to adjust the script a bit in the future to add the amount of changes pushed/merged or some ratio of the amount of rechecks per merged patch? I think it would also be an interesting stat to see in addition to the amount of rechecks to understand how CI is stable or not. [...]
Recheck comment volume doesn't really provide an accurate measure of CI stability, all it tells you is how often people requested rerunning tests. Their reasons for doing it can be myriad, from not believing actual failures their changes are causing, to repeatedly rechecking successful results in hopes of reproducing some rare failure condition.
--
Jeremy Stanley
On Thu, 2022-06-30 at 13:06 +0000, Jeremy Stanley wrote:
On 2022-06-30 14:57:44 +0200 (+0200), Dmitriy Rabotyagov wrote:
Is it possible to adjust the script a bit in the future to add the amount of changes pushed/merged or some ratio of the amount of rechecks per merged patch? I think it would also be an interesting stat to see in addition to the amount of rechecks to understand how CI is stable or not. [...]
Recheck comment volume doesn't really provide an accurate measure of CI stability, all it tells you is how often people requested rerunning tests. Their reasons for doing it can be myriad, from not believing actual failures their changes are causing, to repeatedly rechecking successful results in hopes of reproducing some rare failure condition.
Yep, we also recheck successful results if we think we have fixed an intermittent CI failure that we could not reproduce reliably but created a patch for based on code inspection. In such a case we usually recheck 3 times, looking for at least 3 consecutive check +1s before we +2w.

Rarely, I also recheck if a patch is old and the logs have rotated when I'm reviewing others' work, but generally I just click the rebase button in that case. For example, I will tend to do +2 recheck if there are already cherry-picks of the patch, to avoid those having to be updated. But as I said, this is rare, as we don't often have bugfixes that sit around for 3+ months that still actually apply without a merge conflict, but it does happen.

So recheck is not a great proxy for CI stability without knowing the reason, which is why not doing bare rechecks is important.
---- On Thu, 30 Jun 2022 08:37:47 -0500 Sean Mooney <smooney@redhat.com> wrote ---
On Thu, 2022-06-30 at 13:06 +0000, Jeremy Stanley wrote:
On 2022-06-30 14:57:44 +0200 (+0200), Dmitriy Rabotyagov wrote:
Is it possible to adjust the script a bit in the future to add the amount of changes pushed/merged or some ratio of the amount of rechecks per merged patch? I think it would also be an interesting stat to see in addition to the amount of rechecks to understand how CI is stable or not. [...]
Recheck comment volume doesn't really provide an accurate measure of CI stability, all it tells you is how often people requested rerunning tests. Their reasons for doing it can be myriad, from not believing actual failures their changes are causing, to repeatedly rechecking successful results in hopes of reproducing some rare failure condition.
Yep, we also recheck successful results if we think we have fixed an intermittent CI failure that we could not reproduce reliably but created a patch for based on code inspection.
In such a case we usually recheck 3 times, looking for at least 3 consecutive check +1s before we +2w.
Rarely, I also recheck if a patch is old and the logs have rotated when I'm reviewing others' work, but generally I just click the rebase button in that case. For example, I will tend to do +2 recheck if there are already cherry-picks of the patch, to avoid those having to be updated. But as I said, this is rare, as we don't often have bugfixes that sit around for 3+ months that still actually apply without a merge conflict, but it does happen.
So recheck is not a great proxy for CI stability without knowing the reason, which is why not doing bare rechecks is important.
I think having elastic-recheck up again will help us check CI stability and see which bugs are causing the most issues. dpawlik is working on that.

The TC's idea with Slawek's script is to keep monitoring bare rechecks and to build a culture in all OpenStack projects of not doing bare rechecks and of at least finding out which bug caused the failure. With that, frequent bugs will get more attention and we can ask the respective teams to fix them.

We in the TC will monitor this regularly and post which projects are doing the most bare rechecks. We ask each project, even if your bare recheck count is low, to add this monitoring to your weekly meeting as well, so that you keep reinforcing the message and keep an eye on the numbers. I have added it to the QA meeting too.

-gmann
Hi,

On Thursday, 30 June 2022 15:37:47 CEST Sean Mooney wrote:
On Thu, 2022-06-30 at 13:06 +0000, Jeremy Stanley wrote:
On 2022-06-30 14:57:44 +0200 (+0200), Dmitriy Rabotyagov wrote:
Is it possible to adjust the script a bit in the future to add the amount of changes pushed/merged or some ratio of the amount of rechecks per merged patch? I think it would also be an interesting stat to see in addition to the amount of rechecks to understand how CI is stable or not. [...]
Recheck comment volume doesn't really provide an accurate measure of CI stability, all it tells you is how often people requested rerunning tests. Their reasons for doing it can be myriad, from not believing actual failures their changes are causing, to repeatedly rechecking successful results in hopes of reproducing some rare failure condition.
Yep, we also recheck successful results if we think we have fixed an intermittent CI failure that we could not reproduce reliably but created a patch for based on code inspection.
In such a case we usually recheck 3 times, looking for at least 3 consecutive check +1s before we +2w.
Rarely, I also recheck if a patch is old and the logs have rotated when I'm reviewing others' work, but generally I just click the rebase button in that case. For example, I will tend to do +2 recheck if there are already cherry-picks of the patch, to avoid those having to be updated. But as I said, this is rare, as we don't often have bugfixes that sit around for 3+ months that still actually apply without a merge conflict, but it does happen.
So recheck is not a great proxy for CI stability without knowing the reason, which is why not doing bare rechecks is important.
That's true. The reason I wrote the script to check "bare" rechecks is to see how often people just write "recheck" without even checking the reason for the failures.

For CI stability, some time ago I wrote another script, https://github.com/slawqo/rechecks-stats/blob/main/rechecks_stats/rechecks.py, which looks only at merged patches and counts the number of "Failed build" comments from Zuul on the last, merged patch set. That is certainly not a perfect metric either, but IMO it gives a better view of CI stability, because it does not count rechecks of passed CI runs done to look for intermittent failures, nor failures caused by the patch itself.

--
Slawek Kaplonski
Principal Software Engineer
Red Hat
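The metric described in this message, counting CI failure comments only on the final patch set of a merged change, could be sketched roughly as follows. This is an illustration under stated assumptions and not the code of rechecks.py: the `Comment` record, the author name "Zuul" and the marker text "Build failed" are all hypothetical stand-ins for whatever the review system actually returns.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    patch_set: int  # patch set number the comment was left on
    author: str     # display name of the commenter
    message: str    # comment text


def failed_builds_on_last_patch_set(comments, ci_author="Zuul", marker="Build failed"):
    """Count CI failure comments on the highest-numbered patch set.

    ci_author and marker are assumptions; adjust them to the real data.
    """
    if not comments:
        return 0
    last = max(c.patch_set for c in comments)
    return sum(
        1
        for c in comments
        if c.patch_set == last and c.author == ci_author and marker in c.message
    )


# Hypothetical history of one merged change.
comments = [
    Comment(1, "Zuul", "Build failed (check pipeline)."),
    Comment(2, "Zuul", "Build failed (check pipeline)."),
    Comment(2, "Some Reviewer", "recheck mirror timeout in job X"),
    Comment(2, "Zuul", "Build succeeded (check pipeline)."),
]
print(failed_builds_on_last_patch_set(comments))  # prints 1
```

The failure on patch set 1 is ignored on purpose: it may have been caused by the patch itself, which is the case this metric tries to leave out.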
I think I need to rephrase myself a bit.

Like if you have 100 patches merged and 2 rechecks, even if all of them are bare, it doesn't mean that developers don't care about resources. It's more that they are so sure of their tests' stability that they are absolutely sure it's an infra failure.

Or vice versa, if there are 20 rechecks for 2 patches, even if none of them are bare, it's still weird and smth worth reconsidering from the project's perspective.

I hope I explained better now what I meant.

On Thu, 30 Jun 2022, 16:56 Slawek Kaplonski <skaplons@redhat.com> wrote:
[...]
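The two situations contrasted in this message can be put into numbers with a simple rechecks-per-merged-patch ratio. The snippet below is only a worked example of that ratio; the function name is made up and nothing here reflects what the existing script computes.

```python
def rechecks_per_merged_patch(rechecks: int, merged_patches: int) -> float:
    """Return the average number of rechecks per merged patch."""
    if merged_patches <= 0:
        raise ValueError("merged_patches must be positive")
    if rechecks < 0:
        raise ValueError("rechecks must not be negative")
    return rechecks / merged_patches


# 100 merged patches with 2 rechecks in total.
print(rechecks_per_merged_patch(2, 100))  # prints 0.02

# 2 merged patches with 20 rechecks in total.
print(rechecks_per_merged_patch(20, 2))   # prints 10.0
```

The ratio says nothing about whether the rechecks carried a reason, so it complements the bare-recheck percentage rather than replacing it.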
On Thu, Jun 30, 2022, at 10:34 AM, Dmitriy Rabotyagov wrote:
I think I need to rephrase myself a bit.
Like if you have 100 patches merged and 2 rechecks, even if all of them are bare, it doesn't mean that developers don't care about resources. It's more that they are so sure of their tests' stability that they are absolutely sure it's an infra failure.
I'm not sure I understand why certain infra failures don't deserve a note recording why the recheck was necessary if other failures do.
Or vice versa, if there are 20 rechecks for 2 patches, even if none of them are bare, it's still weird and smth worth reconsidering from the project's perspective.
I think the idea is to create a culture of debugging and record keeping. Yes, I would expect after a few rechecks that maybe the root causes would be addressed in this case, but the first step in doing that is identifying the problem and making note of it.
I hope I explained better now what I meant.
Or vice versa, if there are 20 rechecks for 2 patches, even if none of them are bare, it's still weird and smth worth reconsidering from the project's perspective.
I think the idea is to create a culture of debugging and record keeping. Yes, I would expect after a few rechecks that maybe the root causes would be addressed in this case, but the first step in doing that is identifying the problem and making note of it.
Right, that is the goal. Asking for a message at least sets the expectation that people are looking at the reasons for the fails. Just because they don't doesn't mean they aren't, or don't care, but I think it helps reinforce the desired behavior. If nothing else, it also helps observers realize "huh, I've seen a bunch of rechecks about $reason lately, maybe we should look at that".

--Dan
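Spotting "a bunch of rechecks about $reason" is easier when the reasons are tallied. A rough sketch, assuming the recheck comments have already been collected as plain strings and that a recheck comment starts with the word "recheck"; the function name `tally_reasons` is invented for this example.

```python
from collections import Counter


def tally_reasons(comments):
    """Count how often each recheck reason appears (case-insensitive)."""
    counts = Counter()
    for comment in comments:
        words = comment.strip().split(maxsplit=1)
        if not words or words[0].lower() != "recheck":
            continue  # not a recheck comment
        reason = words[1].strip().lower() if len(words) > 1 else "(no reason given)"
        counts[reason] += 1
    return counts


# Hypothetical comments gathered from several changes.
comments = [
    "recheck mirror timeout",
    "Recheck Mirror timeout",
    "recheck",
    "looks good to me",
]
for reason, count in tally_reasons(comments).most_common():
    print(f"{count:3d}  {reason}")
```

Free-text reasons rarely match exactly, so in practice some grouping (for example by bug number) would probably be needed before the counts become useful.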
On 30/06/2022 20:06, Dan Smith wrote:
Or vice versa, if there are 20 rechecks for 2 patches, even if none of them are bare, it's still weird and smth worth reconsidering from the project's perspective.
I think the idea is to create a culture of debugging and record keeping. Yes, I would expect after a few rechecks that maybe the root causes would be addressed in this case, but the first step in doing that is identifying the problem and making note of it.
Right, that is the goal. Asking for a message at least sets the expectation that people are looking at the reasons for the fails. Just because they don't doesn't mean they aren't, or don't care, but I think it helps reinforce the desired behavior. If nothing else, it also helps observers realize "huh, I've seen a bunch of rechecks about $reason lately, maybe we should look at that".
So, what happens with the script, when you add 2 comments, one: "network error during package install, let's try again" and the next message "recheck".

In my understanding, that would count as a recheck without a reason given (by the script). Maybe it's worth documenting how to give better proof that someone looked into the logs and tried to get to the root cause of a previous CI failure?

The other issue I see here is that with CI being flaky, chances seem to get better when doing a recheck.

An extreme example: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/844519 required 8 rechecks, with no changes in the patch itself and no dependencies. The CI always failed in different checks.

Matthias
---- On Fri, 01 Jul 2022 01:46:23 -0500 Matthias Runge <mrunge@matthias-runge.de> wrote ---
On 30/06/2022 20:06, Dan Smith wrote:
Or vice versa, if there are 20 rechecks for 2 patches, even if none of them are bare, it's still weird and smth worth reconsidering from the project's perspective.
I think the idea is to create a culture of debugging and record keeping. Yes, I would expect after a few rechecks that maybe the root causes would be addressed in this case, but the first step in doing that is identifying the problem and making note of it.
Right, that is the goal. Asking for a message at least sets the expectation that people are looking at the reasons for the fails. Just because they don't doesn't mean they aren't, or don't care, but I think it helps reinforce the desired behavior. If nothing else, it also helps observers realize "huh, I've seen a bunch of rechecks about $reason lately, maybe we should look at that".
So, what happens with the script, when you add 2 comments, one: "network error during package install, let's try again" and the next message "recheck".
In this case, you can always write "recheck network error during package install, let's try again", or, if you have already added a lengthy description of the failure and then want to recheck, you can add a one-line summary to the recheck comment. The overall idea is not to literally count bare rechecks but to build a habit among us of looking at the failure before we just recheck.
In my understanding, that would count as a recheck without a reason given (by the script). Maybe it's worth documenting how to give better proof that someone looked into the logs and tried to get to the root cause of a previous CI failure?
I think Dan has written a nice document about it, including how to debug the failure:
https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-tes...

We welcome everyone to extend it with more detail or examples if there is a project-specific case that is not covered.

-gmann
The other issue I see here is that with CI being flaky, chances seem to get better when doing a recheck.
An extreme example: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/844519 required 8 rechecks, with no changes in the patch itself and no dependencies. The CI always failed in different checks.
Matthias
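The two-comment case discussed above can be illustrated with a per-comment classifier. This is a sketch of one plausible rule, not the logic of the bare_rechecks.py script: it assumes each comment is judged on its own, that a recheck comment starts with the word "recheck", and that the review system may prefix messages with a "Patch Set N:" header. The names `classify`, `PATCH_SET_HEADER` and `RECHECK` are made up here.

```python
import re

# Assumption: review messages may start with a "Patch Set N:" header.
PATCH_SET_HEADER = re.compile(r"^Patch Set \d+:\s*", re.IGNORECASE)
# "recheck" as a whole word at the start, optionally followed by a reason.
RECHECK = re.compile(r"^recheck\b(.*)$", re.IGNORECASE | re.DOTALL)


def classify(message: str) -> str:
    """Return 'bare', 'with reason', or 'not a recheck' for one comment."""
    body = PATCH_SET_HEADER.sub("", message.strip(), count=1).strip()
    match = RECHECK.match(body)
    if match is None:
        return "not a recheck"
    reason = match.group(1).strip(" \t\n:,-")
    return "with reason" if reason else "bare"


# Reason and recheck split over two comments: the second one is bare.
print(classify("network error during package install, let's try again"))
print(classify("recheck"))

# Reason on the same line as the recheck, as suggested in the reply.
print(classify("recheck network error during package install, let's try again"))
```

Under this rule the explanation only counts when it is part of the recheck comment itself, which matches the advice to put a one-line summary after "recheck".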
participants (8)
- Clark Boylan
- Dan Smith
- Dmitriy Rabotyagov
- Ghanshyam Mann
- Jeremy Stanley
- Matthias Runge
- Sean Mooney
- Slawek Kaplonski