Hi Stackers,

Some time ago I started monitoring something we have called "bare rechecks", which are "recheck" comments without any reason given. I was also asking various teams to improve that: try to check the failures of the CI jobs and put some kind of reason in the recheck comment. That went pretty well, and currently most of the teams are doing a pretty good job there. See the stats from the last 30 days:

+--------------------+---------------+--------------+-------------------+
| Team               | Bare rechecks | All Rechecks | Bare rechecks [%] |
+--------------------+---------------+--------------+-------------------+
| OpenStackSDK       |             8 |            8 |             100.0 |
| OpenStack-Helm     |            41 |           41 |             100.0 |
| cloudkitty         |             1 |            1 |             100.0 |
| monasca            |             6 |            6 |             100.0 |
| skyline            |             2 |            2 |             100.0 |
| trove              |             1 |            1 |             100.0 |
| rally              |             1 |            1 |             100.0 |
| watcher            |             1 |            1 |             100.0 |
| mistral            |             5 |            8 |              62.5 |
| kuryr              |             1 |            2 |              50.0 |
| zaqar              |             1 |            2 |              50.0 |
| horizon            |             6 |           15 |              40.0 |
| swift              |             9 |           25 |              36.0 |
| manila             |             5 |           14 |             35.71 |
| kolla              |            14 |           68 |             20.59 |
| cinder             |            15 |           77 |             19.48 |
| ironic             |            10 |           53 |             18.87 |
| glance             |            11 |           59 |             18.64 |
| requirements       |            13 |           73 |             17.81 |
| Telemetry          |             2 |           12 |             16.67 |
| nova               |            16 |          105 |             15.24 |
| OpenStackAnsible   |             5 |           40 |              12.5 |
| magnum             |             1 |            9 |             11.11 |
| neutron            |            13 |          118 |             11.02 |
| Quality Assurance  |             3 |           30 |              10.0 |
| Release Management |             1 |           12 |              8.33 |
| octavia            |             2 |           34 |              5.88 |
| keystone           |             0 |            1 |               0.0 |
| OpenStack Charms   |             0 |            1 |               0.0 |
| oslo               |             0 |            8 |               0.0 |
| Puppet OpenStack   |             0 |           95 |               0.0 |
| barbican           |             0 |            2 |               0.0 |
| designate          |             0 |            4 |               0.0 |
| heat               |             0 |            7 |               0.0 |
| tacker             |             0 |           63 |               0.0 |
| tripleo            |             0 |            4 |               0.0 |
+--------------------+---------------+--------------+-------------------+

Now I thought that it's time for the "next step", so I spent some time looking at the reasons for the rechecks done in the last 30 days.
I collected those reasons with script [1], and here are some of the top reasons found:

+----------------------------------------------------------+-----------------------+
| Recheck comment                                          | Number of occurrences |
+----------------------------------------------------------+-----------------------+
| recheck unrelated failure                                |                    48 |
| recheck irrelevant failure                               |                    45 |
| recheck unrelated test failure                           |                     9 |
| recheck testing                                          |                     9 |
| recheck infra-failure                                    |                     9 |
| recheck tempest-slow-py3                                 |                     8 |
| recheck nova-next                                        |                     8 |
| recheck timeout                                          |                     7 |
| recheck neutron-functional-with-uwsgi                    |                     7 |
| recheck failure not related to this change               |                     7 |
| recheck ovn-octavia-provider-functional-master unrelated |                     7 |
| recheck jobs running too early                           |                     6 |
| recheck dependency updated                               |                     5 |
| recheck random failure                                   |                     5 |
| recheck - bifrost bug                                    |                     5 |
+----------------------------------------------------------+-----------------------+

Those comments are shown as they were put in Gerrit. Of course there were many more different comments there; I just wanted to highlight some of the most common ones.

I also asked ChatGPT to group all of them into some common groups. Below are some of the main categories of reasons found by ChatGPT:

CI/Testing Issues:
  * Timeouts in various tests (e.g., tempest tests, tox tests).
  * Infrastructure failures impacting CI.
  * Kernel panics or guest panics during testing.
  * Dependency-related failures (merged or updated).
  * Unexpected failures or intermittent issues.

Dependency Management:
  * Dependency patches merged or updated.
  * Dependency-related failures.

Release/Documentation Updates:
  * Release note fixes or updates.
  * Content cleanup or removal.
  * Documentation-related failures.

Infrastructure/Tooling:
  * Infrastructure mirror issues.
  * Tooling updates or fixes (e.g., grub2-tools, proliantutils).
  * Issues related to mirror servers.

Specific Project/Component Issues:
  * Nova-related failures (e.g., nova-ceph-multistore).
  * Cinder-related failures (e.g., cinder-plugin-ceph-tempest).
  * OpenStack SDK related failures.

Configuration/Setup Problems:
  * Problems with deployment or setup (e.g., grub2-tools missing, bifrost bug).
  * Issues with configurations (e.g., host patterns).

Network/Connection Problems:
  * DNS resolution issues.
  * Timeout errors related to network operations.

Post-Failure Analysis:
  * Attempts to reproduce failures.
  * Investigation into infrastructural problems.

Finally, some conclusions from this analysis. It seems that the most common reason given for our gate instability is something called an "unrelated/irrelevant failure", along with things like "infra failure". In both cases we can't really do much to improve things, as this information is not enough. So I would like to ask all of you to try to be more precise in the recheck comments and give more detailed reasons why you recheck your patch. If, for example, you have to recheck the patch due to an "unrelated failure", please also write the number of the bug which causes the failure, or simply give the name of the unrelated failed test, or something like that. That way, when I do a similar analysis in e.g. a month from now, maybe it will be easier to identify the most common bugs which are causing "unrelated failures" and to prioritize work on fixing them.

[1] https://github.com/slawqo/rechecks-stats/blob/main/rechecks_stats/rechecks_r...

--
Slawek Kaplonski
Principal Software Engineer
Red Hat
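For readers who want to reproduce this kind of tally on their own review comments, here is a minimal sketch of how bare rechecks and recheck reasons could be counted. It is not the script referenced as [1]; the regular expression, the function names, and the assumption that comments are already available as plain strings are all illustrative.

```python
import re
from collections import Counter

# Assumed pattern: a comment is a recheck if it starts with the word "recheck".
# Anything after it (minus separators such as " - " or ":") is the reason.
RECHECK_RE = re.compile(
    r"^\s*recheck\b[\s:,-]*(?P<reason>.*)$", re.IGNORECASE | re.DOTALL
)


def split_recheck(comment):
    """Return None if comment is not a recheck, else its normalized reason.

    An empty string means a bare recheck (no reason given).
    """
    match = RECHECK_RE.match(comment)
    if match is None:
        return None
    # Collapse whitespace and lowercase so equal reasons are counted together.
    return " ".join(match.group("reason").split()).lower()


def summarize(comments):
    """Count all rechecks, bare rechecks, and occurrences of each reason."""
    reasons = Counter()
    total = 0
    bare = 0
    for comment in comments:
        reason = split_recheck(comment)
        if reason is None:
            continue  # ordinary review comment, not a recheck
        total += 1
        if reason:
            reasons[reason] += 1
        else:
            bare += 1
    percent = round(100.0 * bare / total, 2) if total else 0.0
    return {"all": total, "bare": bare, "bare_percent": percent, "reasons": reasons}
```

Calling `summarize()` on one team's comments gives the three numbers of a table row, and `summary["reasons"].most_common(15)` gives a list shaped like the top-reasons table. Note that such a check can only tell whether some text follows "recheck", not whether that text is meaningful.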
On Tuesday, 19.03.2024 at 10:04 +0100, Sławek Kapłoński wrote:
Hi Stackers,
Hi Sławek, what I always wanted to ask: There must be a specific reason why bare rechecks are allowed at all? Why don't we simply enforce that there always must be a reason given? Of course we can't enforce a meaningful reason being stated, but this is already the case now, so it would not get worse if we just disabled the possibility for bare rechecks, no? Thanks for doing all this work! -- Sven Kieske Senior Cloud Engineer Mail: kieske@osism.tech Web: https://osism.tech OSISM GmbH Teckstraße 62 / 70190 Stuttgart / Deutschland Geschäftsführer: Christian Berendt Unternehmenssitz: Stuttgart Amtsgericht: Stuttgart, HRB 756139
On 2024-03-21 15:56:42 +0100 (+0100), Sven Kieske wrote: [...]
There must be a specific reason why bare rechecks are allowed at all? Why don't we simply enforce that there always must be a reason given?
Of course we can't enforce a meaningful reason being stated, but this is already the case now, so it would not get worse if we just disabled the possibility for bare rechecks, no? [...]
There was a time when we did exactly that, it lasted several years and the end result did not yield any measurable improvement in data quality. In fact, at one point we got restrictive enough to require bug numbers and the outcome was that people either made up nonexistent bug numbers or just put in any old bug they knew the number for regardless of whether it was related to the failure. Yes it's been a while so I can't say for certain that the results would be the same if we tried again, but I don't have a good reason to believe it would turn out any different. Also, bear in mind, the pipeline trigger patterns apply to the entire Zuul tenant used by the OpenStack project, which is currently shared by any other projects outside OpenStack's governance, so if this change were enforced (again) it would disrupt their contributors' workflows as well. -- Jeremy Stanley
On 2024-03-21 15:23:21 +0000 (+0000), Jeremy Stanley wrote: [...]
the pipeline trigger patterns apply to the entire Zuul tenant used by the OpenStack project, which is currently shared by any other projects outside OpenStack's governance [...]
Sorry, that was supposed to be "shared by many other projects." -- Jeremy Stanley
Hi,

On Thursday, 21 March 2024 at 16:23:21 CET, Jeremy Stanley wrote:
On 2024-03-21 15:56:42 +0100 (+0100), Sven Kieske wrote: [...]
There must be a specific reason why bare rechecks are allowed at all? Why don't we simply enforce that there always must be a reason given?
Of course we can't enforce a meaningful reason being stated, but this is already the case now, so it would not get worse if we just disabled the possibility for bare rechecks, no? [...]
There was a time when we did exactly that, it lasted several years and the end result did not yield any measurable improvement in data quality. In fact, at one point we got restrictive enough to require bug numbers and the outcome was that people either made up nonexistent bug numbers or just put in any old bug they knew the number for regardless of whether it was related to the failure.
Yes it's been a while so I can't say for certain that the results would be the same if we tried again, but I don't have a good reason to believe it would turn out any different. Also, bear in mind, the pipeline trigger patterns apply to the entire Zuul tenant used by the OpenStack project, which is currently shared by any other projects outside OpenStack's governance, so if this change were enforced (again) it would disrupt their contributors' workflows as well. -- Jeremy Stanley
I agree with Jeremy here. We know that enforcing doesn't really work well, and that's why we are trying to educate more :) -- Slawek Kaplonski Principal Software Engineer Red Hat
Sven, bare rechecks can't be disabled, because it's hard to check whether a meaningful reason is provided. Enforcing that a reason be specified will just lead to comments like "recheck failed" or "recheck lets check" etc.

On Thu, Mar 21, 2024 at 7:10 PM Sławek Kapłoński <skaplons@redhat.com> wrote:
Hi,
On Thursday, 21 March 2024 at 16:23:21 CET, Jeremy Stanley wrote:
On 2024-03-21 15:56:42 +0100 (+0100), Sven Kieske wrote: [...]
There must be a specific reason why bare rechecks are allowed at all? Why don't we simply enforce that there always must be a reason given?
Of course we can't enforce a meaningful reason being stated, but this is already the case now, so it would not get worse if we just disabled the possibility for bare rechecks, no? [...]
There was a time when we did exactly that, it lasted several years and the end result did not yield any measurable improvement in data quality. In fact, at one point we got restrictive enough to require bug numbers and the outcome was that people either made up nonexistent bug numbers or just put in any old bug they knew the number for regardless of whether it was related to the failure.
Yes it's been a while so I can't say for certain that the results would be the same if we tried again, but I don't have a good reason to believe it would turn out any different. Also, bear in mind, the pipeline trigger patterns apply to the entire Zuul tenant used by the OpenStack project, which is currently shared by any other projects outside OpenStack's governance, so if this change were enforced (again) it would disrupt their contributors' workflows as well. -- Jeremy Stanley
I agree with Jeremy here. We know that enforcing doesn't really work well, and that's why we are trying to educate more :)
-- Slawek Kaplonski Principal Software Engineer Red Hat
-- Regards, Maksim Malchuk
I'm not aware of all the things on the TC's plate right now, but this feels to me like something we should not be spending time on. The contributors that do most of the work day to day already know this policy. For new contributors, we try to tell them about it when we see bare rechecks, but this always feels like we are preaching to the choir.

I think it would be better to just stop tracking this, and I don't think enforcing this in code is a good thing either. People that don't care will just work around it, so unless we are going to soft-ban an account for a few days or something like that, I don't think it will have much impact.

I don't think we need to take the ban hammer out to people that do bare rechecks, but after a few years of advertising this policy, this now feels more like spam to me than something that will actually have a positive effect.

Maybe I'm just being cynical, but if we make rechecking hard, people will just work around it by making a trivial change to the patch or hitting the rebase button in the UI instead, and get the same effect...

That's my perspective anyway, but I don't think this helps our community be more welcoming or enjoyable to work with. CI is a shared resource we should not squander, but this topic just drains my energy when I see it come up in team meetings or on the mailing list.

On Thu, 2024-03-21 at 20:39 +0300, Maksim Malchuk wrote:
Sven, bare rechecks can't be disabled, because it's hard to check whether a meaningful reason is provided. Enforcing that a reason be specified will just lead to comments like "recheck failed" or "recheck lets check" etc.
On Thu, Mar 21, 2024 at 7:10 PM Sławek Kapłoński <skaplons@redhat.com> wrote:
Hi,
On Thursday, 21 March 2024 at 16:23:21 CET, Jeremy Stanley wrote:
On 2024-03-21 15:56:42 +0100 (+0100), Sven Kieske wrote: [...]
There must be a specific reason why bare rechecks are allowed at all? Why don't we simply enforce that there always must be a reason given?
Of course we can't enforce a meaningful reason being stated, but this is already the case now, so it would not get worse if we just disabled the possibility for bare rechecks, no? [...]
There was a time when we did exactly that, it lasted several years and the end result did not yield any measurable improvement in data quality. In fact, at one point we got restrictive enough to require bug numbers and the outcome was that people either made up nonexistent bug numbers or just put in any old bug they knew the number for regardless of whether it was related to the failure.
Yes it's been a while so I can't say for certain that the results would be the same if we tried again, but I don't have a good reason to believe it would turn out any different. Also, bear in mind, the pipeline trigger patterns apply to the entire Zuul tenant used by the OpenStack project, which is currently shared by any other projects outside OpenStack's governance, so if this change were enforced (again) it would disrupt their contributors' workflows as well. -- Jeremy Stanley
I agree with Jeremy here. We know that enforcing doesn't really work well, and that's why we are trying to educate more :)
-- Slawek Kaplonski Principal Software Engineer Red Hat
I think it would be better to just stop tracking this, and I don't think enforcing this in code is a good thing either. People that don't care will just work around it, so unless we are going to soft-ban an account for a few days or something like that, I don't think it will have much impact.
I think maybe you’re focusing on the tracking of the reason and less on the overall goal. The reason to actually obsess over why people are rechecking, and also to ask them to provide a reason, is not really so we can correlate reasons to fixes (at least IMHO). As Jeremy has said, that hasn’t borne fruit in the past, despite us very much wishing it would. To me, the point of doing this exercise is to strongly encourage people to look at the cause of the failure _at_all_. Meaning, open the logs, check the test reports, and at least synthesize some “reason” that indicates that they even looked. That has the benefit of easing people into figuring out how to debug CI issues, and caring about the high failure rates. Maybe while they’re doing that they’ll see a traceback that looks fishy, or actually find something that is out of step with reality. It doesn’t always of course, but I think we have in the past gotten stuck in situations where people just recheck if they get a -1 from zuul and just assume that it’s not their fault. That collective dis-ownership of the problem is a disease that leads us to a completely non-functional gate. I’ve caught people multiple times rechecking failures that clearly show a test they added, or a test their code changes, failing the same way over and over. Even in less-obvious situations, much can be gained by at least exposing people to the causes instead of just having them assume someone else will fix it. So, per the above, I wholeheartedly disagree that this is a useless exercise. Since we started doing this, I’ve definitely noticed (admittedly, anecdotally) more collective awareness of the kinds of issues that cause us trouble. I really hope we don’t “stop tracking this” for that reason. —-Dan
Hi,

On Thursday, 21 March 2024 at 20:19:18 CET, smooney@redhat.com wrote:
I'm not aware of all the things on the TC's plate right now, but this feels to me like something we should not be spending time on. The contributors that do most of the work day to day already know this policy. For new contributors, we try to tell them about it when we see bare rechecks, but this always feels like we are preaching to the choir.
I think it would be better to just stop tracking this, and I don't think enforcing this in code is a good thing either. People that don't care will just work around it, so unless we are going to soft-ban an account for a few days or something like that, I don't think it will have much impact.
I don't think we need to take the ban hammer out to people that do bare rechecks, but after a few years of advertising this policy, this now feels more like spam to me than something that will actually have a positive effect.
I just want to explain one thing here. It was never my intention to ban anyone or to enforce anything. My email is only to ask people to maybe try to improve those recheck comments a bit. But it's totally fine if people in some project will not do that at all. They still can do bare rechecks if they want to.
-- Slawek Kaplonski Principal Software Engineer Red Hat
On Fri, 2024-03-22 at 13:18 +0100, Sławek Kapłoński wrote:
Hi,
On Thursday, 21 March 2024 at 20:19:18 CET, smooney@redhat.com wrote:
I'm not aware of all the things on the TC's plate right now, but this feels to me like something we should not be spending time on. The contributors that do most of the work day to day already know this policy. For new contributors, we try to tell them about it when we see bare rechecks, but this always feels like we are preaching to the choir.
I think it would be better to just stop tracking this, and I don't think enforcing this in code is a good thing either. People that don't care will just work around it, so unless we are going to soft-ban an account for a few days or something like that, I don't think it will have much impact.
I don't think we need to take the ban hammer out to people that do bare rechecks, but after a few years of advertising this policy, this now feels more like spam to me than something that will actually have a positive effect.
I just want to explain one thing here. It was never my intention to ban anyone or to enforce anything. My email is only to ask people to maybe try to improve those recheck comments a bit. But it's totally fine if people in some project will not do that at all. They still can do bare rechecks if they want to.
Well, just looping back to Dan's reply to my previous email: it seems like the intent this time is to understand why rechecks are being done, not to stop people rechecking.

The irritation I was expressing was because the previous time we asked people not to recheck without a reason, it was not implemented as asking people to try and see if they can fix the underlying problem. Repeating

* STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-tes...

in the gate status section of our meeting every week, and not actually trying to understand the reasons people were giving for a recheck, was unhelpful and frankly quite annoying.

For various reasons, for the last 18 months my upstream time has been severely reduced and has almost exclusively been code review and gate fixes. My perception of the previous incarnation of this effort in 2022 was that the focus was not on actually fixing the problems, but on tracking the rechecks and reducing them to save CI resources rather than improving CI stability.

It sounds like the intent of the new effort is to actually fix the issues by understanding why people are rechecking, and discouraging blind rechecks is just a side effect to ensure we have good data to take action on.

One of the things that was identified in the list was kernel panics in the guest. One way to reduce that: if you have a project on this list https://codesearch.opendev.org/?q=cirros-0.5.2&i=nope&literal=nope&files=&excludeFiles=&repos= please move to cirros 6, or at least 0.5.3. Cirros 0.5.2 has a known kernel bug that causes random guest panics, which is fixed in both 0.5.3 and the 6.x series. I'm aware of the irony of a nova core saying that when nova has one reference to a 0.5.2 image, but that one is for arm and does not (that we have seen) hit the issue that's in the x86 one; still, it's on my todo list to fix.
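As a local complement to the codesearch link, a project could check its own checkout for references to the affected image with something like the sketch below. The function name and the choice of skipped directories are assumptions for illustration; only the `cirros-0.5.2` search string comes from the message above.

```python
import os

OLD_IMAGE = "cirros-0.5.2"  # image version named in the message above


def find_references(root, needle=OLD_IMAGE):
    """Yield (path, line_number, line) for every text line containing needle."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune VCS metadata and tox environments in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in {".git", ".tox"}]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file, e.g. a broken symlink
```

Running `for hit in find_references("."): print(hit)` from the top of a repository lists the job definitions or devstack settings that still name the old image, which are the places to bump to 0.5.3 or a 6.x release.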
I'm all for using recheck reasons as a data source to try and identify and fix CI issues, but when I saw this in our team meeting this week it was very triggering, as we had previously agreed as a team to stop doing that:

* please avoid bare rechecks (bauzas, 16:12:14)
* ACTION: bauzas to tell about the bare rechecks every week in our meeting (bauzas, 16:15:26)

I'll chat to bauzas about how we can phrase this reminder to make it clear that the focus is not on stopping blind rechecks; it's about understanding why a recheck was required and fixing that issue, and providing a reason is the minimal first step in that process.
The irritation I was expressing was because the previous time we asked people not to recheck without a reason, it was not implemented as asking people to try and see if they can fix the underlying problem.
Repeating * STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-tes...
in the gate status section of our meeting every week, and not actually trying to understand the reasons people were giving for a recheck, was unhelpful and frankly quite annoying.
That document has lots of good examples and suggestions for exactly what we’re trying to do: get people to examine the failures. The most common thing I hear when I confront people about why they’re rechecking is that they “don’t know how to debug this stuff”. So I think pointing them at a document that gives them pointers and places to start is pretty good.
but when I saw this in our team meeting this week it was very triggering, as we had previously agreed as a team to stop doing that
* please avoid bare rechecks (bauzas, 16:12:14) * ACTION: bauzas to tell about the bare rechecks every week in our meeting (bauzas, 16:15:26)
I'll chat to bauzas about how we can phrase this reminder to make it clear that the focus is not on stopping blind rechecks; it's about understanding why a recheck was required and fixing that issue, and providing a reason is the minimal first step in that process.
It’s unfortunate that you’re triggered by a reminder, but it’s also like the least impactful thing in the meeting. It gets said, and then immediately we move on to the next thing. Could you maybe just read it with the new lens of understanding where it's coming from from now on? —-Dan
On Fri, 2024-03-22 at 07:02 -0700, Dan Smith wrote:
The irritation I was expressing was because the previous time we asked people not to recheck without a reason, it was not implemented as asking people to try and see if they can fix the underlying problem.
Repeating * STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-tes...
in the gate status section of our meeting every week, and not actually trying to understand the reasons people were giving for a recheck, was unhelpful and frankly quite annoying.
That document has lots of good examples and suggestions for exactly what we’re trying to do: get people to examine the failures. The most common thing I hear when I confront people about why they’re rechecking is that they “don’t know how to debug this stuff”. So I think pointing them at a document that gives them pointers and places to start is pretty good.

Yes it does, and I agree that including it is good.

but when I saw this in our team meeting this week it was very triggering, as we had previously agreed as a team to stop doing that
* please avoid bare rechecks (bauzas, 16:12:14) * ACTION: bauzas to tell about the bare rechecks every week in our meeting (bauzas, 16:15:26)
I'll chat to bauzas about how we can phrase this reminder to make it clear that the focus is not on stopping blind rechecks; it's about understanding why a recheck was required and fixing that issue, and providing a reason is the minimal first step in that process.
It’s unfortunate that you’re triggered by a reminder, but it’s also like the least impactful thing in the meeting. It gets said, and then immediately we move on to the next thing. Could you maybe just read it with the new lens of understanding where it's coming from from now on?

Sure, although the immediately moving on is part of what is triggering. In our gate status section we talk about the active issues that are directly breaking/blocking things. I kind of wish that, instead of just moving on, we had something like a subteam update on (hopefully) the active progress that is being made on addressing the most common problems.
**this is a very nova specific view point so it may not apply to others** instead of #topic Gate status #link check queue gate status http://status.openstack.org/elastic-recheck/index.html #link 3rd party CI status (not so dead) http://ciwatch.mmedvede.net/project?project=nova # tell peopel not to do blind recheck with link and move on we did somethin like #topic Gate status (reminder read before rechecking https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-tes...) #link check queue gate status http://status.openstack.org/elastic-recheck/index.html #link 3rd party CI status (not so dead) http://ciwatch.mmedvede.net/project?project=nova # talks about most common recheck reason and how we can address it or general status of effort this really feall liek it shoudl be a sig or pop up team to improve the ci helath and we shoudl treat it in the meaing like the Stable branch status. im not agianst talking about it in the meeting, but i would prefer if we tried to engagne in actionable converstaions about this instead of just a reminder. i was hoping to be able to spend more time purly upstream in dalmaiation. unfortunately i likely will have to spend less tiem instead. that means that more then ever i want to try and focus on the activities that will have the largest impact such as improving ci health. The triggering part of this email thread and the team meeitng was it started with the table of rechecks vs bare recheck not the analasys fo the recheck reasosn. if we just take the first sub catagory i have some feedback on how to imporve them that we coudl try and capture in https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-tes... CI/Testing Issues: Timeouts in various tests (e.g., tempest tests, tox tests). 
this is somethmes just a slow node or intermint package cloning issue this is genrally less actioanable but when this happens i woudl recommend you note which ci provider the job executed on if we see the timeout only on ovh-whatever then we can reach out to the mainteiners of that provider to see if its a problem they can fix. sometimes this is a issue in the job where its gitclonign somethign that should be listed as a required project and precloned/preapred by zuul. for example i know that the neutorn jobs in the past were cloning the ovs repo adn compiling it form souces a number of years ago in the ci job. i saw this failure once for networking-ovs-dpdk (now unmainated) so i reached out to the infra team to add ovs and dpdk https://github.com/openstack/project-config/commit/3f41d6be143a98d2848c42f9a... to the zuul tenenat and i spoke to the neutron team to let them knwo that this was now aviable but i would not ahve time to update there jobs to use zuul to clone the repo the neutron-functional job still build ovs form git by default https://github.com/openstack/neutron/blob/master/zuul.d/base.yaml#L41 if that has ever timed out or failed to clone that can entirly be eliminated by adding openvswitch/ovs to required project and updateing OVS_REPO to the zuul cloned repo path which is something like ~/src/github.com/openvswitch/ovs there is a standard way to look this up in teh zuul inventory. Infrastructure failures impacting CI. This is fortunetly quite rare the infra team do a very good jobs at avoiding downtime. mostly this only happens when they need to do a do a gerrtit/zuul restart which htey annoch ahead of tiem or if there is a provider issue which is out of our contol. Kernel panics or guest panics during testing. this as i noted previsouly so partly related to using old guest images with know buggy kernels and sometime qemu bugs. this is the thing we have the most difficutly fixing in the nova ci because it largely not issues in our code. 
But we have tried experimenting with different images to try to mitigate this. I even went as far as starting to create a replacement guest image based on Alpine https://review.opendev.org/c/openstack/diskimage-builder/+/755413/4 .

To be clear, this is not a cirros problem, i.e. it is not that cirros is fundamentally broken. cirros uses the unmodified Ubuntu kernel, so it is a good proxy for real guest workloads. Moving cirros to the latest upstream LTS kernel might help, or even to the latest Ubuntu kernel, not the latest Ubuntu LTS one.

I'm hoping that when we move to Ubuntu 24.04 some of the QEMU bugs go away and that this also helps. Again, this is not a slight on Ubuntu; we are just not using the latest release of QEMU in our jobs, so I'm hoping that when we get a newer version it is less likely to have bugs. I can dream, right? :) The size of our guests and the number of boots we do in our CI is, I think, why we might hit some of these issues that the QEMU/kernel devs may not see in their own testing.

Dependency-related failures (merged or updated).

I think there has actually been a Zuul change in behavior here that has regressed our usage. What I have observed is that Zuul now seems to abort jobs for later patches in the series if one of the earlier patches fails, and then reports Verified -1. It may also report the failure of a parent patch after the fact; I have not dug into it. That means that we now need to recheck those later patches in the series, whereas before we did not: we could recheck the base, and when it merged the gate jobs for the later patches could progress.

I'm not explaining this very well, but there has been a subtle change in behavior during Caracal that I do not recall seeing before then. So perhaps we should see if a recent Zuul release has changed something in this regard, and perhaps we can configure the old behavior, which required fewer rechecks.

Unexpected failures or intermittent issues.
Yeah, so this usually means there is a bug: either a test leaking state or an incorrectly written test case.

PSA: if you are writing a tempest test to attach/detach a port or volume from a nova instance, and you are checking neutron or cinder to see if it's done, your test is wrong. You should be checking nova. We release the port/volume before we commit the DB update on our side, so that we only do that if the external service did not have an error.

I'm raising this as an example because we know of at least one tempest test case that fails intermittently because of this exact race. I tried to fix that quickly https://review.opendev.org/c/openstack/tempest/+/905130 but have not had time to actually dig into why that did not work, and if anyone from the tempest team can take that over, please do. It's probably something trivial, like verification not being configured, that you will see and I have missed.

Anyway, that was very long, but if this was the type of conversation we were having in the nova team meeting or a pop-up team setting, I would be happy to engage.
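The recommendation above, to note which CI provider a timed-out job ran on, could be turned into a small tally. The following is only a sketch: the record layout and the provider names are made up for illustration, and real data would have to come from the Zuul build results.

```python
from collections import Counter

# Hypothetical failure records: (job name, CI provider, failure kind).
# The provider names are placeholders, not real node providers.
FAILURES = [
    ("tempest-slow-py3", "provider-a", "timeout"),
    ("nova-next", "provider-a", "timeout"),
    ("neutron-functional-with-uwsgi", "provider-b", "test failure"),
    ("tempest-slow-py3", "provider-a", "timeout"),
    ("nova-next", "provider-c", "timeout"),
]


def timeouts_by_provider(failures):
    """Count only the timeout failures, grouped by CI provider."""
    return Counter(
        provider for _job, provider, kind in failures if kind == "timeout"
    )


if __name__ == "__main__":
    for provider, count in timeouts_by_provider(FAILURES).most_common():
        print(f"{provider}: {count} timeout(s)")
```

If one provider dominates the tally, that is the signal to contact its maintainers; an even spread points back at the job itself.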
--Dan
On Fri, 22 Mar 2024 at 14:28, <smooney@redhat.com> wrote:

On Fri, 2024-03-22 at 13:18 +0100, Sławek Kapłoński wrote:

Hi,

On Thursday, 21 March 2024 20:19:18 CET smooney@redhat.com wrote:

I'm not aware of all the things on the TC's plate right now, but this feels to me like something we should not be spending time on. The contributors that do most of the work day to day already know this policy.

For new contributors we try to tell them about it when we see bare rechecks, but this always feels like we are preaching to the choir.

I think it would be better to just stop tracking this, and I don't think enforcing this in code is a good thing either. People that don't care will just work around it, so unless we are going to soft-ban an account for a few days or something like that I don't think it will have much impact.

I don't think we need to take the ban hammer out to people that do bare rechecks, but after a few years of advertising this policy this now feels more like spam to me than something that will actually have a positive effect.

I just want to explain one thing here. It was never my intention to ban anyone or to enforce anything. My email is only to ask people to maybe try to improve those recheck comments a bit. But it's totally fine if people in some projects will not do that at all. They can still do bare rechecks if they want to.
Well, just looping back to Dan's reply to my previous email.

It seems like the intent this time is to understand why rechecks are being done, not to stop people rechecking.

The irritation I was expressing was because the previous time we asked people not to recheck without a reason, it was not implemented as asking people to try to see if they can fix the underlying problem.

Repeating

* STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-tes...

in the gate status section of our meeting every week, and not actually trying to understand the reasons people were giving for a recheck, was unhelpful and frankly quite annoying. For various reasons, over the last 18 months my upstream time has been severely reduced and has almost exclusively been code review and gate fixes. My perception of the previous incarnation of this effort in 2022 was that the focus was not on actually fixing the problems but on tracking the rechecks and reducing them to save CI resources, rather than improving CI stability.

It sounds like the intent of the new effort is to actually fix the issues by understanding why people are rechecking, and discouraging blind rechecks is just a side effect to ensure we have good data to take action on.
One of the things that was identified in the list was kernel panics in the guest. One way to reduce that: if you have a project on this list

https://codesearch.opendev.org/?q=cirros-0.5.2&i=nope&literal=nope&files=&excludeFiles=&repos=

please move to cirros 6 or at least 0.5.3.

cirros 0.5.2 has a known kernel bug that causes random guest panics, which is fixed in both 0.5.3 and the 6.x series.

I am aware of the irony of a nova core saying that when nova has one reference to a 0.5.2 image, but that one is for ARM and does not (that we have seen) hit the issue that's in the x86 one. It's on my todo list to fix.
I'm all for using recheck reasons as a data source to try to identify and fix CI issues,

but when I saw this in our team meeting this week it was very triggering, as we had previously agreed as a team to stop doing that:

* please avoid bare rechecks (bauzas, 16:12:14)
* ACTION: bauzas to tell about the bare rechecks every week in our meeting (bauzas, 16:15:26)

I'll chat to bauzas about how we can phrase this reminder to make it clearer that the focus is not on stopping blind rechecks; it's about understanding why a recheck was required and fixing that issue, and providing a reason is the minimal first step in that process.
I don't have the exact figures, and with everything being logged I'm pretty sure we could eventually find out when and how, but I do remember that before I started to provide this reminder, I explained the reasons behind it. FWIW, the whole nova meeting is a collection of reminders (collecting items for the PTG, triaging bugs, incentivizing review priorities), so I thought that this other reminder wasn't controversial, and I never heard anyone complaining about it. Then I stopped providing this reminder for the exact reason that it worked: our bare recheck numbers were dropping.

Now, the TC is pivoting a bit and asking the project leaders to ask the contributors to give better strings for their rechecks. I don't really see that as controversial either, and I'm open to discussing it on the right medium, which is the nova meeting.

-Sylvain
Maybe I'm just being cynical, but if we make rechecking hard, people will work around it by making a trivial change to the patch or hitting the rebase button in the UI instead, and get the same effect...

That's my perspective anyway, but I don't think this helps our community be more welcoming or enjoyable to work with. CI is a shared resource we should not squander, but this topic just drains my energy when I see it come up in a team meeting or on the mailing list.
On Thu, 2024-03-21 at 20:39 +0300, Maksim Malchuk wrote:
Sven, bare rechecks can't be disabled, because it's hard to check whether a meaningful reason is provided. Enforcing that a reason be specified will lead to commands like "recheck failed" or "recheck lets check" etc.
On Thu, Mar 21, 2024 at 7:10 PM Sławek Kapłoński <skaplons@redhat.com> wrote:

Hi,

On Thursday, 21 March 2024 16:23:21 CET Jeremy Stanley wrote:

On 2024-03-21 15:56:42 +0100 (+0100), Sven Kieske wrote:
[...]
> There must be a specific reason why bare rechecks are allowed at all?
> Why don't we simply enforce that there always must be a reason given?
>
> Of course we can't enforce a meaningful reason being stated, but this
> is already the case now, so it would not get worse if we just disabled
> the possibility for bare rechecks, no?
[...]

There was a time when we did exactly that, it lasted several years and the end result did not yield any measurable improvement in data quality. In fact, at one point we got restrictive enough to require bug numbers and the outcome was that people either made up nonexistent bug numbers or just put in any old bug they knew the number for regardless of whether it was related to the failure.

Yes it's been a while so I can't say for certain that the results would be the same if we tried again, but I don't have a good reason to believe it would turn out any different. Also, bear in mind, the pipeline trigger patterns apply to the entire Zuul tenant used by the OpenStack project, which is currently shared by many other projects outside OpenStack's governance, so if this change were enforced (again) it would disrupt their contributors' workflows as well.

--
Jeremy Stanley
I agree with Jeremy here. We know that enforcing doesn't really work well, and that's why we are trying to educate more :)

--
Slawek Kaplonski
Principal Software Engineer
Red Hat
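To illustrate why a pattern check alone cannot enforce a meaningful reason, here is a rough sketch of how recheck comments could be sorted into bare and non-bare ones. The regular expressions are an assumption for illustration only; they are not the trigger patterns actually configured in Zuul, and the stats script mentioned earlier in the thread may well work differently.

```python
import re

# Assumed patterns, for illustration only (not the real Zuul triggers).
RECHECK = re.compile(r"^\s*recheck\b", re.IGNORECASE)
BARE_RECHECK = re.compile(r"^\s*recheck\s*$", re.IGNORECASE)


def classify(comment):
    """Return 'bare', 'with reason' or 'not a recheck' for a comment."""
    if not RECHECK.match(comment):
        return "not a recheck"
    if BARE_RECHECK.match(comment):
        return "bare"
    return "with reason"


def bare_percentage(comments):
    """Percentage of recheck comments that give no reason at all."""
    kinds = [classify(c) for c in comments]
    rechecks = [k for k in kinds if k != "not a recheck"]
    if not rechecks:
        return 0.0
    return round(100.0 * rechecks.count("bare") / len(rechecks), 2)
```

Note that "recheck failed" and "recheck lets check" both count as having a reason here, which is exactly the limitation pointed out above.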
On Fri, 22 Mar 2024 at 14:37, Sylvain Bauza <sbauza@redhat.com> wrote:
[...]
Great! Some opinionated discussion and passionate responses, good to see that again.

I have one suggestion which, I think, would provide a pretty good indication of how useful an effort this is, given the time it consumes.

How about we start recording somewhere, even on a wiki page (oh wait, I think the wiki is dead, well, somewhere anyways), in a living document, the bugs that got fixed due to non-mute recheck messages? I don't care if you write a one-liner or a descriptive paragraph, but at a minimum a date, review link and some kind of explanation of how the recheck message got you to fix this bug, and obviously the message itself. Then those who like to keep the reminder in the team meetings have something to celebrate: "Look, this actually works!" But if the page is still empty in a month's or two months' time, I guess we can all agree to focus our efforts on something more constructive and stop pestering people about it. Otherwise we might as well create a Gerrit hook that recognizes the empty "recheck" and asks <insert your favourite LLM here> to supplement it with a likely reason, and get on with it.

- jokke
On Fri, 22 Mar 2024 at 15:04, Erno Kuvaja <ekuvaja@redhat.com> wrote:
[...]
Correction to the above: "date, review link and" should read "date, review link to the bugfix and".
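One possible shape for an entry in such a living document, with the review link pointing at the bugfix as per the correction. This is only a sketch: the class, the field names and the output format are hypothetical, and the example values (including the URL) are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FixedBugEntry:
    """One bug that got fixed thanks to a non-bare recheck message."""

    fixed_on: date
    bugfix_review: str    # link to the review that fixed the bug
    recheck_message: str  # the recheck comment as posted
    explanation: str      # how the message led to the fix


def format_entry(entry):
    """Render an entry as a one-liner for the living document."""
    return " | ".join(
        [
            entry.fixed_on.isoformat(),
            entry.bugfix_review,
            entry.recheck_message,
            entry.explanation,
        ]
    )


if __name__ == "__main__":
    example = FixedBugEntry(
        fixed_on=date(2024, 3, 22),
        bugfix_review="https://review.example.org/c/placeholder",
        recheck_message="recheck job x timed out cloning a repo",
        explanation="Pointed at a repo that should be pre-cloned by Zuul.",
    )
    print(format_entry(example))
```

A flat one-liner per entry keeps the bar for adding to the page low, which matches the "one-liner or descriptive paragraph" spirit of the proposal.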
On 2024-03-22 15:04:47 +0000 (+0000), Erno Kuvaja wrote: [...]
How about we start recording somewhere, even on a wiki page (oh wait, I think the wiki is dead, well, somewhere anyways), in a living document, the bugs that got fixed due to non-mute recheck messages? I don't care if you write a one-liner or a descriptive paragraph, but at a minimum a date, review link and some kind of explanation of how the recheck message got you to fix this bug, and obviously the message itself. Then those who like to keep the reminder in the team meetings have something to celebrate: "Look, this actually works!" But if the page is still empty in a month's or two months' time, I guess we can all agree to focus our efforts on something more constructive and stop pestering people about it. Otherwise we might as well create a Gerrit hook that recognizes the empty "recheck" and asks <insert your favourite LLM here> to supplement it with a likely reason, and get on with it.
I think this is missing the point. What we *want* is for people to look at the failure details and try to understand *why* a job failed. If they do that, they're likely to find that they made an actual mistake in their change (it happens! shocking, I know, but not every change is perfect as pushed). It's also possible they'll find, when looking, that the cause of the failure is something they know how to fix, and they'll push up a patch for that. If nothing else, they might actually tell someone about the problem they ran into, and that person may know how to address it. In the process, they'll also gain an increased familiarity with how changes are being tested which may assist them with reaching a better outcome in the future.

If we tell people, "don't recheck unless you know your change didn't cause the problem and you can't figure out what did," then a lot of them will respond with "oh I looked" even though they clearly didn't. If we then say, "okay so tell me what the error was" they're more compelled to at least actually take a look and not make assumptions based on no evidence whatsoever. So basically we're skipping the first thing that we actually want them to do but they usually won't by asking them to do something else which has an increased chance of getting them to do the thing we want. Essentially, this is taking a pedagogical approach to the underlying problem.

Where I do agree with you is that a good outcome is one in which whatever behavior we attempt to incentivize leads to more bugs being fixed, tests becoming more reliable, and overall code quality improving. What your solution misses is that most of the bugs which are likely to get fixed because of this approach will ideally be fixed by the person who otherwise would have left a recheck comment, and so won't result in any recheck comment at all (and if all goes well, will lead to fewer recheck comments across all our changes).

--
Jeremy Stanley
On Fri, 22 Mar 2024 at 17:57, Jeremy Stanley <fungi@yuggoth.org> wrote:
I think this is missing the point. What we *want* is for people to look at the failure details and try to understand *why* a job failed. If they do that, they're likely to find that they made an actual mistake in their change (it happens! shocking, I know, but not every change is perfect as pushed). It's also possible they'll find, when looking, that the cause of the failure is something they know how to fix, and they'll push up a patch for that. If nothing else, they might actually tell someone about the problem they ran into, and that person may know how to address it. In the process, they'll also gain an increased familiarity with how changes are being tested which may assist them with reaching a better outcome in the future.
I really might be missing the point, as this is what we have tried for 10 years now and it has not worked that well so far. BUT reading Slawek's and Dan's responses above, they are saying that the goal is exactly not this. That was, however, the goal of demanding bug numbers at one point in the past, and as already explained that didn't work out so well either.
If we tell people, "don't recheck unless you know your change didn't cause the problem and you can't figure out what did," then a lot of them will respond with "oh I looked" even though they clearly didn't. If we then say, "okay so tell me what the error was" they're more compelled to at least actually take a look and not make assumptions based on no evidence whatsoever. So basically we're skipping the first thing that we actually want them to do but they usually won't by asking them to do something else which has an increased chance of getting them to do the thing we want. Essentially, this is taking a pedagogical approach to the underlying problem.
And this is why I proposed the living doc. The people who think it makes a difference already do this, and the rest might as well just use that LLM hook as long as they are not convinced that what they write after the recheck has any substantial meaning. I'm not saying that the goal here isn't honourable, just that this is probably the 5th or so time around the loop of this discussion over the past ten years, and so far it has not produced the result we'd like to see, so perhaps we could try something different? Unfortunately that "and you can't figure out what did" is a pretty low bar. Sorry, I see myself still doing this as well: "recheck" #3 might be just bare when originally job x failed with, say, a timeout; on the first recheck it passed but y failed on something not cleaning up a neutron network properly; and on the second recheck both of the previous ones passed but job z timed out this time. I've been watching that same pattern for two weeks (not literally these past two weeks, but as a general observation) across multiple patches, and I still have no idea what causes those failures. Honestly, I thought I had used that "unrelated failure" more recently than what the statistics say in total :P
Where I do agree with you is that a good outcome is one in which whatever behavior we attempt to incentivize leads to more bugs being fixed, tests becoming more reliable, and overall code quality improving. What your solution misses is that most of the bugs which are likely to get fixed because of this approach will ideally be fixed by the person who otherwise would have left a recheck comment, and so won't result in any recheck comment at all (and if all goes well, will lead to fewer recheck comments across all our changes).
Perhaps a shout-out in Nova's weekly meeting about fixing some gate bug would be enough for some to take the extra time to figure out a common failure on job X. ;) Maybe writing a community goal out of it could do the trick?
-- Jeremy Stanley
participants (8)
- Dan Smith
- Erno Kuvaja
- Jeremy Stanley
- Maksim Malchuk
- smooney@redhat.com
- Sven Kieske
- Sylvain Bauza
- Sławek Kapłoński