upper-constraints (< $latest) considered harmful
o/

I'll jump straight into it. For those of you not in the loop, we're currently going through a situation where it looks like OSC will be held back due to a bug that the Nova gates highlighted. The actual details aren't hugely important; what is important is the fact that this bug was first included in the OSC 7.0.0 release, which was released on August 6th. For those of you who are better at mental calendar maths than I am (or who already looked at your calendar), you'll know this was over 6 weeks ago, well before the various feature freeze deadlines. And you might ask: why did we only spot this last week? That would be because the upper-constraints (henceforth referred to as u-c) bump failed [1] due to a different issue, meaning u-c kept us pinned at <= 6.6.1, which in turn meant every CI job *except* the ones running in the OSC gate itself kept testing the old release instead of the thing we were supposed to be releasing as part of Dalmatian. That only changed last week when 7.1.0 was released and the u-c bump for *that* merged [2], at which point people started shouting and hollering.

Now, there is one other important consideration to factor into this, namely that, due to a flaw in our tooling, we didn't end up cutting a release branch for OSC alongside the rest of the client-library deliverables. Had we done so, we'd have seen 7.1.0 much earlier than we actually did [3]. That 7.1.0 release fixed the initial issue that prevented us releasing 7.0.0. However, this doesn't change the fact that 7.0.0 came out 6 weeks ago and not once did we get to test it outside of OSC's gate, and what's worse, we (the SDK team) were totally oblivious to the fact. Had we known, we'd have cut a new release as soon as the fix [4] for that issue had merged (August 12th, which for the record is over 5 weeks ago).

This is not the first time this has happened to us with one of our own dependencies. We had a very similar issue in Bobcat (2023.2), where we cut an oslo.db 13.0.0 release very early in that cycle, followed by a number of subsequent releases, only to have none of them make their way into u-c due to issues with some services. We then got to the end of the cycle, realized this, and had to do a frenzied mega-revert to get us back to a pre-13.x state and allow us to release an oslo.db deliverable for Bobcat. It also happened with Castellan, whose 4.2.0 release sat waiting to merge for 4 months [4] before undergoing a similar mass revert.

Adding more fuel to this fire, it's not just a problem with our own dependencies either. In fact, the problem is probably worse there. Need I remind people that it took 5 releases, or 2 years, 8 months and change, to uncap SQLAlchemy 2.x (a contributing factor in the oslo.db issue, fwiw). We had to struggle to get Sphinx uncapped in Bobcat [5] and fight to not revert that cap. You currently can't build docs for many projects on Fedora hosts since we're capping, among many other things, the Pillow library. The list goes on.

Which brings me to my main points, of which there are three. Firstly:

* There needs to be a clear time limit for how long a u-c bump can remain unmerged, and a plan for what happens if affected projects have not resolved their issues within that time. There will always be exceptions - no one could rightly have asked services to switch to SQLAlchemy 2.x in a fortnight, and libraries can have legit bugs that justify a block - but those exceptions should be well-understood and, as with the bumps themselves, time-limited. A project that needs 4 releases to adapt to a crucial new library version is not a healthy project.

Secondly:

* Caps or u-c bump reverts should similarly be clearly time-limited. We should not be defaulting to e.g. re-capping Sphinx because your docs build has started to fail and then leaving that cap in place for months. Cap, fix the issue, and remove the cap ASAP.

And, finally:

* A failing u-c bump for an OpenStack deliverable should be treated with the highest priority 🚨🚨🚨 and should be something the corresponding team is made aware of immediately, so they can start coordinating a resolution. We should not have u-c patches languishing for weeks, let alone months, only for there to be a last-minute panic like this.

I have ideas on the above (which basically amount to more stick [7]) but, perhaps fortunately, this isn't something I can decide on my own. Nor, should I add, is it something I expect the understaffed and oversubscribed release team to be able to do themselves. Instead, I think it's something that the TC and community as a whole need to settle on. So as soon as we get Dalmatian out the door, let's do that.

Cheers,
Stephen

[1] https://review.opendev.org/c/openstack/requirements/+/925763
[2] https://review.opendev.org/c/openstack/requirements/+/928948
[3] https://review.opendev.org/c/openstack/releases/+/928838
[4] https://review.opendev.org/c/openstack/requirements/+/883141
[5] https://review.opendev.org/c/openstack/requirements/+/891694
[6] https://review.opendev.org/c/openstack/requirements/+/927102
[7] https://y.yarn.co/dbdc0d5b-04e6-4c39-9cd3-44c3be3729ab.mp4
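For readers less familiar with the mechanics: upper-constraints.txt pins every package to an exact version with a "name===version" line, so until a bump merges, every constrained CI job keeps installing the old release. A minimal, purely illustrative sketch of the kind of staleness check being argued for here (hypothetical tooling, not anything that exists in openstack/requirements today) might look like this in Python:

    # Hypothetical sketch: flag upper-constraints pins that lag behind PyPI.
    # Assumes the standard "name===version" format; environment markers
    # after ';' are ignored for simplicity. Naive: one PyPI request per pin.
    import json
    import urllib.request

    def parse_constraints(path):
        """Yield (package, pinned_version) pairs from an upper-constraints file."""
        with open(path) as f:
            for line in f:
                line = line.split(";")[0].strip()
                if not line or line.startswith("#") or "===" not in line:
                    continue
                name, _, version = line.partition("===")
                yield name.strip(), version.strip()

    def latest_on_pypi(name):
        """Return the latest version published on PyPI for a package."""
        with urllib.request.urlopen(f"https://pypi.org/pypi/{name}/json") as resp:
            return json.load(resp)["info"]["version"]

    if __name__ == "__main__":
        for name, pinned in parse_constraints("upper-constraints.txt"):
            latest = latest_on_pypi(name)
            if latest != pinned:
                print(f"{name}: pinned to {pinned}, latest on PyPI is {latest}")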
I agree with your sentiments, Stephen. Perhaps we could implement some sort of automation, sending email to the owners of failed bumps? But I don't know what would be the best way to go about this. Especially, where would we get the information from?
On Thu, 2024-09-19 at 10:08 +0200, Jiri Podivin wrote:
I agree with your sentiments Stephen. Perhaps we could implement some sort of automation, sending email to owners of failed bumps? But I don't know what would be the best way to go about this. Especially where would we get the information from?

we have release liaisons for all deliverables in the releases repo: https://github.com/openstack/releases/blob/master/data/release_liaisons.yaml
so we should have a direct mapping for each of the projects managed via the release automation: https://github.com/openstack/governance/blob/master/reference/projects.yaml
if we have a gap I'm sure that could be fixed; those are the data sources I would default to.
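To make the automation idea concrete, a rough sketch of looking up liaison contacts for a deliverable from that file could look like the following. The YAML schema assumed here (each project name keyed to a list of entries with "name" and "email" fields) and the raw-file URL are assumptions derived from the link above; check them against the real repository before relying on this.

    # Rough sketch: look up release liaison contacts for a project so a
    # failed u-c bump could be reported to them automatically.
    # ASSUMPTION: release_liaisons.yaml maps each project name to a list of
    # entries carrying "name" and "email" keys; verify against the real file.
    import urllib.request

    import yaml  # PyYAML

    LIAISONS_URL = (
        "https://raw.githubusercontent.com/openstack/releases/"
        "master/data/release_liaisons.yaml"
    )

    def load_liaisons(url=LIAISONS_URL):
        with urllib.request.urlopen(url) as resp:
            return yaml.safe_load(resp.read())

    def liaison_emails(project, liaisons):
        """Return the email addresses registered for a given project, if any."""
        return [e["email"] for e in liaisons.get(project, []) if "email" in e]

    if __name__ == "__main__":
        liaisons = load_liaisons()
        print(liaison_emails("nova", liaisons))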
Thanks Stephen for posting this.

There is clearly a gap in our processes -- while the release team has a well-established process [1] to watch release failures, catch issues and raise them with people who can fix them, the requirements team is mostly a single volunteer at this point, so it's no surprise failures can fall between the cracks. In most cases, requirements things just end up working without strong oversight, as people will raise issues when they can't get what they need out of it. But there are a few edge cases, like a failed u-c bump toward the end of the release cycle, where it can cascade into a big problem.

Based on my (some would say way too long) experience with the release team, it's not enough to automatically raise flags, like sending an email to the team liaison in case a certain job fails. You need to have a team regularly meeting, following up on those issues and making sure they are fixed. The team might not be the one fixing the issue, but they are the ones keeping track of it and making sure it's fixed.

Ideally that would mean staffing up the requirements team. Just looking at the requirements-core group will show that it's in dire need of fresh blood. Even if the release team has acted as a safety net for urgent requirements change reviews in the past, I'm reluctant to propose a merge between the requirements and release teams, as the reviews end up being pretty different even if the core job (herding cats) is the same.

But I agree it would be good to solve that gap once and for all because, as you noticed, it's not the first time it has bitten us.

Thierry
On 2024-09-19 15:37:03 +0200 (+0200), Thierry Carrez wrote: [...]
Even if the release team has acted as a safety net for urgent requirements change reviews in the past, I'm reluctant to propose a merge between the requirements and release teams as the reviews end up being pretty different, even if the core job (herding cats) is the same. [...]
Similarly the OpenStack TaCT SIG (mostly the OpenDev Sysadmins in a different set of hats) is granted approval rights on openstack/requirements changes, but we generally only exercise those in emergency situations. -- Jeremy Stanley
On 9/18/24 19:13, Stephen Finucane wrote:
o/
I'll jump straight into it. [...]

I very much agree with all you wrote. Though, a few remarks...
Some projects are still not on SQLAlchemy 2.x, like Zaqar, and I'm not sure about Trove (it switched to Alembic and seems OK with SQLA 2.x: no unit test errors that I could see).

From the release perspective, holding OSC back at 6.6.0 is OK for the moment, and it is my point of view that we should do it ASAP so the release can move forward. We can repair the 7.x branch later, hopefully before the final releases.

What you describe is a general problem that has existed since we began using pinned versions for all libs. It could help to run a (non-voting) job that wouldn't have this constraint. More generally, making sure nothing is broken with the latest components is always a good thing.

Cheers,
Thomas Goirand (zigo)
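One lightweight way to see what such an unconstrained, non-voting job would surface (purely illustrative, not an existing job definition): install the project's requirements without -c upper-constraints.txt and then report where the resulting environment diverges from the pins.

    # Illustrative sketch: after installing requirements *without*
    # -c upper-constraints.txt, report where the environment diverges from
    # the pinned versions. Not an existing CI job, just the idea.
    from importlib import metadata

    def parse_pins(path):
        """Return {package: pinned_version} from an upper-constraints file."""
        pins = {}
        with open(path) as f:
            for line in f:
                line = line.split(";")[0].strip()
                if line and not line.startswith("#") and "===" in line:
                    name, _, version = line.partition("===")
                    pins[name.strip().lower()] = version.strip()
        return pins

    if __name__ == "__main__":
        pins = parse_pins("upper-constraints.txt")
        for dist in metadata.distributions():
            name = dist.metadata["Name"].lower()
            if name in pins and dist.version != pins[name]:
                print(f"{name}: installed {dist.version}, u-c pins {pins[name]}")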
First of all, thank you, Stephen, for triggering this discussion (and, of course, for your hard work keeping things running).

On 9/19/24 02:13, Stephen Finucane wrote:
o/
(snip)
This is not the first time this has happened to us with one of our own dependencies. We had a very similar issue in Bobcat (2023.2) where we cut an oslo.db 13.0.0 release very early in that cycle, followed by a number of subsequent releases, only to have none of them end up making their way into u-c due to issues with some services. We then got to the end of the cycle, realized this, and had to do a frenzied mega revert to get us back to a pre-13.x state and allow us to release an oslo.db deliverable for Bobcat. It also happened with Castellan, whose 4.2.0 release sat waiting to merge for 4 months [4] before undergoing a similar mass revert.
I know we already have clear consensus about this, but I'd mention that we had the same pain again during the Caracal RC phase, too. The situation was worse than Bobcat, I'd say, because most of the problems detected around the Bobcat release, which were "resolved" by the massive revert, were left unfixed and we again had to deal with them at the very end of the release. This clearly illustrates how compatibility problems with the latest libs get ignored, IMHO.
Adding more fuel to this fire, it's not just a problem with our own dependencies either. In fact, the problem is probably worse there. Need I remind people that it took 5 releases or 2 years, 8 months and change to uncap SQLAlchemy 2.x (a contributing factor in the oslo.db issue, fwiw). We had to struggle to get Sphinx uncapped in Bobcat [5] and fight to not revert that cap. You currently can't build docs for many projects on Fedora hosts since we're capping, among many other things, the Pillow library. The list goes on.
Which brings me to my main points, of which there are three. Firstly:
* There needs to be a clear time limit for how long a u-c bump can remain unmerged and a plan for what happens if affected projects have not resolved their issues within that time. There will always be exceptions - no one could rightly have asked services to switch to SQLAlchemy 2.x in a fortnight, and libraries can have legit bugs that justify a block - but those exceptions should be well-understood and, as with the bumps themselves, time-limited. A project that needs 4 releases to adapt to a crucial new library version is not a healthy project.
+1

If we can establish a workflow to notify project teams of failures in the cross jobs relevant to them, we can set a hard limit and then make the job non-voting and push the u-c bump through forcefully, given that the whole workflow has sometimes been blocked by a very small number of inactive projects in the past. I may also propose a shorter limit for internal dependencies (dependencies maintained in OpenStack) than for external dependencies, to avoid problems with our own release process, but we can discuss the actual durations separately.
secondly:
* Caps or u-c bump reverts should similarly be clearly time-limited. We should not be defaulting to e.g re-capping Sphinx because your docs build has started to fail and then leaving that cap in place for months. Cap, fix the issue, and remove the cap asap.
+1
and, finally:
* A failing u-c bump for an openstack deliverable should be treated with the highest priority 🚨🚨🚨 and should be something that the corresponding team should be made aware of immediately so they can start coordinating a resolution. We should not have u-c patches languishing for weeks, let alone months, only for there to be a last minute panic like this.
+1

Making the project teams aware of cross-job failures in a timely manner is definitely the right first step. I'll admit that these jobs are not very visible to the project teams; I myself only became more aware of their status after multiple heat-related issues were caught at the last minute in the past. I know the requirements team doesn't have a lot of resources, so it would be very helpful if we could automate the process instead of asking someone to "report a bug for us".
Hi all,

I'm aware that this might be an unpopular opinion, but I'll air it anyway.

Part of the problem seems to be, as Stephen rightly flagged, that it's very easy to miss a u-c bump failing. Perhaps it would make sense to take a step back in our release process and stop the bot submitting the bumps. Let's put that back as a manual task for the release liaison/PTL as part of the new release of the component. That way the bump has a real-life owner who is invested in monitoring that it actually merges and that the new release gets adopted.

Automating all the boring things is a very popular approach, especially when we have a shrinking community. But when there is no one responsible for or invested in looking after the requirements, well, we see the result in the size of the requirements team and the interest in participating in it. If we PTLs and release liaisons took an active role in looking after our requirements again, I do believe we would have a healthier environment to work in and more eyes on the failures.

- Erno "jokke" Kuvaja
On Fri, 2024-09-20 at 13:27 +0100, Erno Kuvaja wrote:
Hi all,
I'm aware that this might be unpopular opinion but I'll air it anyways.
Part of the problem seems to be, like Stephen well flagged, that it's very easy to miss the u-c bump failing. Perhaps it would make sense to take a step back on our release process and stop the bot submitting the bumps. Lets put that back as a manual task for the release liaison/PTL as part of the new release of the component.
I think that would make sense for OpenStack deliverables, but for third-party deps the bot should still do this.
That way the bump has a real life owner who is invested to monitor that it actually merges and the new release gets adopted. Automating all the boring things is very popular approach, especially when we have shrinking community. But as there is no-one responsible or invested in to look after the requirements, well we see the interest to participate on the requirements team size.
Doing the u-c bump is actually pretty automated even without the bot: you can just use the tox env to generate the bump and write your commit message. That's all the bot does anyway.
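For illustration only (the tox environment Sean mentions is the supported path, and the real openstack/requirements tooling does more validation), the mechanical core of a single-package bump is just rewriting one pinned line:

    # Toy sketch of what a u-c bump boils down to: rewrite one
    # "name===version" line in upper-constraints.txt. The real tooling in
    # openstack/requirements handles markers, validation and more.
    import re
    import sys

    def bump_constraint(path, package, new_version):
        pattern = re.compile(rf"^{re.escape(package)}===\S+", re.IGNORECASE)
        with open(path) as f:
            lines = f.readlines()
        with open(path, "w") as f:
            for line in lines:
                if pattern.match(line):
                    # Preserve any environment marker after ';' on the line.
                    _, _, marker = line.partition(";")
                    line = f"{package}==={new_version}"
                    line += f";{marker}" if marker else "\n"
                f.write(line)

    if __name__ == "__main__":
        # e.g. python bump.py upper-constraints.txt python-openstackclient 7.1.0
        bump_constraint(sys.argv[1], sys.argv[2], sys.argv[3])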
If we PTLs and release Liaisons would take active role to look after our requirements again, I do believe we would have healthier environment to work on and more eyes on the failures.
I don't think we actually even need it to be the PTL or release liaison, though I agree they would make the most sense as the default; I think any core team member should be able to propose the bump.
- Erno "jokke" Kuvaja
On Fri, 2024-09-20 at 13:27 +0100, Erno Kuvaja wrote:
Hi all,
I'm aware that this might be unpopular opinion but I'll air it anyways.
Part of the problem seems to be, like Stephen well flagged, that it's very easy to miss the u-c bump failing. Perhaps it would make sense to take a step back on our release process and stop the bot submitting the bumps. Lets put that back as a manual task for the release liaison/PTL as part of the new release of the component. That way the bump has a real life owner who is invested to monitor that it actually merges and the new release gets adopted. Automating all the boring things is very popular approach, especially when we have shrinking community. But as there is no-one responsible or invested in to look after the requirements, well we see the interest to participate on the requirements team size.
I think this is the complete opposite of what we want to do :) Cutting the release is already done mostly manually (albeit with the help of tooling, and with the occasional milestone-derived releases being cut). Requiring teams to bump u-c manually would just make that process more laborious and drawn-out, which in turn would make it more likely that it won't happen (or, rather, will happen much closer to the end of the cycle, when it's much harder to resolve issues). And that's just for the OpenStack components. If we were to do this for all dependencies, what are the realistic chances that someone from e.g. the Manila team would decide to proactively bump Sphinx, absent a fire? All that would end up happening is that our package versions in upper-constraints would drift further and further from $latest, and the work required to correct this would simply pile up, waiting for the most inopportune moment to rear its head.

If I were to look for a drastic option, I'd actually go the other way and do more automation. Merge the upper-constraints patches as soon as they're generated - sans CI - and delegate to the individual projects to resolve the issues their CIs highlight. This is clearly not a sensible option either: while it would highlight things like deprecations and API changes much faster (and act as a forcing function towards finding resolutions), the sad reality is that the projects we consume do occasionally introduce bugs that may warrant a release being excluded.

Back to the main point, IMO there's nothing wrong with the u-c bump being automatically proposed. Rather, the issue as I see it is that when it fails, we have no way to know beyond checking manually or someone notifying us. The most obvious step for me is to get those projects in the loop sooner rather than later so that they can start working on a fix. If oslo.db X.Y causes Nova failures, both the Nova and Oslo teams should know so they can start coordinating a fix. If Sphinx N.M causes docs failures for Manila, the Manila team should know (it's not really the Sphinx team's problem unless we figure out it's a bug) so they can investigate. In both cases, those investigations should be time-constrained to ensure we don't get to the end of the cycle and end up in a panic.

Hope this helps,
Stephen
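Tying this back to the automation question earlier in the thread, a rough sketch of the missing notification step might look like the following. The Gerrit query (which label value marks a failed check, and how bot-proposed bumps are identified) is an assumption to be verified against the real review.opendev.org setup, and the liaison lookup would reuse the data sources Sean pointed at.

    # Rough sketch: list open openstack/requirements changes whose check
    # results failed, so their owners/liaisons could be notified.
    # ASSUMPTION: "label:Verified=-1" approximates "failed check"; the exact
    # query for bot-proposed u-c bumps needs verifying against production.
    import json
    import urllib.parse
    import urllib.request

    GERRIT = "https://review.opendev.org"
    QUERY = "project:openstack/requirements status:open label:Verified=-1"

    def failed_requirements_changes():
        url = f"{GERRIT}/changes/?q={urllib.parse.quote(QUERY)}&o=DETAILED_ACCOUNTS"
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode()
        # Gerrit prefixes JSON responses with ")]}'" to prevent XSSI.
        return json.loads(body.split("\n", 1)[1])

    if __name__ == "__main__":
        for change in failed_requirements_changes():
            owner = change.get("owner", {}).get("email", "<hidden>")
            print(f"{change['_number']}: {change['subject']} (owner: {owner})")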
participants (8)
- Erno Kuvaja
- Jeremy Stanley
- Jiri Podivin
- smooney@redhat.com
- Stephen Finucane
- Takashi Kajinami
- Thierry Carrez
- Thomas Goirand