I agree with your sentiments, Stephen. Perhaps we could implement some sort of automation, sending email to the owners of failed bumps? But I'm not sure what the best way to go about this would be. In particular, where would we get the information from? I've put a very rough sketch of the kind of thing I had in mind below Stephen's mail.
On Wed, Sep 18, 2024 at 7:14 PM Stephen Finucane <stephenfin@redhat.com> wrote:
o/
I'll jump straight into it. For those of you not in the loop, we're currently going through a situation where it looks like OSC will be held back due to a bug that the Nova gates highlighted. The actual details aren't hugely important, but what is important is the fact that this bug was first included in the OSC 7.0.0 release, which came out on August 6th. For those of you who are better at mental calendar maths than I am (or who have already looked at your calendar), you'll know this was over 6 weeks ago, well before the various feature freeze deadlines. So you might ask: why did we only spot this last week? That would be because the upper-constraints (henceforth referred to as u-c) bump failed [1] due to a different issue, meaning u-c kept us pinned at <= 6.6.1, which in turn meant every CI job *except* the ones running in the OSC gate itself kept testing the old release instead of the thing we were supposed to be releasing as part of Dalmatian. That only changed last week when 7.1.0 was released and the u-c bump for *that* merged [2], at which point people started shouting and hollering.
Now, there is one other important consideration to factor into this, namely that, due to a flaw in our tooling, we didn't end up cutting a release branch for OSC alongside the rest of the client-library deliverables. Had we done so, we'd have seen 7.1.0 much earlier than we actually did [3]. That 7.1.0 release fixed the initial issue, the one that had prevented the 7.0.0 u-c bump from merging. However, this doesn't change the fact that 7.0.0 came out 6 weeks ago and not once did we get to test it outside of OSC's own gate and, what's worse, we (the SDK team) were totally oblivious to that fact. Had it been known, we'd have cut a new release as soon as the fix [4] for that issue had merged (August 12th, which for the record is over 5 weeks ago).
This is not the first time this has happened to us with one of our own dependencies. We had a very similar issue in Bobcat (2023.2), where we cut an oslo.db 13.0.0 release very early in that cycle, followed by a number of subsequent releases, only to have none of them end up making their way into u-c due to issues with some services. We then got to the end of the cycle, realized this, and had to do a frenzied mega-revert to get us back to a pre-13.x state and allow us to release an oslo.db deliverable for Bobcat. It also happened with Castellan, whose 4.2.0 u-c bump sat waiting to merge for 4 months [4] before undergoing a similar mass revert.
Adding more fuel to this fire, it's not just a problem with our own dependencies either. In fact, the problem is probably worse there. Need I remind people that it took 5 releases, or 2 years, 8 months and change, to uncap SQLAlchemy 2.x (a contributing factor in the oslo.db issue, fwiw)? We had to struggle to get Sphinx uncapped in Bobcat [5] and then fight to stop that uncap from being reverted. You currently can't build docs for many projects on Fedora hosts since we're capping, among many other things, the Pillow library. The list goes on.
Which brings me to my main points, of which there are three. Firstly:
* There needs to be a clear time limit for how long a u-c bump can remain unmerged and a plan for what happens if affected projects have not resolved their issues within that time. There will always be exceptions - no one could rightly have asked services to switch to SQLAlchemy 2.x in a fortnight, and libraries can have legit bugs that justify a block - but those exceptions should be well-understood and, as with the bumps themselves, time-limited. A project that needs 4 releases to adapt to a crucial new library version is not a healthy project.
secondly:
* Caps or u-c bump reverts should similarly be clearly time-limited. We should not be defaulting to e.g. re-capping Sphinx because your docs build has started to fail and then leaving that cap in place for months. Cap, fix the issue, and remove the cap asap.
and, finally:
* A failing u-c bump for an OpenStack deliverable should be treated with the highest priority 🚨🚨🚨, and the corresponding team should be made aware of it immediately so they can start coordinating a resolution. We should not have u-c patches languishing for weeks, let alone months, only for there to be a last-minute panic like this.
I have ideas on the above (which basically amount to more stick [7]) but, perhaps fortunately, this isn't something I can decide on my own. Nor, I should add, is it something I expect the understaffed and oversubscribed release team to be able to do by themselves. Instead, I think it's something that the TC and the community as a whole need to settle on. So as soon as we get Dalmatian out the door, let's do that.
Cheers,
Stephen
[1] https://review.opendev.org/c/openstack/requirements/+/925763
[2] https://review.opendev.org/c/openstack/requirements/+/928948
[3] https://review.opendev.org/c/openstack/releases/+/928838
[4] https://review.opendev.org/c/openstack/requirements/+/883141
[5] https://review.opendev.org/c/openstack/requirements/+/891694
[6] https://review.opendev.org/c/openstack/requirements/+/927102
[7] https://y.yarn.co/dbdc0d5b-04e6-4c39-9cd3-44c3be3729ab.mp4
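To expand on the automation idea from the top of this mail, here's a very rough sketch of the kind of watchdog I had in mind. To be clear, this is only a sketch built on a pile of assumptions: I'm guessing at the Gerrit query that would find the bot-proposed bumps, guessing at how a gate failure shows up on the Verified label, and the addresses are placeholders. Figuring out who actually owns a given constraint (i.e. where the mail should go) is still the open question.

#!/usr/bin/env python3
"""Very rough sketch of a u-c bump watchdog.

Assumptions (none of these verified against the real proposal bot):
 * bot-proposed u-c bumps live in openstack/requirements and can be
   found with a Gerrit search query (the topic below is a guess)
 * a gate failure shows up as a negative/blocking vote on Verified
 * someone maintains a mapping from a change to a contact address
"""
import datetime
import json
import smtplib
from email.message import EmailMessage

import requests

GERRIT = "https://review.opendev.org"
# Hypothetical query; the topic/owner the bot actually uses may differ.
QUERY = "project:openstack/requirements status:open topic:new-release"
MAX_AGE = datetime.timedelta(days=7)  # the "clear time limit" from point 1


def stale_bumps():
    """Yield open u-c bump changes that have been failing for too long."""
    resp = requests.get(
        f"{GERRIT}/changes/",
        params={"q": QUERY, "o": ["LABELS"]},
        timeout=30,
    )
    resp.raise_for_status()
    # Gerrit prefixes its JSON responses with an XSSI guard line: strip it.
    changes = json.loads(resp.text.split("\n", 1)[1])
    now = datetime.datetime.now(datetime.timezone.utc)
    for change in changes:
        created = datetime.datetime.fromisoformat(
            change["created"].split(".")[0]
        ).replace(tzinfo=datetime.timezone.utc)
        verified = change.get("labels", {}).get("Verified", {})
        # "rejected"/"disapproved" cover Verified -2/-1; the exact semantics
        # depend on how Zuul votes, so treat this as a best guess.
        failing = verified.get("rejected") or verified.get("disapproved")
        if failing and now - created > MAX_AGE:
            yield change


def notify(change, recipient):
    """Send a nag mail about a single stuck change (addresses are fake)."""
    msg = EmailMessage()
    msg["Subject"] = (
        f"u-c bump unmerged for >{MAX_AGE.days} days: {change['subject']}"
    )
    msg["From"] = "uc-watchdog@example.org"
    msg["To"] = recipient
    msg.set_content(f"{GERRIT}/c/{change['project']}/+/{change['_number']}")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    for change in stale_bumps():
        # The hard part: mapping a change to its owning team's address.
        notify(change, "owner@example.org")

Something like this could run as a periodic job; the interesting (and unsolved) bit is the change-to-owner mapping, not the plumbing.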