First of all, thank you, Stephen, for triggering this discussion (and, of course, for your hard work keeping things running).

On 9/19/24 02:13, Stephen Finucane wrote:
o/
(snip)
This is not the first time this has happened to us with one of our own dependencies. We had a very similar issue in Bobcat (2023.2) where we cut an oslo.db 13.0.0 release very early in that cycle, followed by a number of subsequent releases, only to have none of them end up making their way into u-c due to issues with some services. We then got to the end of the cycle, realized this, and had to do a frenzied mega revert to get us back to a pre-13.x state and allow us to release an oslo.db deliverable for Bobcat. It also happened with Castellan, whose 4.2.0 release sat waiting to merge for 4 months [4] before undergoing a similar mass revert.
I know we already have clear consensus on this, but I'd mention that we had the same pain again during the Caracal RC phase, too. I'd say the situation was worse than in Bobcat, because most of the problems detected at Bobcat release time, which were "resolved" by the massive revert, were left unfixed, and we had to deal with them again at the very end of the release. This clearly shows that compatibility problems with the latest libs were being ignored, IMHO.
Adding more fuel to this fire, it's not just a problem with our own dependencies either. In fact, the problem is probably worse there. Need I remind people that it took 5 releases or 2 years, 8 months and change to uncap SQLAlchemy 2.x (a contributing factor in the oslo.db issue, fwiw). We had to struggle to get Sphinx uncapped in Bobcat [5] and fight to not revert that cap. You currently can't build docs for many projects on Fedora hosts since we're capping, among many other things, the Pillow library. The list goes on.
Which brings me to my main points, of which there are three. Firstly:
* There needs to be a clear time limit for how long a u-c bump can remain unmerged and a plan for what happens if affected projects have not resolved their issues within that time. There will always be exceptions - no one could rightly have asked services to switch to SQLAlchemy 2.x in a fortnight, and libraries can have legit bugs that justify a block - but those exceptions should be well-understood and, as with the bumps themselves, time-limited. A project that needs 4 releases to adapt to a crucial new library version is not a healthy project.
+1 If we can establish a workflow to notify project teams of failures in the cross jobs relevant to them, we can set a hard limit and, once it has passed, make the job non-voting and push the u-c bump through forcefully, given that the whole workflow has sometimes been blocked by a very small number of inactive projects in the past. I'd also propose a shorter limit for internal dependencies (dependencies maintained in OpenStack) than for external dependencies, to avoid problems with our own release process, but we can discuss the actual durations separately.
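To make the idea above a bit more concrete, here is a minimal sketch of the decision logic, not a proposal for actual tooling. The two limits, the `Bump` class, and the job name in the usage lines are all hypothetical placeholders; the real durations still need to be agreed on.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical limits; the actual durations are still to be discussed.
INTERNAL_LIMIT = timedelta(days=14)  # dependencies maintained in OpenStack
EXTERNAL_LIMIT = timedelta(days=42)  # third-party dependencies


@dataclass
class Bump:
    """A proposed u-c bump and the cross jobs currently failing on it."""
    library: str
    internal: bool
    proposed: date
    failing_jobs: list[str] = field(default_factory=list)


def deadline(bump: Bump) -> date:
    """Last day on which a failing cross job may still block the bump."""
    limit = INTERNAL_LIMIT if bump.internal else EXTERNAL_LIMIT
    return bump.proposed + limit


def next_action(bump: Bump, today: date) -> str:
    """Return 'merge', 'notify', or 'force' for the given bump."""
    if not bump.failing_jobs:
        return "merge"   # nothing is blocking the bump
    if today <= deadline(bump):
        return "notify"  # tell the affected teams, keep the jobs voting
    return "force"       # make the failing jobs non-voting and merge


# Usage: the job name below is made up for illustration.
bump = Bump("oslo.db", internal=True, proposed=date(2024, 9, 1),
            failing_jobs=["cross-example-py3"])
print(next_action(bump, date(2024, 9, 10)))
print(next_action(bump, date(2024, 9, 20)))
```

The point is only that the deadline depends on whether the dependency is internal or external, and that the jobs stay voting until that deadline has passed.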
secondly:
* Caps or u-c bump reverts should similarly be clearly time-limited. We should not be defaulting to e.g. re-capping Sphinx because your docs build has started to fail and then leaving that cap in place for months. Cap, fix the issue, and remove the cap asap.
+1
and, finally:
* A failing u-c bump for an openstack deliverable should be treated with the highest priority 🚨🚨🚨 and should be something that the corresponding team should be made aware of immediately so they can start coordinating a resolution. We should not have u-c patches languishing for weeks, let alone months, only for there to be a last minute panic like this.
+1 Making the project teams aware of cross job failures in a timely manner is definitely a good first step. I admit that these jobs are not very visible to the project teams. I myself only became more aware of their status after multiple heat-related issues were caught in last-minute phases in the past. I know that the requirements team doesn't have plenty of resources, so it'd be very helpful if we could automate the process instead of asking someone to "report a bug for us".
I have ideas on the above (which basically warrant to more stick/[7]) but, perhaps fortunately, this isn't something I can decide on my own. Nor, should I add, is it something I expect the understaffed and oversubscribed release team to be able to do themselves. Instead, I think it's something that the TC and community as a whole need to settle on. So as soon as we get Dalmatian out the door, let's do that.
Cheers, Stephen
[1] https://review.opendev.org/c/openstack/requirements/+/925763
[2] https://review.opendev.org/c/openstack/requirements/+/928948
[3] https://review.opendev.org/c/openstack/releases/+/928838
[4] https://review.opendev.org/c/openstack/requirements/+/883141
[5] https://review.opendev.org/c/openstack/requirements/+/891694
[6] https://review.opendev.org/c/openstack/requirements/+/927102
[7] https://y.yarn.co/dbdc0d5b-04e6-4c39-9cd3-44c3be3729ab.mp4