I agree with your sentiments, Stephen. Perhaps we could implement some sort of automation that emails the owners of failed bumps?
But I don't know what the best way to go about this would be. In particular, where would we get the information from?
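
One rough idea, purely as a sketch and not something I've tested: the failed
bumps should already be visible in Gerrit as open openstack/requirements
reviews with a failing Verified vote, so something along these lines might be
enough to list them and their owners. The query filter and the idea of emailing
the owner directly are assumptions on my part:

    # Rough sketch: list open openstack/requirements changes that currently
    # carry a failing Verified vote, via Gerrit's REST API on review.opendev.org.
    # The query is an assumption; the bot-proposed u-c bumps may need a more
    # specific filter (topic, owner, affected project, etc.).
    import json

    import requests

    GERRIT = "https://review.opendev.org"
    QUERY = "project:openstack/requirements status:open label:Verified=-1"

    resp = requests.get(
        f"{GERRIT}/changes/",
        params={"q": QUERY, "o": "DETAILED_ACCOUNTS"},
        timeout=30,
    )
    resp.raise_for_status()
    # Gerrit prefixes its JSON responses with a )]}' guard line; strip it first.
    changes = json.loads(resp.text.split("\n", 1)[1])

    for change in changes:
        owner = change.get("owner", {})
        print(change["_number"], change["subject"], owner.get("email", "<unknown>"))

Mapping a failing bump back to the team that actually needs to fix it (Nova in
this case) is the part I really don't have an answer for.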

On Wed, Sep 18, 2024 at 7:14 PM Stephen Finucane <stephenfin@redhat.com> wrote:
o/

I'll jump straight into it. For those of you not in the loop, we're currently
going through a situation where it looks like OSC will be held back due to a bug
that the Nova gates have highlighted. The actual details aren't hugely important,
but what is important is the fact that this bug was first included in the OSC
7.0.0 release, which was released on August 6th. For those of you who are better
at mental calendar maths than I am (or already looked at your calendar), you'll
know this was over 6 weeks ago, well before the various feature freeze deadlines,
and you might ask why we only spotted this last week. That would be because
the upper-constraint (henceforth referred to as u-c) bump failed [1] due to a
different issue, meaning u-c kept us pinned at <= 6.6.1, which in turn meant
every CI job *except* the ones running in the OSC gate itself kept testing the
old release instead of the thing we were supposed to release as part of
Dalmatian. That only changed last week when 7.1.0 was released and the u-c bump
for *that* merged [2], at which point people started shouting and hollering.

Now, there is one other important consideration to factor into this, namely
that, due to a flaw in our tooling, we didn't end up cutting a release branch
for OSC alongside the rest of the client-library deliverables. Had we done so,
we'd have seen 7.1.0 much earlier than we actually did [3]. That 7.1.0 release
fixed the initial issue that prevented us releasing 7.0.0. However, this doesn't
change the fact that 7.0.0 came out 6 weeks ago and not once did we get to test
it outside of OSC's gate, and what's worse, we (the SDK team) were totally
oblivious to the fact. Had we known, we'd have cut a new release as soon
as the fix [4] for that issue had merged (August 12th, which for the record is
over 5 weeks ago).

This is not the first time this has happened to us with one of our own
dependencies. We had a very similar issue in Bobcat (2023.2) where we cut an
oslo.db 13.0.0 release very early in that cycle, followed by a number of
subsequent releases, only to have none of them end up making their way into u-c
due to issues with some services. We then got to the end of the cycle, realized
this, and had to do a frenzied mega revert to get us back to a pre-13.x state
and allow us to release an oslo.db deliverable for Bobcat. It also happened with
Castellan, whose 4.2.0 release sat waiting to merge for 4 months [4] before
undergoing a similar mass revert.

Adding more fuel to this fire, it's not just a problem with our own dependencies
either. In fact, the problem is probably worse there. Need I remind people that
it took 5 releases, or 2 years, 8 months and change, to uncap SQLAlchemy 2.x (a
contributing factor in the oslo.db issue, fwiw)? We had to struggle to get
Sphinx uncapped in Bobcat [5] and then fight against attempts to re-cap it. You
currently
can't build docs for many projects on Fedora hosts since we're capping, among
many other things, the Pillow library. The list goes on.

Which brings me to my main points, of which there are three. Firstly:

 * There needs to be a clear time limit for how long a u-c bump can remain
   unmerged and a plan for what happens if affected projects have not resolved
   their issues within that time. There will always be exceptions - no one could
   rightly have asked services to switch to SQLAlchemy 2.x in a fortnight, and
   libraries can have legit bugs that justify a block - but those exceptions
   should be well-understood and, as with the bumps themselves, time-limited. A
   project that needs 4 releases to adapt to a crucial new library version is
   not a healthy project.

Secondly:

 * Caps or u-c bump reverts should similarly be clearly time-limited. We should
   not be defaulting to e.g. re-capping Sphinx because your docs build has
   started to fail and then leaving that cap in place for months. Cap, fix the
   issue, and remove the cap asap.

And, finally:

 * A failing u-c bump for an OpenStack deliverable should be treated with the
   highest priority 🚨🚨🚨, and the corresponding team should be made aware of it
   immediately so they can start coordinating a resolution. We should not have
   u-c patches languishing for weeks, let alone months, only for there to be a
   last-minute panic like this.

I have ideas on the above (which basically amount to more stick [7]) but,
perhaps fortunately, this isn't something I can decide on my own. Nor, should I
add, is it something I expect the understaffed and oversubscribed release team
to be able to do themselves. Instead, I think it's something that the TC and
community as a whole need to settle on. So as soon as we get Dalmatian out the
door, let's do that.

Cheers,
Stephen

[1] https://review.opendev.org/c/openstack/requirements/+/925763
[2] https://review.opendev.org/c/openstack/requirements/+/928948
[3] https://review.opendev.org/c/openstack/releases/+/928838
[4] https://review.opendev.org/c/openstack/requirements/+/883141
[5] https://review.opendev.org/c/openstack/requirements/+/891694
[6] https://review.opendev.org/c/openstack/requirements/+/927102
[7] https://y.yarn.co/dbdc0d5b-04e6-4c39-9cd3-44c3be3729ab.mp4