I agree with your sentiments, Stephen. Perhaps we could implement some sort of automation, sending email to the owners of failed bumps? But I'm not sure what the best way to go about this would be. In particular, where would we get the information from? I've put a very rough sketch of the kind of thing I had in mind below Stephen's mail.
On Wed, Sep 18, 2024 at 7:14 PM Stephen Finucane <stephenfin@redhat.com> wrote:
o/
I'll jump straight into it. For those of you not in the loop, we're currently going through a situation where it looks like OSC will be held back due to a bug that the Nova gates highlighted. The actual details aren't hugely important, but what is important is the fact that this bug was first included in the OSC 7.0.0 release, which came out on August 6th. For those of you who are better at mental calendar maths than I am (or who have already looked at your calendar), you'll know this was over 6 weeks ago, well before the various feature freeze deadlines. So you might ask: why did we only spot this last week? That would be because the upper-constraints (henceforth referred to as u-c) bump failed [1] due to a different issue, meaning u-c kept us pinned at <= 6.6.1, which in turn meant every CI job *except* the ones running in the OSC gate itself kept testing the old release instead of the thing we were supposed to be releasing as part of Dalmatian. That only changed last week when 7.1.0 was released and the u-c bump for *that* merged [2], at which point people started shouting and hollering.
Now, there is one other important consideration to factor into this, namely that, due to a flaw in our tooling, we didn't end up cutting a release branch for OSC alongside the rest of the client-library deliverables. Had we done so, we'd have seen 7.1.0 much earlier than we actually did [3]. That 7.1.0 release fixed the initial issue, the one that had prevented the 7.0.0 u-c bump from merging. However, this doesn't change the fact that 7.0.0 came out 6 weeks ago and not once did we get to test it outside of OSC's own gate and, what's worse, we (the SDK team) were totally oblivious to that fact. Had it been known, we'd have cut a new release as soon as the fix [4] for that issue had merged (August 12th, which for the record is over 5 weeks ago).
This is not the first time this has happened to us with one of our own dependencies. We had a very similar issue in Bobcat (2023.2), where we cut an oslo.db 13.0.0 release very early in that cycle, followed by a number of subsequent releases, only to have none of them end up making their way into u-c due to issues with some services. We then got to the end of the cycle, realized this, and had to do a frenzied mega-revert to get us back to a pre-13.x state and allow us to release an oslo.db deliverable for Bobcat. It also happened with Castellan, whose 4.2.0 u-c bump sat waiting to merge for 4 months [4] before undergoing a similar mass revert.
Adding more fuel to this fire, it's not just a problem with our own dependencies either. In fact, the problem is probably worse there. Need I remind people that it took 5 releases, or 2 years, 8 months and change, to uncap SQLAlchemy 2.x (a contributing factor in the oslo.db issue, fwiw)? We had to struggle to get Sphinx uncapped in Bobcat [5] and then fight to stop that uncap from being reverted. You currently can't build docs for many projects on Fedora hosts since we're capping, among many other things, the Pillow library. The list goes on.
Which brings me to my main points, of which there are three. Firstly:
* There needs to be a clear time limit for how long a u-c bump can remain unmerged and a plan for what happens if affected projects have not resolved their issues within that time. There will always be exceptions - no one could rightly have asked services to switch to SQLAlchemy 2.x in a fortnight, and libraries can have legit bugs that justify a block - but those exceptions should be well-understood and, as with the bumps themselves, time-limited. A project that needs 4 releases to adapt to a crucial new library version is not a healthy project.
secondly:
* Caps or u-c bump reverts should similarly be clearly time-limited. We should not be defaulting to e.g. re-capping Sphinx because your docs build has started to fail and then leaving that cap in place for months. Cap, fix the issue, and remove the cap asap.
and, finally:
* A failing u-c bump for an OpenStack deliverable should be treated with the highest priority 🚨🚨🚨, and the corresponding team should be made aware of it immediately so they can start coordinating a resolution. We should not have u-c patches languishing for weeks, let alone months, only for there to be a last-minute panic like this.
I have ideas on the above (which basically amount to more stick [7]) but, perhaps fortunately, this isn't something I can decide on my own. Nor, I should add, is it something I expect the understaffed and oversubscribed release team to be able to do by themselves. Instead, I think it's something that the TC and the community as a whole need to settle on. So as soon as we get Dalmatian out the door, let's do that.
Cheers,
Stephen
[1] https://review.opendev.org/c/openstack/requirements/+/925763
[2] https://review.opendev.org/c/openstack/requirements/+/928948
[3] https://review.opendev.org/c/openstack/releases/+/928838
[4] https://review.opendev.org/c/openstack/requirements/+/883141
[5] https://review.opendev.org/c/openstack/requirements/+/891694
[6] https://review.opendev.org/c/openstack/requirements/+/927102
[7] https://y.yarn.co/dbdc0d5b-04e6-4c39-9cd3-44c3be3729ab.mp4
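To expand on the automation idea from the top of this mail, here's a very rough sketch of the kind of watchdog I had in mind. To be clear, this is only a sketch built on a pile of assumptions: I'm guessing at the Gerrit query that would find the bot-proposed bumps, guessing at how a gate failure shows up on the Verified label, and the addresses are placeholders. Figuring out who actually owns a given constraint (i.e. where the mail should go) is still the open question.

#!/usr/bin/env python3
"""Very rough sketch of a u-c bump watchdog.

Assumptions (none of these verified against the real proposal bot):
 * bot-proposed u-c bumps live in openstack/requirements and can be
   found with a Gerrit search query (the topic below is a guess)
 * a gate failure shows up as a negative/blocking vote on Verified
 * someone maintains a mapping from a change to a contact address
"""
import datetime
import json
import smtplib
from email.message import EmailMessage

import requests

GERRIT = "https://review.opendev.org"
# Hypothetical query; the topic/owner the bot actually uses may differ.
QUERY = "project:openstack/requirements status:open topic:new-release"
MAX_AGE = datetime.timedelta(days=7)  # the "clear time limit" from point 1


def stale_bumps():
    """Yield open u-c bump changes that have been failing for too long."""
    resp = requests.get(
        f"{GERRIT}/changes/",
        params={"q": QUERY, "o": ["LABELS"]},
        timeout=30,
    )
    resp.raise_for_status()
    # Gerrit prefixes its JSON responses with an XSSI guard line: strip it.
    changes = json.loads(resp.text.split("\n", 1)[1])
    now = datetime.datetime.now(datetime.timezone.utc)
    for change in changes:
        created = datetime.datetime.fromisoformat(
            change["created"].split(".")[0]
        ).replace(tzinfo=datetime.timezone.utc)
        verified = change.get("labels", {}).get("Verified", {})
        # "rejected"/"disapproved" cover Verified -2/-1; the exact semantics
        # depend on how Zuul votes, so treat this as a best guess.
        failing = verified.get("rejected") or verified.get("disapproved")
        if failing and now - created > MAX_AGE:
            yield change


def notify(change, recipient):
    """Send a nag mail about a single stuck change (addresses are fake)."""
    msg = EmailMessage()
    msg["Subject"] = (
        f"u-c bump unmerged for >{MAX_AGE.days} days: {change['subject']}"
    )
    msg["From"] = "uc-watchdog@example.org"
    msg["To"] = recipient
    msg.set_content(f"{GERRIT}/c/{change['project']}/+/{change['_number']}")
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    for change in stale_bumps():
        # The hard part: mapping a change to its owning team's address.
        notify(change, "owner@example.org")

Something like this could run as a periodic job; the interesting (and unsolved) bit is the change-to-owner mapping, not the plumbing.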