[release][requirements][TC][oslo] how to manage divergence between runtime and doc?

Herve Beraud hberaud at redhat.com
Thu Sep 21 14:44:06 UTC 2023


We currently face a significant issue, one that could have a major impact on
the whole of OpenStack and on our customers. By customers I mean all the
users outside the upstream community: operators, distro maintainers, IT
vendors, etc.

Our problem is that, for a given series, Bobcat in our case, there is a
divergence between the versions we announce as supported to our customers
and the versions actually supported in our runtime.

Let me describe the problem.

The oslo.db versions supported by Bobcat's runtime [1] do not reflect the
versions actually released during Bobcat [2]. In Bobcat's
upper-constraints, oslo.db 12.3.2 [1] is the supported version. That
version is in reality the last one released during 2023.1/Antelope [3].
All the oslo.db versions released during Bobcat are, for now, ignored by
our runtime. However, all of these versions are listed in our technical
documentation as supported by Bobcat.

In fact, these oslo.db versions are all stuck at the upper-constraints
upgrade step: some cross-jobs failed, so the upper-constraints update
cannot be merged. These cross-jobs are owned by different services (Heat,
Manila, Masakari, etc.). We update our technical documentation each time we
produce a new version of a deliverable, that is, before upgrading the
upper-constraints. This is why the listed versions diverge from the
versions actually supported at runtime.
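To make the divergence concrete, here is a minimal sketch, with hypothetical
helper names and illustrative version data, of how one could detect releases
that are documented for a series but still ignored by its upper-constraints
pin:

```python
# Hypothetical sketch: detect releases documented for a series but still
# ignored by the runtime pin in upper-constraints.
# Real upper-constraints lines may also carry environment markers
# (e.g. ';python_version...'), which this toy parser strips.

def pinned_version(constraints_text, package):
    """Return the version pinned for *package* (lines like 'name===version')."""
    for line in constraints_text.splitlines():
        name, sep, version = line.partition("===")
        if sep and name.strip().lower() == package.lower():
            return version.strip().split(";")[0]
    return None

def as_tuple(version):
    """Naive numeric version key; enough for x.y.z release numbers."""
    return tuple(int(part) for part in version.split("."))

def unconstrained_releases(constraints_text, package, released):
    """Versions released during the series but newer than the runtime pin."""
    pin = pinned_version(constraints_text, package)
    return [v for v in released if as_tuple(v) > as_tuple(pin)]

# Illustrative data only: the pin mirrors Bobcat's oslo.db===12.3.2 case,
# and the released list stands in for the versions cut during the series.
constraints = "oslo.db===12.3.2\noslo.log===5.3.0\n"
print(unconstrained_releases(constraints, "oslo.db", ["13.0.0", "14.0.0"]))
# -> ['13.0.0', '14.0.0']: documented for the series, ignored by the runtime
```

Every version the sketch reports is one our documentation advertises but our
runtime never exercises.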

We also face a similar issue with Castellan, but for the sake of clarity
I'll focus on oslo.db's case for the rest of this thread.

From a quantitative point of view, we have faced this kind of problem for
two consecutive series. It now seems to be becoming our daily life with
each new series of OpenStack. At this rate it is very likely that we will
face this same problem again during the next series.

Indeed, during Antelope, the same issue appeared, but only within one
deliverable [4][5][6]. With Bobcat this scenario reappears, now within two
deliverables. The more significant the changes made in libraries, the more
we will face this kind of issue again, and as everybody knows our libraries
are all built on external libraries which can evolve toward new major
releases with breaking changes. That was the case with oslo.db, where our
goal was to migrate toward SQLAlchemy 2.x, leading to stuck
upper-constraints.

This problem can also impact all the downstream distros. Some distros have
already started facing issues [7] with oslo.db's case.

We can't exclude that a similar issue will soon start to appear within all
the OpenStack deliverables listed in upper-constraints. Oslo's case is only
the first sign.

From a quality point of view, we also face a real issue. Since customers
base their choices and decisions on our technical documentation, a
divergence between officially supported versions and runtime-supported
versions can have a major impact on them. Imagine they decide to install a
specific series to satisfy requirements imposed by a government; that could
be really problematic, since by reading our technical documentation and our
release notes they may believe that we fulfill those prerequisites. This
kind of government requirement arrives often: it can apply to a vendor who
wants to be allowed to sell to a government, or who must comply with
specific IT laws in a given country.

This last point can completely undermine the quality of the work carried
out upstream within the OpenStack community.

So, now, we have to find the root causes of this problem.

In the current case, one might think that the root cause lies in the
complexity of the oslo.db migration, yet this is not the case. Even though
this migration represents a major change in OpenStack, it was announced two
years ago [8], the equivalent of four series, leaving plenty of time for
every team to adopt the latest versions of oslo.db and SQLAlchemy 2.x.

Stephen Finucane and Mike Bayer have spent a lot of time on this topic.
Stephen even contributed well beyond the oslo world, by proposing several
patches to migrate services [9]. Unfortunately, many of these patches
remain unmerged and unreviewed [10], which has led us to this situation.

This migration is therefore by no means the root cause of this problem.

The root cause of this problem lies in the maintenance level of services.
Indeed, the main cause of this issue is that some services are not able to
keep pace, and they therefore slow down the evolution and maintenance of
libraries. Their requirements cross-jobs reflect this fact [11]. This lack
of activity is often due to a lack of maintainers.

Fortunately, Bobcat has been rescued by Stephen's recent fixes [12][13].
Stephen's elegant solution allowed us to fix the failing cross-jobs [14]
and hence to resynchronize our technical documentation with our runtime.

However, we can't ignore that the lack of maintainers is a growing trend
within OpenStack, as evidenced by the constant decrease in the number of
contributors from series to series [15][16][17][18]. This phenomenon
therefore risks becoming more and more pronounced.

So, we must provide a lasting response: a response based more on team
processes than on isolated human resources.

A first solution could be to modify our workflow a little. We could update
our technical documentation by triggering a job on the upper-constraints
update rather than on a new release patch. That way, the documentation and
the runtime would stay aligned. However, note that not all deliverables are
listed in upper-constraints, so this is a partial solution that won't work
for our services.

A second solution would be to monitor teams' activity by monitoring
upper-constraints updates with failing cross-jobs. That would be a new task
for the requirements team. The goal of this monitoring would be to inform
the TC when some deliverables are not active enough.

This monitoring would consist of analyzing, at defined milestones, which
upper-constraints updates have remained blocked for a while, and then
looking at the failing cross-jobs to see whether the failures are due to a
lack of activity on the service side, for example by checking whether
patches like those Stephen proposed on services remain unmerged. The TC
would then be informed.
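As a rough illustration only, such a milestone check could look like the
following sketch; the data model and the 30-day threshold are assumptions
of mine, not an existing requirements-team tool:

```python
# Illustrative sketch of milestone monitoring: flag upper-constraints
# bumps that have been blocked by failing cross-jobs for longer than a
# threshold, so the TC can be informed early in the series.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ConstraintBump:
    deliverable: str           # library whose pin is being raised
    proposed_on: date          # when the update patch was uploaded
    failing_cross_jobs: list   # e.g. ["cross-heat-py3"]

def stale_bumps(bumps, today, max_age_days=30):
    """Return bumps blocked by failing cross-jobs past the age threshold."""
    limit = timedelta(days=max_age_days)
    return [b for b in bumps
            if b.failing_cross_jobs and today - b.proposed_on > limit]

# Hypothetical input mirroring the oslo.db situation described above.
bumps = [
    ConstraintBump("oslo.db", date(2023, 7, 10), ["cross-heat-py3"]),
    ConstraintBump("oslo.log", date(2023, 9, 15), []),
]
report = stale_bumps(bumps, today=date(2023, 9, 21))
print([b.deliverable for b in report])  # -> ['oslo.db']
```

Anything in the report would be the signal handed to the TC, well before
the series deadline.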

It would be a kind of signal addressed to the TC, which would then be free
to make a decision (abandoning the deliverable, removing the cross-job,
put-your-idea-here).

The requirements team already provides great work and expertise. Without
them we wouldn't have solved the oslo.db and Castellan cases in time.
However, I think we lack TC involvement a little earlier in the series,
which would help avoid firefighting moments. The monitoring would surface
problems with deliverables sooner in the life cycle and would trigger TC
involvement.

Here is the opportunity for us to act to better anticipate the growing
phenomenon of lack of maintainers. Here is the opportunity for us to better
anticipate our available human resources.
Here is the opportunity for us to better handle this kind of incident in
the future.

Thus, we could integrate substantive actions in terms of human resources
management into the life cycle of OpenStack.

It is time to manage this pain point, because in the long term, if nothing
is done now, this problem will repeat itself again and again.

Concerning the feasibility of this solution, the release team has already
created similar monitoring, performed during each series at specific
milestones.

The requirements team could trigger its monitoring at specific milestone
targets, not too close to the series deadline, so that decisions could be
anticipated.

The requirements team could draw inspiration from the release management
process [19] to create its own monitoring. We already have almost
everything we need to create a new process dedicated to this monitoring.

Hence, this solution is feasible.

The usefulness of this solution is obvious: the TC would gain better
governance monitoring. Monitoring based not on the individuals elected as
TC members but on a process, and therefore transferable from one college to
the next.

Therefore, three teams would then work together on the topic of decreasing
activity inside teams.

From a global point of view, this would allow OpenStack to keep pace more
efficiently with the resources available from series to series.

I would now like to give special thanks to Stephen for his investment
throughout these two years dedicated to the oslo.db migration. I would
especially like to congratulate Stephen on the quality of the work carried
out. Stephen helped us solve the problem in an elegant manner; without his
expertise, delivering Bobcat would have been really painful. However, we
should not forget that Stephen is one human resource within OpenStack, and
that his expertise could leave OpenStack one day or another. Solving this
type of problem cannot rest on the shoulders of one person alone. Let's
take collective initiatives now and put safeguards in place.

Thanks for reading, and thanks to all the people who helped with this
topic and whom I have not cited here.

I'm sure other solutions exist, and I'll be happy to discuss this topic
with you.

[1]
https://opendev.org/openstack/requirements/src/branch/master/upper-constraints.txt#L482
[2] https://releases.openstack.org/bobcat/index.html#bobcat-oslo-db
[3]
https://opendev.org/openstack/releases/src/branch/master/deliverables/antelope/oslo.db.yaml#L22
[4] https://review.opendev.org/c/openstack/requirements/+/873390
[5] https://review.opendev.org/c/openstack/requirements/+/878130
[6] https://opendev.org/openstack/oslo.log/compare/5.1.0...5.2.0
[7]
https://lists.openstack.org/pipermail/openstack-discuss/2023-September/035100.html
[8]
https://lists.openstack.org/pipermail/openstack-discuss/2021-August/024122.html
[9] https://review.opendev.org/q/topic:sqlalchemy-20
[10] https://review.opendev.org/q/topic:sqlalchemy-20+status:open
[11] https://review.opendev.org/c/openstack/requirements/+/887261
[12]
https://opendev.org/openstack/oslo.db/commit/115c3247b486c713176139422647144108101ca3
[13]
https://opendev.org/openstack/oslo.db/commit/4ee79141e601482fcde02f0cecfb561ecb79e1b6
[14] https://review.opendev.org/c/openstack/requirements/+/896053
[15] https://www.openstack.org/software/ussuri
[16] https://www.openstack.org/software/victoria
[17] https://www.openstack.org/software/xena
[18] https://www.openstack.org/software/antelope/
[19]
https://releases.openstack.org/reference/process.html#between-milestone-2-and-milestone-3

-- 
Hervé Beraud
Senior Software Engineer at Red Hat
irc: hberaud
https://github.com/4383/

