Re: [all][oslo] revisiting the proposal to deprecate eventlet from oslo.service

22 Jan 2025

      On 22/01/2025 12:32, Takashi Kajinami wrote:
...
On 1/22/25 6:05 AM, Mike Bayer wrote:
...
Hi -
We've come to an impasse regarding our efforts to create a new 
backend selection system for oslo.service which will allow 
applications based on eventlet to have an incremental migration path 
where they can select for threads instead of eventlet [1] [2].  
Briefly, the change organizes oslo.service into two different 
"backends", known as "eventlet" and "threaded".  A calling 
application may choose to use either backend exclusively.   
Applications that use oslo.service right now are essentially using 
the "eventlet" backend.  As these applications seek to migrate away 
from eventlet, they'd select the "threaded" backend and tune their 
applications to work under this new mode, where eventually the 
"eventlet" backend would be removed from oslo.service and ultimately 
from all of openstack entirely.
All of that falls under things that have been agreed upon.  The 
current impasse is among three options of how this "backend" system 
in oslo.service may be informed of which backend should be used:
1. being selected at runtime via CONF from the .cfg file, which is 
what was originally agreed upon
2. at start up time via environment variable
3. as a fixed setting in source code.
Personally, as someone who has worked for decades on multiple 
switchable backend-oriented systems, I'm fully in the camp of choice 
3, which is that switching the backend is just the beginning of a 
whole series of changes and tuning that a codebase would need in 
order to accommodate this backend change, and having the ability to 
"switch the backend back to eventlet" in CONF or runtime is simply 
not something anyone would want to do, unless a considerable amount 
of additional effort were made to support the codebase being reliable 
under both idioms.  Which I would argue is a waste of effort since 
eventlet is going away.
More concretely, if you took nova, which is using oslo.service with 
eventlet, and decided to try the "threaded" backend, that's naturally 
going to have a lot of implications for other parts of the code.   
For example, if your code is setting eventlet monkey patching, that 
has to be turned off when this backend selection is made.   If your 
code has other explicit use of eventlet, this also needs to be 
changed.  Finally, threaded concurrency is naturally going to "feel" 
different than coroutine-based concurrency - context switches happen 
at different places and with different frequencies.  Typically in a 
large codebase, the places that are prone to race conditions can 
shift; areas that had no races under coroutines now have threaded 
concurrency problems, areas that were sensitive to coroutine issues 
like being CPU bound are suddenly smooth with threading.   The 
overall number of "working routines" can change: whereas eventlet can 
spin up 1000 coroutines like popcorn, under threads we now have to 
work with pools of threads, typically a few dozen or so threads 
large.   Overall, if I had my whole application passing under tempest 
under "threading", I would have had to make changes to get there, 
which may very well be mutually exclusive vs. test resiliency under 
eventlet.   More generally, getting my application to run under 
threads represents a change in implementation that I expect to be 
permanent.   If there are continued issues running under threaded 
mode, those are bugs that have to be fixed in any case.
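A minimal sketch of the shift described above, using only the Python standard library. The names Counter and run_tasks are hypothetical and are not taken from nova or oslo.service. Under cooperative coroutines a plain read-modify-write with no yield point in the middle is not interrupted; under preemptive threads the same update needs a lock, and the work runs on a bounded pool rather than on one routine per task:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Counter:
    """Shared state that is updated by many concurrent tasks."""

    def __init__(self):
        self.value = 0
        # Under threads the increment below can be preempted between the
        # read and the write, so it is guarded by a lock.
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1


def run_tasks(n_tasks=1000, pool_size=32):
    """Run n_tasks increments on a bounded pool of pool_size threads."""
    counter = Counter()
    # 1000 tasks are queued, but at most pool_size run at any moment.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for _ in range(n_tasks):
            pool.submit(counter.increment)
    # Leaving the "with" block waits for all submitted tasks to finish.
    return counter.value
```

The pool size of 32 is only an illustrative value in the "few dozen" range mentioned above.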
Ya, so the first step I took was trying to remove explicit imports of 
eventlet from the code, outside of a few functions in a single module.

If we had to sprinkle "if" checks throughout the code base to see 
whether we were monkey patched or not, that would be a non-starter from 
a maintainability point of view.

This will force us to refactor some code patterns and interfaces to use 
futures, or make other changes, so that the code can function with or 
without eventlet.

An example of this is that we can't use eventlet's timeout context 
manager anymore and need to either pass the timeout down to the lib we 
are calling, if it supports it, or refactor the code to function 
differently, i.e. dispatch the call to an executor and use the timeout 
on the returned future to limit the time we wait.
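A minimal sketch of that second pattern, assuming only the standard library's concurrent.futures. The names slow_call and call_with_timeout are hypothetical and this is not the actual nova code:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_call(delay):
    """Stand-in for a blocking library call that takes no timeout."""
    time.sleep(delay)
    return "done"


def call_with_timeout(executor, func, *args, timeout):
    """Dispatch func to the executor and wait at most timeout seconds."""
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # cancel() only stops work that has not started yet; a call that
        # is already running keeps running in its worker thread.
        future.cancel()
        raise
```

Note the difference in behaviour: the timeout limits how long the caller waits, it does not interrupt the call itself, so the worker thread stays busy until the underlying call returns.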

Those types of changes are complex to maintain over time, and while we 
can't avoid those refactorings entirely, there will be fewer of them if 
we only support one model per executable/binary in any given release.
...
...
In the discussion at [1], Takashi Kajinami writes that a PTG 
discussion (for which I don't have any link or background) explicitly 
decided against the above idea:
We discussed this during the cross project session on Wednesday, during 
the discussion about its upgrade impact, and I remember John Garbutt 
raised his concern mainly about the performance impact of migrating 
from eventlet, and the possibility of switching back to eventlet was 
proposed to mitigate the performance impact until all performance 
tests/improvements are done. Unfortunately the details of the 
discussion are not captured in the etherpad, but we can see the related 
discussion which happened later in the nova PTG[1].
[1] https://etherpad.opendev.org/p/nova-2025.1-ptg#L625
However, at that time we had not yet clearly evaluated the complexity 
of keeping backends selectable and, as mentioned in the above etherpad, 
we agreed that we may re-evaluate it once we complete the investigation.
I think we now clearly understand that it is too complicated to support 
both eventlet and threading (some work has been done already in neutron 
and re-implementing the possibility to switch back to eventlet there 
might not be feasible, from my view) and am in favor of not providing 
options, which is what we planned initially.
So for nova we were considering using a nova-specific environment 
variable for internal testing reasons, but our intent was to migrate 
one binary at a time and only have one implementation of each binary in 
any given release. The initial work I started took this approach and 
did not allow any configurability.

We didn't really intend to make this a runtime configuration that 
operators would use to switch between backends, but we were evaluating 
whether it would be useful for us for CI.

I'm not actively working on this this cycle, and I don't think anyone 
else has had time to do it either, so we could discuss this again at 
the next PTG, but this is likely not something that needs to be 
implemented at the oslo level and would be a project-by-project 
discussion.
...
...
...
In the initial design we considered that developers may make the 
decision to migrate from eventlet to the new mechanism, and didn't 
intend to provide any options to switch back to the original 
eventlet model.
...
However there was some strong pushback during the previous PTG and 
people were concerned about having no mechanism to switch back to the 
"previously worked well" eventlet model, and to address the concern 
we agreed to introduce the option for operators to switch the backend.
...
If we don't allow users to configure the option then IMO we should 
not expose the option and we should use an internal flag instead 
(though that's basically what was disagreed with during the PTG, AFAIR)
So first off, we have already realized that going with option 1, 
use CONF, is not tenable, because the CONF collection is not 
typically in its final populated state when imports occur, and since 
the "backends" are an import time selection, there's a chicken/egg 
problem that cannot be resolved unless applications either highly 
modify the way in which their applications consume CONF vs. do their 
basic imports, or the way we are writing our backends in [1] needs to 
be modified so that imports can proceed using proxy objects which 
then change their loaded implementation when CONF is set up.   Both 
of these options seem deeply complex and overarchitected in order to 
support the feature of "I want to change the backend in CONF" which I 
maintain is not going to be a real world use case.
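The chicken/egg problem above can be sketched in a few lines. This is only an illustration under stated assumptions: a plain dict stands in for the CONF object (it is not oslo.config), and load_backend, parse_config and BACKEND_AT_IMPORT are hypothetical names, not part of the change in [1]:

```python
# Stand-in for CONF: empty until the application has parsed its .cfg files.
CONF = {}


def load_backend():
    """What an import-time selection would do: read the setting right now."""
    return CONF.get("backend", "eventlet")


# This line represents module-level code that runs when the module is
# imported, which typically happens before configuration is loaded.
BACKEND_AT_IMPORT = load_backend()


def parse_config():
    """Stand-in for the later step where the .cfg file is read."""
    CONF["backend"] = "threaded"


parse_config()

# The value chosen at import time is already fixed to the default, even
# though CONF now says something else.
print(BACKEND_AT_IMPORT, load_backend())
```

Avoiding this requires either parsing configuration before any backend-dependent import, or proxy objects that defer the choice, which are the two alternatives described above.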
As a compromise, I suggested option 2. using an environment 
variable.   This allows the backend selection to remain something 
that can be theoretically changed from the outside without modifying 
source code.  However, current deployment practices suggest this is 
not really useful, as again per Takashi the way in which we deploy 
mod_wsgi does not allow environment variables to be local per virtual 
host; per [3] it looks like you need a hardcoded python script anyway 
that sets the env the way you want which makes it not too useful.
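For comparison, a minimal sketch of what option 2 amounts to. The variable name EXAMPLE_SERVICE_BACKEND and the function backend_from_env are hypothetical and are not what [1] implements. Unlike CONF, the process environment is already populated at import time, so the lookup itself is simple:

```python
import os

VALID_BACKENDS = ("eventlet", "threaded")
ENV_VAR = "EXAMPLE_SERVICE_BACKEND"  # hypothetical variable name


def backend_from_env(environ=None, default="eventlet"):
    """Pick the backend from the environment, falling back to a default."""
    if environ is None:
        environ = os.environ
    name = environ.get(ENV_VAR, default)
    if name not in VALID_BACKENDS:
        raise ValueError(f"unknown backend {name!r}")
    return name
```

The difficulty described above is therefore not in reading the variable but in getting it set per service, for example under mod_wsgi.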
As we were proceeding on option 2. we received some more pushback on 
the "env" idea from Arnaud Morin which I don't disagree with, and 
still more pushback and arguments leading in favor of 3, here from 
Jay Faulkner:
...
Using environment variables to set backends means that operators can 
change what backend is used. This is an easy thing for them to screw 
up and will end up creating a large amount of operator pain and bug 
churn. I am -1 in the strongest terms to the use of 
operator-adjustable values to enable/disable backends.
The above was surprising considering that this was supposed to be a 
decision that had already been made at PTG, yet there still seems to 
be disagreement as though the decision were not actually made in any 
final way.
So as I am tasked along with Daniel Bengtsson and Herve Beraud with 
getting [1] merged and moving onto building out the threaded backend, 
I would like to ask the group here to give me some background on the 
concerns raised at PTG and if we can all just here revisit the whole 
issue and hopefully decide that at least to start, let's get this 
merged without any CONF/env variable process (option 3); an 
application that's gone through the effort to transition to threads 
with the help of this backend selector should be assumed to be moving 
forward with that implementation, and if it has problems, that's just 
an ordinary bug like any other.
There are two competing concerns.

I, as the person in nova who was working on it, was concerned by the 
complexity that making this configurable would introduce and the impact 
that would have on completing the work.
Others rightly raised concerns that our CI does not necessarily give a 
good picture of the scalability/performance of any non-eventlet-based 
solution, and that we might need to support installing with or without 
eventlet for one SLURP release, specifically 2026.1, before fully 
removing eventlet support in 2026.2 or 2027.1 once all services are 
updated.

I don't think nova will be at a point where it can run without eventlet 
by 2026.1. It might, but my hope was that with each release less and 
less of nova would use eventlet, on a per-binary (nova-compute, 
nova-conductor, ...) basis.
...
...
thanks for reading!
[1] https://review.opendev.org/c/openstack/oslo.service/+/935783
[2] https://review.opendev.org/c/openstack/oslo-specs/+/927503
[3] https://gist.github.com/GrahamDumpleton/b380652b768e81a7f60c

Sean Mooney