On 22/01/2025 12:32, Takashi Kajinami wrote:
On 1/22/25 6:05 AM, Mike Bayer wrote:
Hi -
We've come to an impasse regarding our efforts to create a new backend selection system for oslo.service which will allow applications based on eventlet to have an incremental migration path where they can select for threads instead of eventlet [1] [2]. Briefly, the change organizes oslo.service into two different "backends", known as "eventlet" and "threaded". A calling application may choose to use either backend exclusively. Applications that use oslo.service right now are essentially using the "eventlet" backend. As these applications seek to migrate away from eventlet, they'd select the "threaded" backend and tune their applications to work under this new mode, where eventually the "eventlet" backend would be removed from oslo.service and ultimately from all of openstack entirely.
All of that falls under things that have been agreed upon. The current impasse is among three options of how this "backend" system in oslo.service may be informed of which backend should be used:
1. being selected at runtime via CONF from the .cfg file, which is what was originally agreed upon
2. at start up time via environment variable
3. as a fixed setting in source code.
Personally, as someone who has worked for decades on multiple switchable backend-oriented systems, I'm fully in the camp of choice 3, which is that switching the backend is just the beginning of a whole series of changes and tuning that a codebase would need in order to accommodate this backend change, and having the ability to "switch the backend back to eventlet" in CONF or runtime is simply not something anyone would want to do, unless a considerable amount of additional effort were made to support the codebase being reliable under both idioms. Which I would argue is a waste of effort since eventlet is going away.
More concretely, if you took nova, which was using oslo.service with eventlet, and decided to try the "threaded" backend, that's naturally going to have a lot of implications for other parts of the code. For example, if your code is setting eventlet monkey patching, that has to be turned off when this backend selection is made. If your code has other explicit use of eventlet, this also needs to be changed. Finally, threaded concurrency is naturally going to "feel" different than coroutine-based concurrency - context switches happen at different places and with different frequencies. Typically in a large codebase, the places that are prone to race conditions can shift; areas that had no races under coroutines now have threaded concurrency problems, while areas that were sensitive to coroutine issues like being CPU bound are suddenly smooth with threading. The overall number of "working routines" can change too: whereas eventlet can spin up 1000 coroutines like popcorn, under threads we now have to work with pools of threads, typically a few dozen or so threads large. Overall, if I had my whole application passing under tempest under "threading", I would have had to make changes to get there, which may very well be mutually exclusive vs. test resiliency under eventlet. More generally, getting my application to run under threads represents a change in implementation that I expect to be permanent. If there are continued issues running under threaded mode, those are bugs that have to be fixed in any case.
Yeah, so the first step I took was trying to remove explicit imports of eventlet from the code outside of a few functions in a single module. If we had to sprinkle "if" checks throughout the code base to see whether we were monkey patched or not, that would be a non-starter from a maintainability point of view. This will force us to refactor some code patterns and interfaces to use futures, or make other changes, so that the code can function with or without eventlet. An example of this is that we can't use eventlet's timeout context manager anymore and need to either pass the timeout down to the lib we are calling, if it supports it, or refactor the code to function differently, i.e. dispatch the call to an executor and use the timeout on the returned future to limit the time we wait. Those types of changes are complex to maintain over time, and while we can't avoid those refactorings entirely, there will be fewer of them if we only support one model per executable/binary in any given release.
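A minimal sketch of the executor-plus-future pattern mentioned above, using only the standard library. The names `call_with_timeout` and `_EXECUTOR` are hypothetical and are not taken from nova or oslo.service; the pool size is an arbitrary assumption.

```python
import concurrent.futures

# Hypothetical shared pool; a real service would size and own this elsewhere.
_EXECUTOR = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout, *args, **kwargs):
    """Run func in the pool and wait at most `timeout` seconds for the result."""
    future = _EXECUTOR.submit(func, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the call has not started yet; a call that is
        # already running keeps running in its worker thread.
        future.cancel()
        raise
```

One difference from a timeout context manager is worth keeping in mind: the timeout here only bounds how long the caller waits. It does not interrupt the worker, so the called function still needs its own way to stop if that matters.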
In the discussion at [1], Takashi Kajinami writes that a PTG discussion (for which I don't have any link or background) explicitly decided against the above idea:
We discussed this during the cross project session on Wednesday, during the discussion about its upgrade impact, and I remember John Garbutt raised his concern mainly about the performance impact of migrating from eventlet; the possibility of switching back to eventlet was proposed to mitigate the performance impact until all performance tests/improvements are done. Unfortunately the details of the discussion are not captured in the etherpad, but we can see the related discussion which happened later in the nova PTG[1].
[1] https://etherpad.opendev.org/p/nova-2025.1-ptg#L625
However, at that time we had not yet clearly evaluated the complexity of keeping backends selectable and, as mentioned in the above etherpad, we agreed that we may re-evaluate it once we complete the investigation.
I think we now clearly understand that it is too complicated to support both eventlet and threading (some work has been done already in neutron, and re-implementing the possibility to switch back to eventlet there might not be feasible, from my view), and I am in favor of not providing options, which is what we planned initially.
So for nova we were considering using a nova-specific environment variable for internal testing reasons, but our intent was to migrate one binary at a time and only have one implementation of each binary in any given release. The initial work I started on took this approach and did not allow any configurability. We didn't really intend to make this a runtime configuration that operators would use to switch between backends, but we were evaluating whether it would be useful for us for CI. I'm not actively working on this this cycle, and I don't think anyone else has had time to do it either, so we could discuss this again at the next PTG, but this is likely not something that needs to be implemented at the oslo level and would be a project-by-project discussion.
In the initial design we considered that developers may make the decision to migrate from eventlet to the new mechanism, and didn't intend to provide any options to switch back to the original eventlet model.
However, there was some strong pushback during the previous PTG: people were concerned about having no mechanism to switch back to the "previously worked well" eventlet model, and to address the concern we agreed to introduce the option for operators to switch the backend.
If we don't allow users to configure the option then IMO we should not expose the option and we should use an internal flag instead (though that's basically what was disagreed with during the PTG, AFAIR)
So first off, we have already realized that going with option 1, using CONF, is not tenable, because the CONF collection is not typically in its final populated state when imports occur, and since the "backends" are an import time selection, there's a chicken/egg problem that cannot be resolved unless applications either highly modify the way in which their applications consume CONF vs. do their basic imports, or the way we are writing our backends in [1] needs to be modified so that imports can proceed using proxy objects which then change their loaded implementation when CONF is set up. Both of these options seem deeply complex and overarchitected in order to support the feature of "I want to change the backend in CONF" which I maintain is not going to be a real world use case.
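To make the ordering problem concrete, here is a toy, self-contained simulation. `FakeConf` is a hypothetical stand-in and is not oslo.config; the only point is that code running at import time sees the default value, not the one parsed from the .cfg file later.

```python
class FakeConf:
    """Hypothetical stand-in for a config object filled in at startup."""

    def __init__(self):
        self.backend = "eventlet"  # default until the config file is parsed

    def load(self, values):
        # Simulates parsing the .cfg file during application startup.
        self.backend = values.get("backend", self.backend)


CONF = FakeConf()

# Module-level code like this runs once, at import time, before any
# application startup code has had a chance to parse the config file.
BACKEND = CONF.backend

# Much later, the application parses its config file.
CONF.load({"backend": "threaded"})

print(BACKEND, CONF.backend)  # prints: eventlet threaded
```

The import-time selection is frozen at "eventlet" even though the config says "threaded", which is why the alternatives are either reordering startup around CONF or hiding the backend behind proxy objects.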
As a compromise, I suggested option 2, using an environment variable. This allows the backend selection to remain something that can be theoretically changed from the outside without modifying source code. However, current deployment practices suggest this is not really useful since, again per Takashi, the way in which we deploy mod_wsgi does not allow environment variables to be local per virtual host; per [3] it looks like you need a hardcoded python script anyway that sets the env the way you want, which makes it not too useful.
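For illustration only, option 2 could look roughly like the sketch below. The variable name `EXAMPLE_SERVICE_BACKEND` and the helper `backend_from_env` are made up for this example and are not what the review at [1] implements.

```python
import os

VALID_BACKENDS = ("eventlet", "threaded")


def backend_from_env(environ=None, default="eventlet"):
    """Return the backend name from a hypothetical environment variable."""
    environ = os.environ if environ is None else environ
    name = environ.get("EXAMPLE_SERVICE_BACKEND", default)
    if name not in VALID_BACKENDS:
        raise ValueError(
            "unknown backend %r, expected one of %s" % (name, VALID_BACKENDS)
        )
    return name
```

The value has to be in the process environment before the first import that selects a backend. Where the environment cannot be set per service from the outside, a wrapper script would have to assign it in Python before importing the application, which is effectively a hardcoded setting again.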
As we were proceeding on option 2. we received some more pushback on the "env" idea from Arnaud Morin which I don't disagree with, and still more pushback and arguments leading in favor of 3, here from Jay Faulkner:
Using environment variables to set backends means that operators can change what backend is used. This is an easy thing for them to screw up and will end up creating a large amount of operator pain and bug churn. I am -1 in the strongest terms to the use of operator-adjustable values to enable/disable backends.
The above was surprising considering that this was supposed to be a decision that had already been made at PTG, yet there still seems to be disagreement as though the decision were not actually made in any final way.
So as I am tasked along with Daniel Bengtsson and Herve Beraud with getting [1] merged and moving onto building out the threaded backend, I would like to ask the group here to give me some background on the concerns raised at PTG and if we can all just here revisit the whole issue and hopefully decide that at least to start, let's get this merged without any CONF/env variable process (option 3); an application that's gone through the effort to transition to threads with the help of this backend selector should be assumed to be moving forward with that implementation, and if it has problems, that's just an ordinary bug like any other.
There are two competing concerns. I, as the person in nova who was working on it, was concerned by the complexity that making this configurable would introduce and the impact that would have on completing the work. Others rightly raised concerns that our CI does not necessarily give a good picture of the scalability/performance of any non-eventlet-based solution, and that we might need to support installing with or without eventlet for one SLURP release, specifically 2026.1, before fully removing eventlet support in 2026.2 or 2027.1 once all services are updated. I don't think nova will be at a point where it can run without eventlet by 2026.1; it might, but my hope was that with each release less and less of nova would use eventlet, on a per-binary (nova-compute, nova-conductor, ...) basis.
thanks for reading!
[1] https://review.opendev.org/c/openstack/oslo.service/+/935783
[2] https://review.opendev.org/c/openstack/oslo-specs/+/927503
[3] https://gist.github.com/GrahamDumpleton/b380652b768e81a7f60c