On 16/06/2025 10:11, Balazs Gibizer wrote:
On Sat, Jun 14, 2025 at 1:24 AM <thomas@goirand.fr> wrote:
On Jun 13, 2025 20:52, Jay Faulkner <jay@gr-oss.io> wrote:
I'm confused a bit -- the implementation details of our threading modules are not a public API that we owe deprecation periods for. Why are we treating it as such?
-JayF
Right. Plus, I don't get why operators get to choose what class of bugs they may experience, or how they would know better than contributors.
Just to address one thing: we don't really intend to expose the configurability to operators. We are building it in so that we (the core team) can test both versions and choose when to move each component to the new mode. The environment setting could be set by an operator to work around bugs if/when they happen, but our intent is that we would choose the mode each binary runs in, and the env var will just be for our internal use. Having it does provide us an escape hatch to revert to the old mode of operation if there is a high-severity bug.

We still have the ability to run os-vif in the CLI mode, using ovs-vsctl instead of the OVS Python bindings:
https://github.com/openstack/os-vif/blob/master/vif_plug_ovs/ovs.py#L72-L82
That was vital when OVS changed their implementation such that a reconnect would block the nova-compute agent for multiple seconds. Ironically, that was also eventlet related, but having the old, venerable, slow CLI-based driver as a fallback mitigated most of the impact until the OVS C and Python bindings could be fixed. That took the better part of a year to do and to have released/backported. I'm not saying it will take us the same amount of time if we have a bug in the threading mode, but it's possible.

We reported the eventlet-related concurrency bug on 2021-05-24:
https://bugs.launchpad.net/os-vif/+bug/1929446
The fix in ovsdbapp merged on Dec 2, 2021:
https://github.com/openstack/ovsdbapp/commit/a2d3ef2a6491eb63b5ee961fc930070...
and we still had backports of it being merged up until 2023-05-22, as distros backported the original OVS change into older releases of OVS. This is the type of "nasty bugs" gibi was referring to.

I for one wanted to only support one mode of operation per service binary per release, but I do see value, if for no other reason than debugging, in being able to revert to the old behavior. The fact that we had the vsctl driver made it very clear that this OVS bug was in the OVS lib or Python bindings, as we could revert to the other implementation and show it only happened in the native code path.
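To make the "team-chosen default per binary, env var as escape hatch" idea concrete, here is a minimal sketch. It is not nova's actual code: the variable name EXAMPLE_CONCURRENCY_MODE, the binary names, and the default values are all hypothetical and only illustrate the shape of the mechanism.

```python
import os

# Hypothetical names: neither the variable nor the defaults come from nova.
ENV_VAR = "EXAMPLE_CONCURRENCY_MODE"
VALID_MODES = ("eventlet", "threading")

# Defaults picked by the project team, one per service binary (illustrative).
DEFAULT_MODE = {
    "example-api": "threading",
    "example-compute": "eventlet",
}


def concurrency_mode(binary, environ=None):
    """Return the mode for a binary: the env override if set, else the default."""
    environ = os.environ if environ is None else environ
    override = environ.get(ENV_VAR)
    if override is None:
        return DEFAULT_MODE[binary]
    if override not in VALID_MODES:
        raise ValueError(
            "%s must be one of %s, got %r" % (ENV_VAR, VALID_MODES, override)
        )
    return override
```

With no override, each binary runs in the mode the team selected; setting the variable flips a single process back to the old mode while a bug is being fixed, much like falling back to the vsctl driver.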
The new concurrency model in nova (native threading) needs different performance tuning than the previous (eventlet). The cost of having 1000 eventlets is negligible but having 1000 threads to replace that will blow up the memory usage of the service. Operators expressed that having such tuning effort happening during upgrade without a temporary way back to the old model is scary. And honestly I agree.
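As an illustration of why the tuning differs, the sketch below runs many small tasks through a bounded pool of native threads instead of starting one thread per task, and records the peak number of tasks running at once. The function name, the task body, and the pool size are assumptions made for the example; they are not taken from nova.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_bounded(task_count, max_workers):
    """Run task_count small tasks on at most max_workers native threads.

    Returns the task results and the peak number of tasks seen running
    concurrently, which can never exceed max_workers.
    """
    active = 0
    peak = 0
    lock = threading.Lock()

    def task(i):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.001)  # stand-in for blocking I/O
        with lock:
            active -= 1
        return i

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(task, range(task_count)))
    return results, peak
```

With eventlet, spawning one greenthread per task is cheap; with native threads, the pool size becomes a knob that has to be tuned per deployment, trading memory against concurrency.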
Similarly, we expect nasty bugs in the new model, as it is a significant architectural change. So having no way to go back to a known good state temporarily while the bug is fixed or worked around is scary.
Third, if we want to keep green CI while we are transforming nova services to the new model without keeping a big feature branch then we need to be able to land code that passes CI while things are half transformed. The only way we can do that is if we support both concurrency modes in parallel for a while.
Cheers, gibi
Cheers,
Thomas Goirand (zigo)