Making stestr scheduler smarter with a lot of threads
Hi,
Currently, it looks like stestr calculates which thread should run which test at the beginning, by computing some partition, and then launches all tests at once. With a lot of threads, the result is that, at the end, only a few cores are in use, with all the others idle.
Would there be a way to have a pool of available threads instead, and have stestr give threads something to eat when they are available, instead of the current way? How much work would that be?
Cheers,
Thomas Goirand (zigo)
On Tue, 2024-07-09 at 04:38 +0200, Thomas Goirand wrote:
Hi,
Currently, it looks like stestr calculates which thread should run which test at the beginning, by computing some partition, and then launches all tests at once. With a lot of threads, the result is that, at the end, only a few cores are in use, with all the others idle.
Currently, unless you override it, my understanding is that the distribution of tests is done at the class level, in a round-robin manner across the worker threads. As you said, this is done statically before the worker threads are launched, by generating a file with the relevant distribution.
Would there be a way to have a pool of available threads instead, and have stestr give threads something to eat when they are available, instead of the current way?
I think, based on how this currently works, that would require us to repeatedly spawn threads and generate new workers after every class: effectively, pregenerate a set of task files at startup and, when one thread completes, grab the next file and launch a new thread.
If you actually wanted to do this with a thread pool and dispatch tests into that, I think it would be a larger rewrite.
How much work would that be?
I'm not very familiar with the workings of this, although I have had to debug it once or twice a few years ago due to gate issues, but I'm not convinced this would be that easy to do in a more dynamic way. You could likely hack together the approach of generating many test list files and spawning threads up to n with less work than properly using a thread pool and queuing the test units. I suspect both would be more than a couple of hours' work, but I don't know if that's days or weeks. It likely depends on how familiar people are with stestr; unfortunately, there are not many who are.
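To make the difference between the two scheduling styles concrete, here is a small simulation. It is only a sketch and does not use any stestr code: the function names, test names, and durations are all made up for illustration. It compares a round-robin assignment fixed up front with a pool in which each worker takes the next queued test as soon as it is free.

```python
import heapq


def static_finish_time(tests, durations, workers):
    """Round-robin assignment decided before any worker starts."""
    loads = [0.0] * workers
    for i, name in enumerate(tests):
        loads[i % workers] += durations[name]
    # The run ends when the most heavily loaded worker finishes.
    return max(loads)


def pooled_finish_time(tests, durations, workers):
    """Each worker takes the next queued test as soon as it becomes free."""
    free_at = [0.0] * workers  # min-heap of the times at which workers go idle
    heapq.heapify(free_at)
    for name in tests:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + durations[name])
    return max(free_at)


# Hypothetical workload: slow and fast tests alternate in discovery order.
durations = {"a": 10.0, "b": 1.0, "c": 10.0, "d": 1.0, "e": 10.0, "f": 1.0}
tests = list(durations)

print(static_finish_time(tests, durations, 2))  # 30.0
print(pooled_finish_time(tests, durations, 2))  # 21.0
```

With the static split, one worker gets all three slow tests while the other sits idle after 3 seconds, which is the idle-core tail described in the original question. The pooled version only models the scheduling; it says nothing about how results would be streamed back from workers.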
On Tue, Jul 9, 2024, at 1:37 AM, smooney@redhat.com wrote:
Currently, unless you override it, my understanding is that the distribution of tests is done at the class level, in a round-robin manner across the worker threads. As you said, this is done statically before the worker threads are launched, by generating a file with the relevant distribution.
The docs [0] say this "Currently the partitioning algorithm is simple round-robin for tests that stestr has not seen run before, and equal-time buckets for tests that stestr has seen run." which maintains the old testr behavior. Basically if there is run information in the stestr database it should use that to bucket tests more evenly. Are you seeing this behavior in a fresh checkout or with existing data in your database? One option for CI would be to record historical runs and preseed the database in fresh checkouts with that information.
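As a rough illustration of why recorded timings help, the sketch below contrasts the two partitioning styles the docs mention. It is not stestr's actual implementation: the greedy "longest test into the lightest bucket" rule is just one common way to build roughly equal-time buckets, and the test names and durations are invented.

```python
import heapq


def round_robin(tests, workers):
    """Deal tests out in order, one per worker in turn (no timing data)."""
    buckets = [[] for _ in range(workers)]
    for i, name in enumerate(tests):
        buckets[i % workers].append(name)
    return buckets


def timed_buckets(durations, workers):
    """Greedy balancing: longest test first, always into the lightest bucket."""
    heap = [(0.0, i) for i in range(workers)]  # (total seconds, bucket index)
    buckets = [[] for _ in range(workers)]
    # Sort by descending duration, then by name so ties are deterministic.
    for name, secs in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, i = heapq.heappop(heap)
        buckets[i].append(name)
        heapq.heappush(heap, (load + secs, i))
    return buckets


def makespan(buckets, durations):
    """Wall-clock time of the run: the total of the slowest bucket."""
    return max(sum(durations[t] for t in bucket) for bucket in buckets)


# Hypothetical durations, in seconds, as they might come from a previous run.
durations = {
    "t_slow_a": 60.0, "t_fast_a": 1.0, "t_slow_b": 50.0,
    "t_fast_b": 2.0, "t_mid": 20.0, "t_fast_c": 3.0,
}

print(makespan(round_robin(list(durations), 2), durations))  # 130.0
print(makespan(timed_buckets(durations, 2), durations))      # 70.0
```

In this made-up case the timing-aware split finishes in 70 seconds instead of 130, which is the kind of improvement preseeding the database is meant to unlock.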
[0] https://stestr.readthedocs.io/en/latest/MANUAL.html#parallel-testing
On Tue, Jul 09, 2024 at 07:49:43AM -0700, Clark Boylan wrote:
The docs [0] say this "Currently the partitioning algorithm is simple round-robin for tests that stestr has not seen run before, and equal-time buckets for tests that stestr has seen run." which maintains the old testr behavior. Basically if there is run information in the stestr database it should use that to bucket tests more evenly. Are you seeing this behavior in a fresh checkout or with existing data in your database? One option for CI would be to record historical runs and preseed the database in fresh checkouts with that information.
This isn't actually a new idea; it's something we initially discussed adding to stestr for Tempest and Nova unit tests around 6-7 years ago.
I actually have a WIP pull request implementing this as an experimental feature from ~5 years ago here:
https://github.com/mtreinish/stestr/pull/271
At the time I got it working on Linux and macOS, but was struggling to get the IPC for result streaming working correctly on Windows. I was eventually going to make the option available on POSIX-compatible platforms only, to sidestep this initially, but I never circled back to it because there wasn't a huge demand for the feature and I got distracted by other things. I just updated the branch, and it looks like in the intervening time things have bit-rotted a bit and it's not working at all anymore.
But, Clark's analysis is correct, and typically if you have a historical run in the stestr database the historical timing based partitioning strategy does a good enough job with worker balance that this isn't a problem most of the time (which is why the demand for #271 hasn't been so high).
To take advantage of this in CI, the trick a lot of people use is to cache the subunit result stream from the run and, before running the tests, call `stestr load` [1] on that cached result stream to populate the local database with historical data. Back in the day we used to do this with the subunit2sql database for OpenStack, but since that's all been retired I'm not sure what the current status of any of this configuration is.
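For anyone wiring this into a CI job, a minimal sketch of the caching trick might look like the following. The helper names and the cache file path are hypothetical, and it assumes that `stestr load` accepts the stream on stdin and that `stestr last --subunit` writes the most recent run as a subunit stream to stdout; check both against the stestr manual for your version.

```python
import subprocess
from pathlib import Path


def seed_history(cache_file, run=subprocess.run):
    """Load a cached subunit stream into the local stestr repository, if present."""
    path = Path(cache_file)
    if not path.is_file():
        # No cache yet: the first run has no timing data to partition with.
        return False
    with path.open("rb") as stream:
        # Assumption: `stestr load` reads the subunit stream from stdin.
        run(["stestr", "load"], stdin=stream, check=True)
    return True


def save_history(cache_file, run=subprocess.run):
    """Write the most recent run out as a subunit stream for the next job."""
    with Path(cache_file).open("wb") as stream:
        # The exit status is deliberately not checked here, since it may
        # reflect test failures in that run rather than an export problem.
        run(["stestr", "last", "--subunit"], stdout=stream)
```

A job would call `seed_history("cache/last.subunit")` before `stestr run` and `save_history("cache/last.subunit")` afterwards, with the CI system's own cache mechanism persisting that file between jobs.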
-Matt Treinish
[0] https://stestr.readthedocs.io/en/latest/MANUAL.html#parallel-testing
[1] https://stestr.readthedocs.io/en/stable/MANUAL.html#load
participants (4): Clark Boylan, Matthew Treinish, smooney@redhat.com, Thomas Goirand