On Tue, Jul 09, 2024 at 07:49:43AM -0700, Clark Boylan wrote:
> On Tue, Jul 9, 2024, at 1:37 AM, smooney@redhat.com wrote:
> > On Tue, 2024-07-09 at 04:38 +0200, Thomas Goirand wrote:
> >> Hi,
> >>
> >> Currently, it looks like stestr calculates which thread should run which
> >> test at the beginning, by computing a partition, and then launches all
> >> tests at once. With a lot of threads, the result is that, toward the end,
> >> only a few cores are in use, with all the others idle.
> > Currently, unless you override it, my understanding is that the distribution
> > of tests is done at the class level, in a round-robin manner across the
> > worker threads. As you said, this is done statically before the worker
> > threads are launched, at start-up, by generating a file with the relevant
> > distribution.
>
> The docs [0] say this "Currently the partitioning algorithm is simple round-robin for tests that stestr has not seen run before, and equal-time buckets for tests that stestr has seen run." which maintains the old testr behavior. Basically if there is run information in the stestr database it should use that to bucket tests more evenly. Are you seeing this behavior in a fresh checkout or with existing data in your database? One option for CI would be to record historical runs and preseed the database in fresh checkouts with that information.
>
> >>
> >> Would there be a way to have a pool of available threads instead, and
> >> have stestr give threads something to eat when they are available,
> >> instead of the current way?
> > I think, based on how this currently works, that would require us to
> > repeatedly spawn threads and generate new workers after every class:
> > effectively pre-generate a set of task files at start-up, and when one
> > thread completes, grab the next file and launch a new thread.
> >
> > If you actually wanted to do this with a thread pool and dispatch tests
> > into it, I think it would be a larger rewrite.
> >
> >> How much work would that be?
> >
> > I'm not very familiar with the workings of this, although I have had to
> > debug it once or twice a few years ago due to gate issues, but I'm not
> > convinced this would be that easy to do in a more dynamic way. You could
> > likely hack together the approach of generating many test list files and
> > spawning threads up to N with less work than properly using a thread pool
> > and queuing the test units, but I suspect both would be more than a couple
> > of hours of work; I don't know if that's days or weeks. It likely depends
> > on how familiar people are with stestr, and unfortunately there are not
> > many who are.
> >>
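
A quick aside on the "unless you override it" point in the quoted reply above:
that class-level grouping is controlled by the group_regex / parallel_class
options in a project's .stestr.conf. A config along these lines (the test_path
here is just a placeholder) keeps whole test classes on a single worker:

    [DEFAULT]
    test_path=./myproject/tests
    # keep all tests from the same class on one worker instead of
    # scheduling each test individually
    parallel_class=True
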
The dynamic dispatch idea itself isn't actually new; it's something we initially
discussed adding to stestr for the Tempest and Nova unit tests roughly 6-7
years ago.
I actually have a WIP pull request implementing this as an experimental
feature from ~5 years ago here:
https://github.com/mtreinish/stestr/pull/271

At the time I got it working on Linux and macOS, but was struggling to get the
IPC for result streaming working correctly on Windows. I was eventually going
to make the option available only on POSIX-compatible platforms to sidestep
this initially, but I never circled back to it because there wasn't a huge demand for
the feature and I got distracted by other things. I just updated the branch and
it looks like in the intervening time things have bit-rotted a bit and it's not
working at all anymore.
But Clark's analysis is correct: typically, if you have a historical run in the
stestr database, the timing-based partitioning strategy does a good enough job
with worker balance that this isn't a problem most of the time (which is why
the demand for #271 hasn't been so high).

To take advantage of this in CI, the trick a lot of people use is to cache the
subunit result stream from a run, and before running the tests call
`stestr load` [1] on that cached result stream to populate the local database
with historical data. Back in the day we used to do this with the subunit2sql
database for OpenStack, but since that's all been retired I'm not sure what the
current status of any of that configuration is.
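
In case it helps, a minimal sketch of that caching trick (untested; the file
name is just a placeholder, and in CI you'd restore it from whatever artifact
cache you use):

    # after a full run, export the subunit stream so it can be cached
    stestr last --subunit > cached-results.subunit

    # on a fresh checkout, seed the local database before running tests
    stestr init
    stestr load cached-results.subunit

    # subsequent runs can now use the recorded timing data for partitioning
    stestr run
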
-Matt Treinish
>
> [0] https://stestr.readthedocs.io/en/latest/MANUAL.html#parallel-testing

[1] https://stestr.readthedocs.io/en/stable/MANUAL.html#load