Hi Thomas,

Any chance that worker 6 happened to execute tests from a single class? If I remember correctly (it has been a couple of years since I experimented with this), the scheduler in stestr wasn't able to schedule tests from the same class to different workers. As a result, in a few tempest plugins we had to split classes that contained too many tests into several smaller classes to work around this limitation.

Depending on your situation, there are two things that could help. Neither of them helps, however, if a single test class contains many tests, because all of this scheduling happens at the class level:
1. Timing data - stestr saves timing data per test and takes it into account when scheduling the next run. Try running your tests twice; if the second run is better balanced, you can reuse the timing data in future runs.

2. Worker file - you can force the scheduler to distribute the tests according to your own regexes by using the "--worker-file" option - https://stestr.readthedocs.io/en/latest/MANUAL.html#test-scheduling
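
To illustrate why timing data alone cannot fix one oversized class, here is a minimal Python sketch of timing-aware scheduling with class-level grouping. It is not stestr's actual code: the greedy strategy, the partition() helper, the grouping regex and the test names and durations are all assumptions made up for the example.

```python
import re
from collections import defaultdict

# Assumed class-level grouping regex: everything up to the last dot,
# so all tests of one class share the same group key.
GROUP_RE = re.compile(r"([^\.]*\.)*")


def group_key(test_id):
    """Return the module/class prefix of a dotted test id."""
    return GROUP_RE.match(test_id).group(0)


def partition(timings, workers):
    """Greedy sketch: put the heaviest group on the least loaded worker."""
    groups = defaultdict(float)
    for test_id, seconds in timings.items():
        groups[group_key(test_id)] += seconds

    loads = [0.0] * workers
    assignment = [[] for _ in range(workers)]
    # Heaviest groups first, so small groups can fill the gaps afterwards.
    for key, cost in sorted(groups.items(), key=lambda kv: kv[1], reverse=True):
        idx = loads.index(min(loads))
        loads[idx] += cost
        assignment[idx].append(key)
    return loads, assignment


# Hypothetical timing data: one class with ten 60 s tests,
# plus six small classes with two 30 s tests each.
timings = {f"pkg.mod.BigClass.test_{i}": 60.0 for i in range(10)}
for c in range(6):
    for i in range(2):
        timings[f"pkg.mod.Small{c}.test_{i}"] = 30.0

loads, assignment = partition(timings, 3)
for idx, load in enumerate(loads):
    print(f"Worker {idx}: {load:.0f}s, groups: {assignment[idx]}")
```

With this made-up data the big class pins one worker at 600 seconds while the other two finish after 180 seconds, even though an even split would be 320 seconds each. However good the timing data is, the run cannot be shorter than the slowest single class, which is why splitting the class was the only thing that helped us.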

Hopefully some of this helps,

On Tue, 1 Oct 2024 at 01:37, Thomas Goirand <zigo@debian.org> wrote:
Hi,

This is the 2nd time I'm opening a thread about it. Sorry for this, but
it really bothers me.

Today, running tempest, I got these stats from stestr:

  - Worker 0 (132 tests) => 0:24:49.209885
  - Worker 1 (100 tests) => 0:17:14.791845
  - Worker 2 (124 tests) => 0:42:52.690906
  - Worker 3 (189 tests) => 0:41:21.307241
  - Worker 4 (159 tests) => 0:45:49.503031
  - Worker 5 (143 tests) => 0:28:13.282371
  - Worker 6 (156 tests) => 3:16:52.364976
  - Worker 7 (103 tests) => 0:46:05.366089

So, thread #1 ran for 17 minutes, then sat idle for the rest of the
3-hour run of thread #6. While I thought stestr was dividing the number
of tests by the number of threads, I don't get why worker #1 only had
100 tests assigned.

Altogether, all tests could have been run in maybe less than an
hour (rather than the 3h16 above), if not-yet-run tests were
reassigned to idle threads.

I'm currently spending a lot of time on running tempest. Indeed, I'm
running tempest on each OpenStack upgrade, from Victoria up to
Dalmatian. With the way stestr currently runs, it may take 2 full days
to do that (and that's not even counting the times when an upgrade fails
and needs fixing...), when it could be done in maybe 8 hours (if I count
1h per OpenStack release).

So, knowing the above, it might be a good use of my time to dig into
stestr and see if I can fix this... or not!

Does anyone have a suggestion for another test runner that's compatible
with stestr, at least for test selection with a regular expression?
As far as I know, pytest cannot take a regexp for test selection, can
it? Or is there maybe a plugin for it?

If there's no compatible test runner, where should I dig in the stestr
code to rewrite things in a smarter way?

Cheers,

Thomas Goirand (zigo)

P.S: On Caracal today, I just had this:

==============
Worker Balance
==============
  - Worker 0 (117 tests) => 0:59:32.291385
  - Worker 1 (147 tests) => 1:22:28.228980
  - Worker 2 (125 tests) => 0:45:20.969397
  - Worker 3 (114 tests) => 1:46:21.667579
  - Worker 4 (170 tests) => 0:45:27.577738
  - Worker 5 (190 tests) => 2:28:39.744920
  - Worker 6 (182 tests) => 2:29:01.402255
  - Worker 7 (152 tests) => 2:29:10.183359

This looks better, but that's probably 1/ random and 2/ still not
perfect, with workers #0, #2 and #4 doing nothing two thirds of the time.



--
Martin Kopec (kopecmartin)