On 4/8/2019 1:28 PM, Matthew Treinish wrote:
As for how best to determine this: we already aggregate all the data in the subunit2sql DB. openstack-health provides a slowest-tests list, aggregated over time per job, using this data:
http://status.openstack.org/openstack-health/#/job/tempest-full-py3
You just change the sort column to "Mean Runtime". I think there is a bug in the rolling average function there, because those numbers look wrong, but they should still be useful as relative numbers.
I also had this old script on my laptop [2] which I used to get a list of tests ordered by average runtime (over the last 300 runs), filtered to those that took > 10 seconds. I ran it just now and generated this list:
http://paste.openstack.org/show/749016/
The script is easy to modify to change the job or the number of runs. (I also think I've shared a version of it on the ML before.)
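For anyone who wants to pull a similar list without that script, something along these lines against the subunit2sql database should give roughly the same result. This is only a sketch: the DB URL and the table/column names (runs, run_metadata.build_name, test_runs.start_time/stop_time, tests.test_id) are my assumptions about the standard subunit2sql schema, not taken from the script in the paste.

from collections import defaultdict

import sqlalchemy as sa

# All of these values are assumptions for the sketch, not from the thread.
DB_URL = 'mysql+pymysql://query:query@localhost/subunit2sql'
JOB = 'tempest-full-py3'
NUM_RUNS = 300
MIN_SECONDS = 10.0

engine = sa.create_engine(DB_URL)

# The most recent NUM_RUNS runs of the job, found via run_metadata.build_name.
recent_runs = sa.text("""
    SELECT runs.id FROM runs
    JOIN run_metadata ON run_metadata.run_id = runs.id
    WHERE run_metadata.key = 'build_name' AND run_metadata.value = :job
    ORDER BY runs.run_at DESC LIMIT :num
""")

# Per-test start/stop times for those runs (only successful test runs).
test_times = sa.text("""
    SELECT tests.test_id, test_runs.start_time, test_runs.stop_time
    FROM test_runs
    JOIN tests ON tests.id = test_runs.test_id
    WHERE test_runs.run_id IN :run_ids AND test_runs.status = 'success'
""").bindparams(sa.bindparam('run_ids', expanding=True))

durations = defaultdict(list)
with engine.connect() as conn:
    run_ids = [row[0] for row in
               conn.execute(recent_runs, {'job': JOB, 'num': NUM_RUNS})]
    for name, start, stop in conn.execute(test_times, {'run_ids': run_ids}):
        if start and stop:
            durations[name].append((stop - start).total_seconds())

# Mean runtime per test, slowest first, filtered to anything over MIN_SECONDS.
means = sorted(((sum(v) / len(v), name) for name, v in durations.items()),
               reverse=True)
for mean, name in means:
    if mean >= MIN_SECONDS:
        print('%8.2f  %s' % (mean, name))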
The reason I started down this route of compromising between running scenario tests serially vs. at full concurrency (4 workers in our gate), and settled on 2 workers, is that while we could definitely mark more tests as slow, the tempest-slow* job is already set up to time out at 3 hours (compared to 2 hours for tempest-full). At some point we can only mark so many tests as slow before the tempest-slow job itself starts timing out, and I don't really want to wait 3+ hours for a job to time out. (Marking a test slow is just an attribute decorator; see the sketch at the end of this message.) I put more alternatives in the commit message here:

https://review.openstack.org/#/c/650300/

I think part of what we need to do is start ripping tests out of the main tempest repo and putting them into per-project plugins, especially if they don't involve multiple services and are not interop tests (the cinder-backup tests for sure). I think enabling SSH validation by default also increased the overall job times quite a bit, but that's totally anecdotal at this point.

--

Thanks,

Matt
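For anyone not familiar with it, marking a test as slow in Tempest is just an attribute decorator; the tempest-full job filters those tests out and the tempest-slow* jobs select them. A minimal sketch, with made-up class and test names:

from tempest.lib import decorators
from tempest import test


class ExampleScenarioTest(test.BaseTestCase):
    """Illustrative test class; not a real test from the thread."""

    @decorators.attr(type='slow')
    def test_long_running_workflow(self):
        # A multi-service scenario considered too slow for tempest-full;
        # the 'slow' attribute is what the tempest-slow* jobs select on.
        pass

The worker count and the 2- vs. 3-hour timeouts discussed above are set in the job definitions (Zuul), not in the tests themselves.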