On 4/8/2019 1:28 PM, Matthew Treinish wrote:
As for how best to determine this: we already aggregate all the data in the subunit2sql DB. openstack-health provides a slowest-tests list, aggregated over time per job, using this data:
http://status.openstack.org/openstack-health/#/job/tempest-full-py3
You just change the sort column to "Mean Runtime". I think there is a bug in the rolling average function there, because those numbers look wrong, but they should still be useful as relative numbers.
I also had this old script on my laptop [2] which I used to get a list of tests ordered by average runtime (over the last 300 runs), filtered to those that took > 10 seconds. I ran it just now and generated this list:
http://paste.openstack.org/show/749016/
The script is easy to modify to change the job or the number of runs. (I also think I've shared a version of it on the ML before.)
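For anyone who wants to pull a similar list without that script, something along these lines against the subunit2sql database should give roughly the same result. This is only a sketch: the DB URL and the table/column names (runs, run_metadata.build_name, test_runs.start_time/stop_time, tests.test_id) are my assumptions about the standard subunit2sql schema, not taken from the script in the paste.

from collections import defaultdict

import sqlalchemy as sa

# All of these values are assumptions for the sketch, not from the thread.
DB_URL = 'mysql+pymysql://query:query@localhost/subunit2sql'
JOB = 'tempest-full-py3'
NUM_RUNS = 300
MIN_SECONDS = 10.0

engine = sa.create_engine(DB_URL)

# The most recent NUM_RUNS runs of the job, found via run_metadata.build_name.
recent_runs = sa.text("""
    SELECT runs.id FROM runs
    JOIN run_metadata ON run_metadata.run_id = runs.id
    WHERE run_metadata.key = 'build_name' AND run_metadata.value = :job
    ORDER BY runs.run_at DESC LIMIT :num
""")

# Per-test start/stop times for those runs (only successful test runs).
test_times = sa.text("""
    SELECT tests.test_id, test_runs.start_time, test_runs.stop_time
    FROM test_runs
    JOIN tests ON tests.id = test_runs.test_id
    WHERE test_runs.run_id IN :run_ids AND test_runs.status = 'success'
""").bindparams(sa.bindparam('run_ids', expanding=True))

durations = defaultdict(list)
with engine.connect() as conn:
    run_ids = [row[0] for row in
               conn.execute(recent_runs, {'job': JOB, 'num': NUM_RUNS})]
    for name, start, stop in conn.execute(test_times, {'run_ids': run_ids}):
        if start and stop:
            durations[name].append((stop - start).total_seconds())

# Mean runtime per test, slowest first, filtered to anything over MIN_SECONDS.
means = sorted(((sum(v) / len(v), name) for name, v in durations.items()),
               reverse=True)
for mean, name in means:
    if mean >= MIN_SECONDS:
        print('%8.2f  %s' % (mean, name))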
The reason I started down this route of compromising between running scenario tests serially vs. at full concurrency (4 workers in our gate), and settled on 2 workers, is that while we could definitely mark more tests as slow, the tempest-slow* job is already set up to time out at 3 hours (compared to 2 hours for tempest-full). At some point we can only mark so many tests as slow before the tempest-slow job itself starts timing out, and I don't really want to wait 3+ hours for a job to time out. (Marking a test slow is just an attribute decorator; see the sketch at the end of this message.) I put more alternatives in the commit message here:

https://review.openstack.org/#/c/650300/

I think part of what we need to do is start ripping tests out of the main tempest repo and putting them into per-project plugins, especially if they don't involve multiple services and are not interop tests (the cinder-backup tests for sure). I think enabling SSH validation by default also increased the overall job times quite a bit, but that's totally anecdotal at this point.

--

Thanks,

Matt
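For anyone not familiar with it, marking a test as slow in Tempest is just an attribute decorator; the tempest-full job filters those tests out and the tempest-slow* jobs select them. A minimal sketch, with made-up class and test names:

from tempest.lib import decorators
from tempest import test


class ExampleScenarioTest(test.BaseTestCase):
    """Illustrative test class; not a real test from the thread."""

    @decorators.attr(type='slow')
    def test_long_running_workflow(self):
        # A multi-service scenario considered too slow for tempest-full;
        # the 'slow' attribute is what the tempest-slow* jobs select on.
        pass

The worker count and the 2- vs. 3-hour timeouts discussed above are set in the job definitions (Zuul), not in the tests themselves.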