[placement][ptg] Gate health management

Matt Riedemann mriedemos at gmail.com
Mon Apr 8 22:10:48 UTC 2019

On 4/8/2019 1:28 PM, Matthew Treinish wrote:
> As for how best to determine this. We actually aggregate all the data already
> in the subunit2sql db. openstack-health does provide a slowest job list
> aggregated over time per job using this data:
> http://status.openstack.org/openstack-health/#/job/tempest-full-py3
> You just change the sort column to "Mean Runtime". I think there is a bug
> in the rolling average function there because those numbers look wrong, but
> it should be relative numbers.
> I also had this old script on my laptop [2] which I used to get a list of
> tests ordered by average speed (over the last 300 runs) filtered for those
> which took > 10 seconds. I ran this just now and generated this list:
> http://paste.openstack.org/show/749016/
> The script is easily modifiable to change job or number of runs.
> (I also think I've shared a version of it on ML before)
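
For illustration only, the kind of ranking described above (average
runtime per test over the last N runs, keeping only tests slower than a
threshold) could be sketched roughly as below. This is not the script
from [2] and it does not query subunit2sql; the function name
slowest_tests and the input shape (one dict of test id to duration in
seconds per run, oldest run first) are assumptions made up for the
example.

```python
from collections import defaultdict
from statistics import mean


def slowest_tests(runs, last_n=300, threshold=10.0):
    """Rank tests by mean runtime over the most recent runs.

    runs is a hypothetical in-memory structure: a list of dicts mapping
    test id -> duration in seconds, ordered oldest to newest.
    """
    durations = defaultdict(list)
    # Only consider the most recent last_n runs.
    for run in runs[-last_n:]:
        for test_id, seconds in run.items():
            durations[test_id].append(seconds)

    # Average each test over the runs in which it actually appeared.
    averages = {test_id: mean(values) for test_id, values in durations.items()}

    # Keep tests whose average exceeds the threshold, slowest first.
    slow = [(test_id, avg) for test_id, avg in averages.items() if avg > threshold]
    return sorted(slow, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up durations, purely to show the call.
    example_runs = [
        {"test_a": 12.0, "test_b": 1.0},
        {"test_a": 14.0, "test_b": 3.0, "test_c": 30.0},
    ]
    for test_id, avg in slowest_tests(example_runs):
        print(f"{avg:8.2f}s  {test_id}")
```

Changing last_n or threshold corresponds to the "number of runs" and
"> 10 seconds" knobs mentioned above; filtering by job would have to
happen when the runs are collected.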

The reason I started going down this route, compromising between running
scenario tests serially and running them at full concurrency (4 workers
in our gate) by choosing 2 workers, is that while we could definitely
mark more tests as slow, the tempest-slow* job is already set up to time
out at 3 hours (compared to 2 hours for tempest-full). At some point we
can only mark so many tests as slow before the tempest-slow job itself
starts timing out, and I don't really want to wait 3+ hours for a job to
time out. I put more alternatives in the commit message here:


I think part of what we need to do is start ripping tests out of the
main tempest repo and putting them into per-project plugins, especially
if they don't involve multiple services and are not interop tests
(thinking cinder-backup tests for sure). I think enabling SSH validation
by default also increased the overall job times quite a bit, but that's
totally anecdotal at this point.



