[placement][ptg] Gate health management
From the etherpad [1]:
* We have to run grenade and tempest. They are both very very slow and very unreliable (compared to the other tests). That's a pain. How do we help fix it?
  * Profile where time is spent setting up devstack? (A rough sketch of one way to do that follows this message.)
  * Figure out where the slowest tests are which aren't marked slow: http://status.openstack.org/elastic-recheck/#1783405
  * Figure out which (non-interop) tests in tempest could be split out to per-project repos or tempest plugins, i.e. if there are compute API tests which don't rely on other services or a hypervisor, and are just API/DB, they could be dealt with in nova functional tests instead.
  * Similarly: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000868....

The main thing to add here: we can come up with strategies on how to fix it, but how can we assure that there is time/energy/people to do that work?

[1] https://etherpad.openstack.org/p/placement-ptg-train

--
Chris Dent ٩◔̯◔۶ https://anticdent.org/
freenode: cdent tw: @anticdent
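A rough sketch of that devstack profiling idea: rank the largest gaps between timestamped lines in the devstack log. The timestamp format and the log path used here are assumptions for illustration, not guaranteed to match actual devstack output.

    #!/usr/bin/env python3
    """Rough devstack-setup profiler sketch: rank the biggest gaps between log lines.

    Assumes each line starts with a timestamp like "2019-04-08 16:42:45.123 | ...".
    Adjust TS_RE if the real log format differs.
    """
    import re
    import sys
    from datetime import datetime

    TS_RE = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)')


    def biggest_gaps(path, top=20):
        prev_ts = None
        prev_line = ''
        gaps = []
        with open(path) as f:
            for line in f:
                m = TS_RE.match(line)
                if not m:
                    continue
                ts = datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S.%f')
                if prev_ts is not None:
                    # Attribute the gap to the previous line, i.e. the step
                    # that was running while the clock advanced.
                    gaps.append(((ts - prev_ts).total_seconds(), prev_line.strip()))
                prev_ts, prev_line = ts, line
        for seconds, line in sorted(gaps, reverse=True)[:top]:
            print(f'{seconds:8.1f}s  {line[:120]}')


    if __name__ == '__main__':
        biggest_gaps(sys.argv[1])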
On 4/8/2019 11:42 AM, Chris Dent wrote:
* Figure out where the slowest tests are which aren't marked slow: http://status.openstack.org/elastic-recheck/#1783405
I tried something related to this last week to run the tempest-full* scenario tests with 2 workers concurrently rather than serially:

https://review.openstack.org/#/c/650300/

But looking at the stackviz output on that it doesn't seem to have worked at all, the scenario tests running at the end appear to still be running in serial. I don't know if that is a bug in *testr* or what - maybe mtreinish knows.

--
Thanks,
Matt
On Mon, Apr 8, 2019, at 9:58 AM, Matt Riedemann wrote:
On 4/8/2019 11:42 AM, Chris Dent wrote:
* Figure out where the slowest tests are which aren't marked slow: http://status.openstack.org/elastic-recheck/#1783405
I tried something related to this last week to run the tempest-full* scenario tests with 2 workers concurrently rather than serially:
https://review.openstack.org/#/c/650300/
But looking at the stackviz output on that it doesn't seem to have worked at all, the scenario tests running at the end appear to still be running in serial. I don't know if that is a bug in *testr* or what - maybe mtreinish knows.
I might not be looking in the right place, but the logs [0] seem to show the scenario tests running with 4 workers at the end of the job.

I do know that tempest aggregates tests by class in a worker, which means that if you get a bunch of slower classes (by test runtime) in a single worker, that worker will spend more time running. This does seem to have happened with the first tox tempest run in that job: worker zero ran for an additional 15 minutes after the other three workers had completed.

[0] http://logs.openstack.org/00/650300/1/check/tempest-full/1edb016/job-output....
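To make that class-per-worker effect concrete, here is a toy sketch. The class names and runtimes are made up, and the greedy assignment below is a simplification, not how stestr actually partitions tests; it just shows how whole classes landing on one worker can leave that worker running well past the others.

    """Toy illustration of class-per-worker scheduling skew (made-up numbers)."""
    from heapq import heappop, heappush

    # Hypothetical per-class total runtimes in seconds.
    class_runtimes = {
        'VolumeBootPattern': 900,
        'NetworkBasicOps': 700,
        'ServerActionsTest': 650,
        'MinimumBasicScenario': 600,
        'EncryptedCinderVolumes': 550,
        'SecurityGroupsBasicOps': 300,
        'ImagesOneServer': 120,
    }


    def assign_classes(classes, workers=4):
        # Hand each class (largest first) to the currently least-loaded worker,
        # mirroring the "a whole class goes to one worker" behaviour.
        heap = [(0.0, i) for i in range(workers)]
        loads = [0.0] * workers
        for _name, seconds in sorted(classes.items(), key=lambda kv: -kv[1]):
            load, idx = heappop(heap)
            loads[idx] = load + seconds
            heappush(heap, (loads[idx], idx))
        return loads


    loads = assign_classes(class_runtimes)
    for i, load in enumerate(loads):
        print(f'worker {i}: {load / 60:.1f} min')
    print(f'slowest vs fastest worker: {(max(loads) - min(loads)) / 60:.1f} min apart')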
On Mon, Apr 08, 2019 at 11:57:51AM -0500, Matt Riedemann wrote:
On 4/8/2019 11:42 AM, Chris Dent wrote:
* Figure out where the slowest tests are which aren't marked slow: http://status.openstack.org/elastic-recheck/#1783405
I tried something related to this last week to run the tempest-full* scenario tests with 2 workers concurrently rather than serially:
https://review.openstack.org/#/c/650300/
But looking at the stackviz output on that it doesn't seem to have worked at all, the scenario tests running at the end appear to still be running in serial. I don't know if that is a bug in *testr* or what - maybe mtreinish knows.
This was a conscious decision made to reduce the load during the scenario tests. The scenario tests are run serially after all the other tests are run in parallel. [1] The volume tests in particular were stressing the test environments a lot ~2 years ago, so this was done to mitigate that. There are more details in the commit message making the change: https://review.openstack.org/#/c/439698/ (FWIW I mildly disagreed with this direction, but not enough to block it.)

As for how best to determine this: we actually aggregate all the data already in the subunit2sql db. openstack-health does provide a slowest test list aggregated over time per job using this data:

http://status.openstack.org/openstack-health/#/job/tempest-full-py3

You just change the sort column to "Mean Runtime". I think there is a bug in the rolling average function there because those numbers look wrong, but they should be relative numbers.

I also had this old script on my laptop [2] which I used to get a list of tests ordered by average speed (over the last 300 runs), filtered for those which took > 10 seconds. I ran it just now and generated this list:

http://paste.openstack.org/show/749016/

The script is easily modifiable to change the job or number of runs. (I also think I've shared a version of it on the ML before.)

-Matt Treinish

[1] https://github.com/openstack/tempest/blob/master/tox.ini#L107-L109
[2] http://paste.openstack.org/show/749015/
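For anyone without access to the subunit2sql database, a simplified stand-in for the idea behind that kind of report (this is not the script in [2]): it assumes you have already exported per-run timing records to a CSV with columns test_id,run_id,seconds, which is an assumption made here for illustration.

    """Simplified 'slowest tests' report, averaged over recent runs.

    Assumes a CSV export with columns: test_id,run_id,seconds.
    """
    import csv
    from collections import defaultdict
    from statistics import mean


    def slowest_tests(csv_path, min_seconds=10.0, top=30):
        runtimes = defaultdict(list)
        with open(csv_path, newline='') as f:
            for row in csv.DictReader(f):
                runtimes[row['test_id']].append(float(row['seconds']))
        # Sort by mean runtime, slowest first, and keep only the slow ones.
        averages = sorted(((mean(v), t) for t, v in runtimes.items()), reverse=True)
        for avg, test_id in averages[:top]:
            if avg < min_seconds:
                break
            print(f'{avg:7.1f}s  {test_id}')


    if __name__ == '__main__':
        slowest_tests('test_times.csv')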
On 4/8/2019 1:28 PM, Matthew Treinish wrote:
As for how best to determine this: we actually aggregate all the data already in the subunit2sql db. openstack-health does provide a slowest test list aggregated over time per job using this data:
http://status.openstack.org/openstack-health/#/job/tempest-full-py3
You just change the sort column to "Mean Runtime". I think there is a bug in the rolling average function there because those numbers look wrong, but they should be relative numbers.
I also had this old script on my laptop [2] which I used to get a list of tests ordered by average speed (over the last 300 runs) filtered for those which took > 10 seconds. I ran this just now and generated this list:
http://paste.openstack.org/show/749016/
The script is easily modifiable to change job or number of runs. (I also think I've shared a version of it on ML before)
The reason I started going down this route, trying to compromise between running the scenario tests serially and at full concurrency (4 workers in our gate) by choosing 2 workers, is that while we could definitely mark more tests as slow, the tempest-slow* job is already set up to time out at 3 hours (compared to 2 hours for tempest-full). At some point we can only mark so many tests as slow before the tempest-slow job itself starts timing out, and I don't really want to wait 3+ hours for a job to time out.

I put more alternatives in the commit message here: https://review.openstack.org/#/c/650300/

I think part of what we need to do is start ripping tests out of the main tempest repo and putting them into per-project plugins, especially if they don't involve multiple services and are not interop tests (thinking cinder-backup tests for sure). I think enabling SSH validation by default also increased the overall job times quite a bit, but that's totally anecdotal at this point.

--
Thanks,
Matt
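As a back-of-the-envelope illustration of that constraint: the 3 hour and 2 hour timeouts come from the thread, but every runtime number below is made up for the sake of the arithmetic.

    # Illustrative only: rough check of whether moving more tests into
    # tempest-slow stays under its timeout. All runtimes are made-up numbers;
    # only the 3h tempest-slow / 2h tempest-full timeouts come from the thread.
    TEMPEST_SLOW_TIMEOUT_MIN = 180     # 3 hour job timeout

    current_slow_wall_min = 150        # hypothetical current tempest-slow wall time
    moved_tests_serial_min = 45        # hypothetical serial runtime of newly-marked slow tests
    scenario_workers = 2               # hypothetical concurrency for the moved tests

    projected = current_slow_wall_min + moved_tests_serial_min / scenario_workers
    print(f'projected tempest-slow wall time: {projected:.0f} min '
          f'(timeout: {TEMPEST_SLOW_TIMEOUT_MIN} min)')
    if projected > 0.9 * TEMPEST_SLOW_TIMEOUT_MIN:
        print('cutting it too close: marking more tests slow is not free')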
---- On Mon, 08 Apr 2019 11:42:45 -0500 Chris Dent <cdent+os@anticdent.org> wrote ----
From the etherpad [1]:
* We have to run grenade and tempest. They are both very very slow and very unreliable (compared to the other tests). That's a pain. How do we help fix it?
  * Profile where time is spent setting up devstack?
  * Figure out where the slowest tests are which aren't marked slow: http://status.openstack.org/elastic-recheck/#1783405
  * Figure out which (non-interop) tests in tempest could be split out to per-project repos or tempest plugins, i.e. if there are compute API tests which don't rely on other services or a hypervisor, and are just API/DB, they could be dealt with in nova functional tests instead.
  * Similarly: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000868....
The main thing to add here: We can come up with strategies on how to fix it, but how can we assure that there is time/energy/people to do that work?
Thanks, Chris, for bringing this up. This is a general problem faced by all projects running tempest-full. I have also added a topic for the QA PTG to discuss this, where we can have representatives from all those projects and find the best solution:

- https://etherpad.openstack.org/p/qa-train-ptg ("How to make tempest-full stable")

-gmann
[1] https://etherpad.openstack.org/p/placement-ptg-train

--
Chris Dent ٩◔̯◔۶ https://anticdent.org/
freenode: cdent tw: @anticdent
participants (5)
- Chris Dent
- Clark Boylan
- Ghanshyam Mann
- Matt Riedemann
- Matthew Treinish