[openstack-dev] Migrating to testr parallel in tempest
Ben Nemec
openstack at nemebean.com
Fri Aug 16 18:03:57 UTC 2013
On 2013-08-14 16:10, Matthew Treinish wrote:
> On Wed, Aug 14, 2013 at 11:05:35AM -0500, Ben Nemec wrote:
>> On 2013-08-13 16:39, Clark Boylan wrote:
>> >On Tue, Aug 13, 2013 at 1:25 PM, Matthew Treinish
>> ><mtreinish at kortar.org> wrote:
>> >>
>> >>Hi everyone,
>> >>
>> >>So for the past month or so I've been working on getting tempest
>> >>to work stably
>> >>with testr in parallel. As part of this you may have noticed the
>> >>testr-full
>> >>jobs that get run on the zuul check queue. I was using that job
>> >>to debug some
>> >>of the more obvious race conditions and stability issues with
>> >>running tempest
>> >>in parallel. After a bunch of fixes to tempest and finding some
>> >>real bugs in some of the projects, things seem to have smoothed
>> >>out.
>> >>
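(Aside for anyone who wants to reproduce this locally: tempest's
parallel runs are driven by testrepository. A minimal sketch of the
setup involved is below; the actual .testr.conf in the tempest tree
may differ, so treat this as illustrative:

    # .testr.conf (illustrative, not necessarily tempest's exact file)
    [DEFAULT]
    test_command=${PYTHON:-python} -m subunit.run discover ./tempest $LISTOPT $IDOPTION
    test_id_option=--load-list $IDFILE
    test_list_option=--list

    # then run the suite with one test worker per CPU core:
    $ testr run --parallel

With --parallel, testr partitions the test ids across concurrent
workers, which is exactly what surfaces the cross-test races being
discussed here.)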
>> >>So I pushed the testr-full run to the gate queue earlier today.
>> >>I'll be keeping track of the success rate of this job vs. the
>> >>serial job and using that as the determining factor before we
>> >>push this live to be the default for all tempest runs. Assuming
>> >>the success rate matches up well enough with the serial job on
>> >>the gate queue, I will push out the change that will migrate all
>> >>the voting jobs to run in parallel, hopefully either Friday
>> >>afternoon
>> >>or early next
>> >>week. Also, if anyone has input on what threshold they feel is
>> >>good enough for this, I'd welcome it. For example, do we want to
>> >>ensure a >= 1:1 match for job success? Or would something like
>> >>90% as stable as the serial job be good enough considering the
>> >>speed advantage? (The parallel runs take about half as much time
>> >>as a full serial run; the parallel job normally finishes in
>> >>~25-30min.) Since this affects almost every project I don't want
>> >>to define this threshold without input from everyone.
>> >>
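(To make the 90% option concrete with made-up numbers: if the serial
job passes 95 of every 100 runs, then "90% as stable" would mean the
parallel job needs to pass at least

    0.90 * 95 = 85.5  ->  roughly 86 of every 100 runs

i.e. we'd be tolerating up to nine or ten extra failures per hundred
runs in exchange for the ~2x speedup.)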
>> >>Once there is some more data for the gate queue's parallel job,
>> >>I'll have some pretty graphite graphs to share comparing the
>> >>success trends between the parallel and serial jobs.
>> >>
>> >>So at this point we're in the home stretch and I'm asking for
>> >>everyone's help
>> >>in getting this merged. Everyone who is reviewing and pushing
>> >>commits, please watch the results from these non-voting jobs, and
>> >>if things fail on the parallel job but not the serial job, please
>> >>investigate the failure and open a bug if necessary. If it turns
>> >>out to be a bug in tempest, please
>> >>link it against
>> >>this blueprint:
>> >>
>> >>https://blueprints.launchpad.net/tempest/+spec/speed-up-tempest
>> >>
>> >>so that I can give it the attention it deserves. I'd hate to get
>> >>this close to
>> >>getting this merged and have a bit of racy code get merged at
>> >>the last second
>> >>and block us for another week or two.
>> >>
>> >>I feel that we need to get this in before the H3 rush starts up
>> >>as it will help
>> >>everyone get through the extra review load faster.
>> >>
>> >Getting this in before the H3 rush would be very helpful. When we made
>> >the switch with Nova's unittests we fixed as many of the test bugs
>> >as we could find, merged the change to switch the test runner, then
>> >treated all failures as very high priority bugs that received
>> >immediate attention. Getting this in before H3 will give everyone a
>> >little more time to debug any potential new issues exposed by Jenkins
>> >or people running the tests locally.
>> >
>> >I think we should be bold here and merge this as soon as we have good
>> >numbers that indicate the trend is for these tests to pass. Graphite
>> >can give us the pass-to-fail ratios over time; as long as these trends
>> >are similar for both the old nosetest jobs and the new testr job, I say
>> >we go for it. (Disclaimer: most of the projects I work on are not
>> >affected by the tempest jobs; however, I am often called upon to help
>> >sort out issues in the gate).
>>
>> I'm inclined to agree. It's not as if we don't have transient
>> failures now, and if we're looking at a 50% speedup in
>> recheck/verify times, then as long as the new version isn't
>> significantly less stable, it should be a net improvement.
>>
>> Of course, without hard numbers we're kind of discussing in a vacuum
>> here.
>>
>
> I also would like to get this in sooner rather than later and fix
> the bugs as they come in. But I'm wary of doing that because there
> isn't a proven history of success yet. No one likes gate resets, and
> I've only been running it on the gate queue for a day now.
>
> So here is the graphite graph that I'm using to watch parallel vs
> serial in the
> gate queue:
> https://tinyurl.com/pdfz93l
Okay, so what are the y-axis units on this? Just guessing, I would
say it's the percentage of failing runs, in which case it looks like
we're already within the 95%-as-stable range (it never dips below
-.05). Am I reading it right?
>
> On that graph the blue and yellow show the number of jobs that
> succeeded, grouped together in per-hour buckets (yellow being
> parallel and blue serial).
>
> The red line shows failures: a horizontal bar means that there is
> no difference in the number of failures between serial and
> parallel. When it dips negative, it is showing a failure on
> parallel that wasn't on a serial run at the same time. When it
> goes positive, it is showing a failure on serial that doesn't
> occur on parallel at the same time. But, because the serial runs
> take longer, the failures happen at an offset. So if the plot
> shows a parallel failure followed closely by a serial failure,
> then that is probably the same commit and not a parallel-specific
> issue.
>
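For anyone who wants to poke at that graph or build a variant, the
red line should be reproducible with a graphite target along these
lines. summarize() and diffSeries() are standard graphite functions;
the statsd metric paths and job names below are my guesses, so adjust
as needed:

    diffSeries(
      summarize(stats_counts.zuul.pipeline.gate.job.gate-tempest-devstack-vm-full.FAILURE, "1h"),
      summarize(stats_counts.zuul.pipeline.gate.job.gate-tempest-devstack-vm-testr-full.FAILURE, "1h"))

That difference goes positive when serial had the extra failure and
negative when parallel did, matching the description above.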
> Based on the results so far it looks like there is probably still a
> race or two, which would cause gate resets more than once a day if
> we move to parallel. But it's getting closer; what does everyone
> think?
>
> My only concern is that in the time it takes me to track these down
> and get the fixes merged, something new will pop up. For example,
> the last time it got almost this close, a nova patch got merged that
> broke almost all the aggregates tests in tempest when running in
> parallel, which prevented any run from succeeding.
>
> Thanks,
>
> Matt Treinish