Open Stack

Thu Aug 22 21:28:04 UTC 2013

On Wed, Aug 14, 2013 at 06:13:18PM -0300, Thierry Carrez wrote:
> Matthew Treinish wrote:
> > Also, if anyone has any input on what threshold they feel is good enough
> > for this I'd welcome any input on that. For example, do we want to ensure 
> > a >= 1:1 match for job success? Or would something like 90% as stable as the
> > serial job be good enough considering the speed advantage. (The parallel runs
> > take about half as much time as a full serial run, the parallel job normally
> > finishes in ~25-30min) Since this affects almost every project I don't want to
> > define this threshold without input from everyone.
> 
> I guess 90% would be the limit where we'd start questioning it. 95% as
> stable then the speed improvement makes it definitely worth it IMHO. At
> 85% we would introduce way too many new false negatives in the tests,
> and those are painful to work around...
> 

So after fighting with the numbers a bit, I'm having a hard time quantifying how
much flakier the testr runs are in practice. There has been too much variability
in the gate lately. It's also hard to compare the failure rate against serial
when there are gate resets: because the parallel runs are so much faster than
the serial runs, they don't always show up as aborted. Combine that with
multiple serial jobs in different configurations all gating, and getting a
single percentage is not straightforward. I can try to invest more time in
figuring out how best to visualize this using graphite. But my gut feeling
after watching zuul and the jenkins job is that we'll probably end up with a
few more random failures (< 5) every couple of hours. The most common random
failures that I've seen are documented in these 3 bugs:

https://bugs.launchpad.net/tempest/+bug/1213209
https://bugs.launchpad.net/tempest/+bug/1213212
https://bugs.launchpad.net/tempest/+bug/1213215

However, the issues in those bugs are subtle enough that I'm having trouble
debugging them from the gate logs alone, and so far I haven't been able to
reproduce them locally. I think the only way to get them the attention they
need is to start gating on parallel and let them show up as random failures.
So I'm thinking it would be better to start gating on parallel now and debug
these as they come up.
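
To make the "N% as stable as the serial job" threshold concrete, here is a
minimal sketch of how the ratio could be computed once we have counts we trust.
The function names and the example counts are hypothetical placeholders, not
real gate data, and the sketch assumes that runs aborted by gate resets have
already been filtered out of both totals.

```python
def relative_stability(parallel_pass, parallel_total, serial_pass, serial_total):
    """Return the parallel job's success rate as a fraction of the serial job's.

    Assumption: the totals only count runs that actually finished, so runs
    aborted by a gate reset are excluded from both jobs.
    """
    if parallel_total <= 0 or serial_total <= 0:
        raise ValueError("each job needs at least one completed run")
    parallel_rate = parallel_pass / parallel_total
    serial_rate = serial_pass / serial_total
    if serial_rate == 0:
        raise ValueError("serial job never passed; the ratio is undefined")
    return parallel_rate / serial_rate


def meets_threshold(ratio, threshold=0.90):
    """True if the parallel job is at least `threshold` as stable as serial."""
    return ratio >= threshold


if __name__ == "__main__":
    # Placeholder counts for illustration only, not measured from the gate.
    ratio = relative_stability(parallel_pass=170, parallel_total=200,
                               serial_pass=185, serial_total=200)
    print(f"parallel is {ratio:.1%} as stable as serial")
    print("meets 90% threshold:", meets_threshold(ratio, 0.90))
    print("meets 95% threshold:", meets_threshold(ratio, 0.95))
```

With several gating serial jobs in different configurations, the serial counts
would have to come from whichever job we decide is the fair comparison, which
is part of what makes a single number hard to pin down.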

Another option that I've thought about is making the testr-full jobs voting on
the check queue. This would bring parallel failures to people's attention
without increasing the number of gate resets. The tradeoff is that the voting
gate jobs would then differ from the voting check jobs, which is something we
try to avoid. So I'm not sure it's a real option.

Assuming everyone is ok with green-lighting parallel tempest with a couple
of known bugs, the only real blocker right now is that neutron does not work
with tempest in parallel. I'm looking into getting this working with neutron
(there are a couple of issues with it right now), but if the demand for faster
tempest is there we can keep the neutron-smoke jobs serial for the time being,
and move them to parallel once neutron is ready.

I really don't want to decide unilaterally whether it's ok to leave neutron
serial for the time being, or whether the random failure rate is low enough to
make the switch now, since it's a decision that affects almost all the
projects. But at the same time, I don't think that anyone else is really
watching the gate-tempest-devstack-vm-testr-full jobs. Does anyone else have
an opinion on how we should proceed here?

Thanks,

Matt Treinish

[openstack-dev] Migrating to testr parallel in tempest
