[openstack-dev] [gate] gate-grenade-dsvm-multinode intermittent failures
sean at dague.net
Thu Jan 21 16:09:17 UTC 2016
On 01/21/2016 11:00 AM, Matthew Treinish wrote:
> On Thu, Jan 21, 2016 at 08:18:14AM -0500, Davanum Srinivas wrote:
>> Failures for this job have been trending up and are causing the large
>> gate queue as well. I've logged a bug:
>> and am requesting switching the voting to off for this job:
> I think this was premature; we were actually looking at the problem last night. If
> you look at:
> grenade-multinode is 100% failure on both providers. The working hypothesis is
> that it's because tempest is trying to log in to the guest over the "private"
> network, which isn't set up as accessible from outside. You can see the discussion on
> this starting here:
>> We need to find and fix the underlying issue which can help us
>> determine when to switch this back on to voting or we clean up this job
>> from all the gate queues and move them to check queues (I have a TODO
>> for this in this review)
> TBH, there is always this push to remove jobs or testing whenever there is
> release pressure and a gate backup. No one seems to notice whenever anything isn't
> working and recheck grinds patches through. (well maybe not you Dims, because
> you're more on top of it than almost everyone) I know that I get complacent when
> there isn't a gate backup. The problem is that when things like our categorization rate
> have routinely been at or below 50% this cycle, it's not really a surprise we have
> gate backups like this. More people need to be actively debugging these problems
> as they come up; it can't just be the same handful of us. I don't think making
> things non-voting is the trend we want to set, because then what's gonna be the
> motivation to get others to help on this?
Deciding to stop everyone else's work while a key infrastructure / test
setup bug is being sorted isn't really an option.
It's an OpenStack global lock on all productivity.
Making jobs non voting means that it's a local lock instead of a global
one. That *has* to be the model for fixing things like this. We need to
get some agreement on that fact, otherwise there will never be more
volunteers to help fix things. Not everyone in the community can drop
all the work and context they have for solving hard problems because a
new cloud was added / upgraded / acts differently.
When your bus lights on fire you don't just keep driving with the bus
full of passengers. You pull over, let them get off, and deal with the
fire separately from the passengers.
If there is in-flight work by a set of people who are all going to
bed, handing that off with an email needs to happen. Especially if we
are expecting them to not just start over from scratch.