[openstack-dev] [gate] large-ops failure spike
sean at dague.net
Wed Jan 20 16:05:24 UTC 2016
On 01/20/2016 10:03 AM, Matthew Treinish wrote:
> On Wed, Jan 20, 2016 at 07:45:16AM -0500, Sean Dague wrote:
>> The large-ops jobs jumped to a 50% fail in check, 25% fail in gate in
>> the last 24 hours.
>> There isn't an obvious culprit at this point. I spent some time this
> There is a very obvious culprit: pip 8 was released last night. Every
> dsvm job that was failing between the release and when the fixes landed
> will have a spike like this. That graph has a 12 hour rolling average and the
> fixes landed less than 12 hours ago.
>> morning digging into it a bit. Possibly each individual instance build
>> got slower, possibly some other timeout is getting hit.
>> The large-ops jobs were largely maintained by Joe Gordon, who dug into
>> them when there were issues. He's not part of the community any more,
>> and I don't think there is currently a point person.
> I think you're conflating adding the jobs with maintaining them. Joe did
> initially add the jobs but he wasn't as active a maintainer as you're implying
> here. Well, no more so than he was for any other dsvm failure. Not having him
> around to help with failures anymore is an issue for all jobs not just the ones
> he added.
>> With no current maintainer, I'd suggest we make the jobs non voting -
> I'm -1 on this, we really don't want to remove jobs like this until we have
> equivalent coverage setup somewhere. Frankly there should just be a nova
> functional test that does similar load testing with the fake virt driver. But, until
> that's done I think it's premature to make these non-voting.
>> I also suggest their time has probably come and gone. There is no one
>> active on them, and the Rally team is.
>> A pre-gating test job is only useful if someone is actively addressing
>> systematic fails. This job class no longer has it. We should thus retire it.
> While I agree with the sentiment I don't think this actually applies in
> practice, the idea of a formal maintainer for a job is kinda a pipe dream. Look
> at: http://status.openstack.org/elastic-recheck/data/uncategorized.html and
> identify the maintainers for all the jobs listed there and ask why they have
> uncategorized failures. Are you saying we should retire all those jobs because
> there isn't anyone signed up (in the non-existent registry of job maintainers)
> to watch the failures?
Right, I've largely given up on that list. The only job I am currently
trying to figure out on a regular basis is the grenade job, as I feel
responsible for that. And I will definitely admit that is not where it
should be for pass rate.
I think that any test configuration which isn't free of races needs a
maintainer. The lack of one causes things like the current 23 hr gate
backlog, because projects just keep adding test jobs and then recheck grind
to get patches landed. That causes hours of delays to unrelated projects,
and it causes the loss of critical fixes because people give up trying to get
something through our test pipeline.
Pipe dream? Maybe. However it doesn't seem like it should be.