Open Stack

Fri Jan 24 19:23:44 UTC 2014

On Fri, Jan 24, 2014, at 10:51 AM, John Griffith wrote:
> On Fri, Jan 24, 2014 at 11:37 AM, Clay Gerrard <clay.gerrard at gmail.com>
> wrote:
> >>
> >>
> >> That's a pretty high rate of failure, and really needs investigation.
> >
> >
> > That's a great point, did you look into the logs of any of those jobs?
> > Thanks for bringing it to my attention.
> >
> > I saw a few swift tests that would pop, I'll open bugs to look into those.
> > But the cardinality of the failures (7) was dwarfed by jenkins failures I
> > don't quite understand.
> >
> > [EnvInject] - [ERROR] - SEVERE ERROR occurs: java.lang.InterruptedException
> > (e.g.
> > http://logs.openstack.org/86/66986/3/gate/gate-swift-python27/2e6a8fc/console.html)
> >
> > FATAL: command execution failed | java.io.InterruptedIOException (e.g.
> > http://logs.openstack.org/84/67584/5/gate/gate-swift-python27/4ad733d/console.html)
> >
> > These jobs are blowing up setting up the workspace on the slave, and we're
> > not automatically retrying them?  How can this only be affecting swift?
> 
> It's certainly not just swift:
> 
> http://logstash.openstack.org/#eyJzZWFyY2giOiJcImphdmEuaW8uSW50ZXJydXB0ZWRJT0V4Y2VwdGlvblwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxMzkwNTg5MTg4NjY5fQ==
> 
> >
> > -Clay
> >
> > _______________________________________________
> > OpenStack-dev mailing list
> > OpenStack-dev at lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >

This isn't all doom and gloom, but rather an unfortunate side effect of
how Jenkins aborts jobs. When a job is aborted there are corner cases
where Jenkins does not catch all of the exceptions that may be raised,
and the build is then reported as a failure instead of an abort. All of
this would be fine if we never aborted jobs, but it turns out Zuul
aggressively aborts jobs when it knows the result of that job will not
help anything (it can neither affect the ability to merge nor give
useful results to report back to code reviewers).

I have a hunch (but would need to do a bunch of digging to confirm it)
that most of these errors are simply job aborts that happened in ways
that Jenkins couldn't recover from gracefully. Looking at the most
recent occurrence of this particular failure we see
https://review.openstack.org/#/c/66307 failed
gate-tempest-dsvm-neutron-large-ops. If we go to the comments on the
change we see that this particular failure was never reported, which
implies the failure happened as part of a build abort.
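
To make that reasoning concrete, here is a minimal sketch of the triage
rule I am applying by hand. The record type and its field names are
made up for illustration; they are not a real Zuul or Jenkins schema.

```python
from dataclasses import dataclass


@dataclass
class BuildFailure:
    """Hypothetical record; fields are illustrative, not a real schema."""
    job: str
    queue: str                # "gate" or "check"
    reported_to_gerrit: bool  # did the failure show up as a review comment?


def triage(failure: BuildFailure) -> str:
    """Guess whether a Jenkins 'failure' deserves a closer look."""
    if failure.reported_to_gerrit:
        # A result that reached reviewers was not discarded by an abort.
        return "investigate"
    if failure.queue == "gate":
        # Unreported gate failures are consistent with a Zuul abort that
        # Jenkins recorded as a failure.
        return "likely-abort"
    # Zuul should not abort check jobs, so look at these by hand.
    return "investigate"
```

With this rule the gate-tempest-dsvm-neutron-large-ops failure above
comes out as "likely-abort", since it ran in the gate queue and was
never reported on the change.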

The other thing we can do to convince ourselves that this problem is
mostly poor reporting of job aborts is to restrict our logstash search
to build_queue:"check". Only the gate queue aborts jobs in this way, so
occurrences in the check queue would indicate an actual problem. If we
do that we see a bunch of "hudson.remoting.RequestAbortedException"
errors, which are also aborts that were not handled properly. Since Zuul
shouldn't abort jobs in the check queue, these were probably the result
of some human aborting jobs after a Zuul restart.
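
If you would rather script the narrowed search than edit it in the UI,
something like the following might work. It is a sketch that assumes
the part after the "#" in a logstash.openstack.org link is
base64-encoded JSON with a "search" key, which is how the link quoted
above appears to be built. The helper names are mine.

```python
import base64
import json


def decode_state(fragment: str) -> dict:
    """Decode a URL fragment assumed to be base64-encoded JSON."""
    return json.loads(base64.b64decode(fragment).decode("utf-8"))


def encode_state(state: dict) -> str:
    """Encode a search-state dict back into a URL fragment."""
    raw = json.dumps(state, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def restrict_to_queue(fragment: str, queue: str = "check") -> str:
    """Return a new fragment whose search is limited to one build queue."""
    state = decode_state(fragment)
    # Append a Lucene-style clause; assumes the existing search is non-empty.
    state["search"] = f'{state["search"]} AND build_queue:"{queue}"'
    return encode_state(state)


# Example with a hand-built state rather than a live link.
fragment = encode_state({"search": '"java.io.InterruptedIOException"'})
narrowed = restrict_to_queue(fragment)
```

Appending the narrowed fragment to http://logstash.openstack.org/#
should then give the check-queue-only view, provided the assumption
about the fragment format holds.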

TL;DR I believe this is mostly a non-issue and has to do with Zuul and
Jenkins quirks. If you see this error reported to Gerrit, we should do
more digging.

Clark

[openstack-dev] Gate Status - Friday Edition