[OpenStack-Infra] On failing image builds

Ian Wienand iwienand at redhat.com
Wed Jun 17 23:47:52 UTC 2015


Hi,

I spent some time last week figuring out issues with centos kernel
failures which turned out to have been fixed in a recent update that
was not applied to some nodes due to build failures.

This prompted me to look a bit more closely at builds using [1].  The
results are not great.  We are having a lot of failures even in just
the centos/fedora builds I've been looking at [2]; on some days most
images fail to build.

Now, I know there are things in motion here.  jhesketh is looking at
the git timeout issues, which are the major cause of problems (note
especially that the Saturday and Sunday jobs fare much better than
those run at other times, presumably because things are under less
load at the weekend).

I know there is a spec out for better testing of images before
deployment, which is slightly related, and there is also a change out
for a full REST API in nodepool.

Anyway, to avoid more problems like this, I think what I should do now
is expand this script to monitor all the images, not just
centos/fedora, and post its output to the infra list.  Having
sentinels in the log files [4] would make this more reliable.
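As a rough illustration of why sentinels help: with explicit start and
end markers, a checker can tell a build that failed apart from one
whose log simply stops.  A minimal Python sketch, assuming hypothetical
marker strings; the actual sentinels proposed in [4] may well differ:

```python
from enum import Enum

# Hypothetical sentinel strings; the markers proposed in [4] may differ.
START_SENTINEL = "IMAGE BUILD START"
SUCCESS_SENTINEL = "IMAGE BUILD SUCCESS"
FAILURE_SENTINEL = "IMAGE BUILD FAILURE"


class BuildStatus(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    INCOMPLETE = "incomplete"  # started, but no end marker (crashed or still running)
    UNKNOWN = "unknown"        # no start marker found at all


def classify_build(log_lines):
    """Classify one image build from its log lines using sentinels.

    Each start marker resets the state, so a log containing a retried
    build reports the outcome of the final attempt.  End markers seen
    without a preceding start marker are ignored.
    """
    status = BuildStatus.UNKNOWN
    for line in log_lines:
        if START_SENTINEL in line:
            status = BuildStatus.INCOMPLETE
        elif status is BuildStatus.INCOMPLETE and SUCCESS_SENTINEL in line:
            status = BuildStatus.SUCCESS
        elif status is BuildStatus.INCOMPLETE and FAILURE_SENTINEL in line:
            status = BuildStatus.FAILURE
    return status


if __name__ == "__main__":
    log = [
        "2015-06-17 10:00:00 IMAGE BUILD START centos-7",
        "2015-06-17 10:05:00 git clone timed out",
        "2015-06-17 10:05:01 IMAGE BUILD FAILURE centos-7",
    ]
    print(classify_build(log).value)
```

Without the end markers, the checker would have to guess from error
text scattered through the log, which is exactly the fragile part.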

That way, we can quickly identify issues with builds without manually
digging through log files, hopefully notice patterns of failure, and
spread the load of checking on things.

-i

[1] https://github.com/ianw/nodechecker
[2] http://people.redhat.com/~iwienand/nodechecker-output/
[3] https://review.openstack.org/139598
[4] https://review.openstack.org/190889
