[openstack-dev] Thoughts on the patch test failure rate and moving forward

Daniel P. Berrange berrange at redhat.com
Thu Jul 24 16:40:31 UTC 2014

On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:

> ==Future changes==

> ===Fixing Faster===
> We introduce bugs to OpenStack at some constant rate, which piles up
> over time. Our systems currently treat all changes as equally risky and
> important to the health of the system, which makes landing code changes
> to fix key bugs slow when we're at a high reset rate. We've got a manual
> process of promoting changes today to get around this, but that's
> actually quite costly in people time, and takes getting all the right
> people together at once to promote changes. You can see a number of the
> changes we promoted during the gate storm in June [3], and it was no
> small number of fixes to get us back to a reasonably passing gate. We
> think that optimizing this system will help us land fixes to critical
> bugs faster.
> [3] https://etherpad.openstack.org/p/gatetriage-june2014
> The basic idea is to use the data from elastic recheck to identify that
> a patch is fixing a critical gate related bug. When one of these is
> found in the queues it will be given higher priority, including bubbling
> up to the top of the gate queue automatically. The manual promote
> process should no longer be needed, and instead bugs fixing elastic
> recheck tracked issues will be promoted automatically.
> At the same time we'll also promote review on critical gate bugs through
> making them visible in a number of different channels (like on elastic
> recheck pages, review day, and in the gerrit dashboards). The idea here
> again is to make the reviews that fix key bugs pop to the top of
> everyone's views.

In some of the harder gate bugs I've looked at (especially the infamous
'live snapshot' timeout bug), it has been damn hard to actually figure
out what's wrong. AFAIK, no one has ever been able to reproduce it
outside of the gate infrastructure. I've even gone as far as setting up
identical Ubuntu VMs to the ones used in the gate on a local cloud, and
running the tempest tests multiple times, but still can't reproduce what
happens on the gate machines themselves :-( As such we're relying on
code inspection and the collected log messages to try and figure out
what might be wrong.

The gate collects a lot of info and publishes it, but in this case I
have found the published logs to be insufficient - I needed to get
the more verbose libvirtd.log file. devstack has the ability to turn
this on via an environment variable, but it is disabled by default
because it would add 3% to the total size of logs collected per gate
job.

There's no way for me to get that environment variable for devstack
turned on for a specific review I want to test with. In the end I
uploaded a change to nova which abused rootwrap to elevate privileges,
install extra deb packages, reconfigure libvirtd logging and restart
the libvirtd daemon.
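
For illustration only, here is a minimal sketch of the "reconfigure
libvirtd logging" step. It is not the change that was actually uploaded
to nova. log_filters and log_outputs are real libvirtd.conf settings,
but the helper name, the filter string and the log path below are
assumptions picked for the example.

```python
# Hypothetical helper, not the actual nova change described above.
# log_filters / log_outputs are real libvirtd.conf settings; the filter
# string and the log file path are assumed values for illustration.

ASSUMED_SETTINGS = {
    "log_filters": "1:libvirt 1:qemu 1:util",
    "log_outputs": "1:file:/var/log/libvirt/libvirtd.log",
}


def enable_verbose_logging(conf_text, settings=ASSUMED_SETTINGS):
    """Return libvirtd.conf text with the given logging settings applied.

    Existing assignments of the same keys (commented out or not) are
    dropped, and the new assignments are appended at the end.
    """
    kept = []
    for line in conf_text.splitlines():
        # Strip a leading comment marker so '#log_filters=...' also matches.
        key, sep, _ = line.lstrip("# \t").partition("=")
        if sep and key.strip() in settings:
            continue
        kept.append(line)
    for key, value in settings.items():
        kept.append('{} = "{}"'.format(key, value))
    return "\n".join(kept) + "\n"
```

The rewritten text would still have to be written back to libvirtd.conf
with elevated privileges, and libvirtd restarted, before the new log
settings take effect.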


This let me get further, but still not resolve it. My next attack is
to build a custom QEMU binary and hack nova further so that it can
download my custom QEMU binary from a website onto the gate machine
and run the test with it. Failing that I'm going to be hacking things
to try to attach to QEMU in the gate with GDB and get stack traces.
Anything is doable thanks to rootwrap giving us a way to elevate
privileges from Nova, but it is a somewhat tedious approach.
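
As a rough sketch of the GDB step, assuming the PID of the QEMU process
is already known and gdb is installed on the machine, a batch-mode
invocation can dump stack traces for all threads without an interactive
session. The function names are hypothetical; -p, -batch and -ex are
standard gdb options.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that attaches to pid, prints a backtrace
    for every thread, then exits (gdb detaches on exit)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def collect_backtraces(pid):
    """Run gdb against pid and return its combined output as text.

    Assumes gdb is on PATH and the caller has enough privileges to
    attach to the target process.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout
```

Attaching to a process owned by another user normally needs root, so in
the gate this would still have to go through the same rootwrap route.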

I'd like us to think about whether there is anything we can do to make
life easier in these kinds of hard debugging scenarios where the regular
logs are not sufficient.

|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
