[openstack-dev] Thoughts on the patch test failure rate and moving forward

Daniel P. Berrange berrange at redhat.com
Fri Jul 25 08:50:03 UTC 2014


On Thu, Jul 24, 2014 at 04:01:39PM -0400, Sean Dague wrote:
> On 07/24/2014 12:40 PM, Daniel P. Berrange wrote:
> > On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> > 
> >> ==Future changes==
> > 
> >> ===Fixing Faster===
> >>
> >> We introduce bugs to OpenStack at some constant rate, and they pile up
> >> over time. Our systems currently treat all changes as equally risky and
> >> important to the health of the system, which makes landing code changes
> >> to fix key bugs slow when we're at a high reset rate. We've got a manual
> >> process of promoting changes today to get around this, but that's
> >> actually quite costly in people time, and takes getting all the right
> >> people together at once to promote changes. You can see a number of the
> >> changes we promoted during the gate storm in June [3], and it was no
> >> small number of fixes to get us back to a reasonably passing gate. We
> >> think that optimizing this system will help us land fixes to critical
> >> bugs faster.
> >>
> >> [3] https://etherpad.openstack.org/p/gatetriage-june2014
> >>
> >> The basic idea is to use the data from elastic recheck to identify that
> >> a patch is fixing a critical gate related bug. When one of these is
> >> found in the queues it will be given higher priority, including bubbling
> >> up to the top of the gate queue automatically. The manual promote
> >> process should no longer be needed, and instead changes fixing elastic
> >> recheck tracked issues will be promoted automatically.
> >>
> >> At the same time we'll also promote review of critical gate bugs by
> >> making them visible in a number of different channels (like on elastic
> >> recheck pages, review day, and in the gerrit dashboards). The idea here
> >> again is to make the reviews that fix key bugs pop to the top of
> >> everyone's views.
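
To illustrate the promotion idea, the check could be as simple as
matching the bugs a change claims to close against the set of bugs
elastic recheck is tracking. This is just a sketch, not actual Zuul or
elastic recheck code; the tracked bug numbers and the change object are
made up for illustration:

    import re

    # Hypothetical set of gate bugs elastic recheck currently tracks.
    TRACKED_GATE_BUGS = {1253896, 1254872}

    CLOSES_BUG = re.compile(r'Closes-Bug:\s*#?(\d+)', re.IGNORECASE)

    def fixes_tracked_gate_bug(commit_message):
        # True if the change claims to fix a tracked gate bug.
        bugs = {int(num) for num in CLOSES_BUG.findall(commit_message)}
        return bool(bugs & TRACKED_GATE_BUGS)

    def queue_priority(change):
        # Changes fixing tracked gate bugs bubble to the front of the
        # gate queue; everything else keeps its normal ordering.
        return 0 if fixes_tracked_gate_bug(change.commit_message) else 1
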
> > 
> > In some of the harder gate bugs I've looked at (especially the infamous
> > 'live snapshot' timeout bug), it has been damn hard to actually figure
> > out what's wrong. AFAIK, no one has ever been able to reproduce it
> > outside of the gate infrastructure. I've even gone as far as setting up
> > Ubuntu VMs identical to the ones used in the gate on a local cloud, and
> > running the tempest tests multiple times, but still can't reproduce what
> > happens on the gate machines themselves :-( As such we're relying on
> > code inspection and the collected log messages to try and figure out
> > what might be wrong.
> > 
> > The gate collects a lot of info and publishes it, but in this case I
> > have found the published logs to be insufficient - I needed to get
> > the more verbose libvirtd.log file. devstack has the ability to turn
> > this on via an environment variable, but it is disabled by default
> > because it would add 3% to the total size of logs collected per gate
> > job.
> 
> Right now we're at 95% full on 14 TB (which is the max # of volumes you
> can attach to a single system in RAX), so every gig is sacred. There has
> been a big push, which included the sprint last week in Darmstadt, to
> get log data into swift, at which point our available storage goes way up.
> 
> So for right now, we're a little squashed. Hopefully within a month
> we'll have the full solution.
>
> As soon as we get those kinks out, I'd say we're in a position to flip
> on that logging in devstack by default.

I don't particularly mind verbose libvirtd.log debugging being disabled
by default, as long as there is a way to turn it on for the individual
reviews we're debugging with.

> > There's no way for me to get that devstack environment variable
> > turned on for a specific review I want to test with. In the end I
> > uploaded a change to nova which abused rootwrap to elevate privileges,
> > install extra deb packages, reconfigure libvirtd logging and restart
> > the libvirtd daemon.
> > 
> >   https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
> >   https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py
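
For the curious, the flavour of that hack is roughly the following. This
is only a sketch, not the actual code from the review above; the
libvirtd.conf settings, service name and the shape of the rootwrap
filter entries are assumptions:

    # Assumes matching CommandFilter entries for 'tee' and 'service'
    # exist in etc/nova/rootwrap.d/compute.filters.
    from nova import utils

    LIBVIRTD_LOG_CONF = (
        'log_filters="1:libvirt 1:qemu"\n'
        'log_outputs="1:file:/var/log/libvirt/libvirtd.log"\n')

    def enable_verbose_libvirtd_logging():
        # Append verbose logging settings to libvirtd.conf; run_as_root
        # routes the commands through rootwrap to gain privileges.
        utils.execute('tee', '-a', '/etc/libvirt/libvirtd.conf',
                      process_input=LIBVIRTD_LOG_CONF, run_as_root=True)
        # Restart libvirtd so the new logging config takes effect.
        utils.execute('service', 'libvirt-bin', 'restart',
                      run_as_root=True)
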
> > 
> > This let me get further, but still not resolve it. My next attack is
> > to build a custom QEMU binary and hack nova further so that it can
> > download my custom QEMU binary from a website onto the gate machine
> > and run the test with it. Failing that I'm going to be hacking things
> > to try to attach to QEMU in the gate with GDB and get stack traces.
> > Anything is doable thanks to rootwrap giving us a way to elevate
> > privileges from Nova, but it is a somewhat tedious approach.
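
Attaching GDB from inside the gate would look something like this; again
just a sketch, assuming gdb is installed on the node and the QEMU
process name matches, and ignoring the rootwrap plumbing needed to
actually attach as root:

    import subprocess

    def qemu_backtraces():
        # Find the (single) QEMU process backing the test guest.
        pid = subprocess.check_output(
            ['pidof', '-s', 'qemu-system-x86_64']).decode().strip()
        # Batch-mode gdb: attach, dump all thread backtraces, detach.
        return subprocess.check_output(
            ['gdb', '-p', pid, '-batch', '-ex', 'thread apply all bt'])
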
> > 
> > I'd like us to think about whether there is anything we can do to make
> > life easier in these kinds of hard debugging scenarios where the regular
> > logs are not sufficient.
> 
> Agreed. Honestly, though, we also need to figure out first fail
> detection on our logs. Because realistically if we can't debug
> failures from those, then I really don't understand how we're ever going
> to expect large users to.

Ultimately there are always going to be classes of bugs that are hard
or impossible for users to debug, which is why they'll engage vendors
for support of their OpenStack deployments. We should do as much as
possible to help them, but it is never going to be enough for all
possible types of bug.

I'm wondering, though, if we could make some changes so that when we hit
the test timeout, instead of simply killing the VM, we trigger a core
dump of it. libvirt has a facility to generate core dumps of QEMU that
we could use; if we had a way to capture the resulting core dumps, we
could then attach a debugger to investigate.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


