[openstack-dev] Thoughts on the patch test failure rate and moving forward

Sean Dague sean at dague.net
Thu Jul 24 20:01:39 UTC 2014


On 07/24/2014 12:40 PM, Daniel P. Berrange wrote:
> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
> 
>> ==Future changes==
> 
>> ===Fixing Faster===
>>
>> We introduce bugs to OpenStack at some constant rate, which piles up
>> over time. Our systems currently treat all changes as equally risky and
>> important to the health of the system, which makes landing code changes
>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>> process of promoting changes today to get around this, but that's
>> actually quite costly in people time, and takes getting all the right
>> people together at once to promote changes. You can see a number of the
>> changes we promoted during the gate storm in June [3], and it was no
>> small number of fixes to get us back to a reasonably passing gate. We
>> think that optimizing this system will help us land fixes to critical
>> bugs faster.
>>
>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>
>> The basic idea is to use the data from elastic recheck to identify that
>> a patch is fixing a critical gate related bug. When one of these is
>> found in the queues it will be given higher priority, including bubbling
>> up to the top of the gate queue automatically. The manual promote
>> process should no longer be needed, and instead changes fixing elastic
>> recheck tracked issues will be promoted automatically.
>>
>> At the same time we'll also promote review on critical gate bugs through
>> making them visible in a number of different channels (like on elastic
>> recheck pages, review day, and in the gerrit dashboards). The idea here
>> again is to make the reviews that fix key bugs pop to the top of
>> everyone's views.
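
To make the promotion idea quoted above concrete, here is a minimal sketch
of the reordering step. It is only an illustration under stated
assumptions: the Change class, its fields, and reorder_gate_queue are
hypothetical names, and this is not how Zuul or elastic recheck actually
implement anything.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    # Hypothetical stand-in for a change sitting in the gate queue.
    number: int
    # Bug ids the change claims to fix (assumed to be parsed elsewhere).
    fixes_bugs: set = field(default_factory=set)


def reorder_gate_queue(queue, tracked_bugs):
    """Move changes that fix a tracked gate bug to the front of the queue.

    The reorder is stable: promoted changes keep their relative order,
    and so do the remaining changes.
    """
    promoted = [c for c in queue if c.fixes_bugs & tracked_bugs]
    others = [c for c in queue if not (c.fixes_bugs & tracked_bugs)]
    return promoted + others
```

The stable ordering is a deliberate choice in this sketch: a change that
does not fix a tracked bug never jumps ahead of another such change, so
the only visible effect is that the fixes bubble up.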
> 
> In some of the harder gate bugs I've looked at (especially the infamous
> 'live snapshot' timeout bug), it has been damn hard to actually figure
> out what's wrong. AFAIK, no one has ever been able to reproduce it
> outside of the gate infrastructure. I've even gone as far as setting up
> identical Ubuntu VMs to the ones used in the gate on a local cloud, and
> running the tempest tests multiple times, but still can't reproduce what
> happens on the gate machines themselves :-( As such we're relying on
> code inspection and the collected log messages to try and figure out
> what might be wrong.
> 
> The gate collects a lot of info and publishes it, but in this case I
> have found the published logs to be insufficient - I needed to get
> the more verbose libvirtd.log file. devstack has the ability to turn
> this on via an environment variable, but it is disabled by default
> because it would add 3% to the total size of logs collected per gate
> job.

Right now we're at 95% full on 14 TB (the cap set by the max # of volumes
you can attach to a single system in RAX), so every gig is sacred. There has
been a big push, which included the sprint last week in Darmstadt, to
get log data into swift, at which point our available storage goes way up.
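
As a back-of-envelope check on those numbers, here is a small sketch. It
assumes, purely for illustration, that everything on the volume is per-job
log data and that the 3% increase applies uniformly to all of it; the real
retention picture is more complicated than that.

```python
TOTAL_TB = 14.0        # capacity quoted above
USED_FRACTION = 0.95   # "95% full"
LOG_GROWTH = 0.03      # the 3% increase mentioned in the quoted mail


def headroom_tb(total_tb, used_fraction):
    """Free space left on the volume, in TB."""
    return total_tb * (1.0 - used_fraction)


def growth_tb(total_tb, used_fraction, growth):
    """Extra space needed if all stored data grew by `growth`.

    Assumption: all stored data is log data that grows uniformly.
    """
    return total_tb * used_fraction * growth


free = headroom_tb(TOTAL_TB, USED_FRACTION)
extra = growth_tb(TOTAL_TB, USED_FRACTION, LOG_GROWTH)
print(f"free: {free:.2f} TB, extra from verbose logs: {extra:.2f} TB")
```

Under those assumptions there is roughly 0.7 TB free, and the verbose logs
would eat roughly 0.4 TB of it, which is more than half the remaining
headroom.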

So for right now, we're a little squashed. Hopefully within a month
we'll have the full solution.

As soon as we get those kinks out, I'd say we're in a position to flip
on that logging in devstack by default.

> There's no way for me to get that environment variable for devstack
> turned on for a specific review I want to test with. In the end I
> uploaded a change to nova which abused rootwrap to elevate privileges,
> install extra deb packages, reconfigure libvirtd logging and restart
> the libvirtd daemon.
> 
>   https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
>   https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py
> 
> This let me get further, but still not resolve it. My next attack is
> to build a custom QEMU binary and hack nova further so that it can
> download my custom QEMU binary from a website onto the gate machine
> and run the test with it. Failing that I'm going to be hacking things
> to try to attach to QEMU in the gate with GDB and get stack traces.
> Anything is doable thanks to rootwrap giving us a way to elevate
> privileges from Nova, but it is a somewhat tedious approach.
> 
> I'd like us to think about whether there is anything we can do to make
> life easier in these kinds of hard debugging scenarios where the regular
> logs are not sufficient.

Agreed. Honestly, though, we also need to figure out first fail
detection on our logs. Realistically, if we can't debug failures from
those, then I really don't understand how we're ever going to expect
large users to.
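
To give a rough idea of what first fail detection could look like, here is
a minimal sketch. It assumes log lines shaped like
"YYYY-MM-DD HH:MM:SS.mmm PID LEVEL rest" and that each log is already in
chronological order; the regex, FAIL_LEVELS, and first_failure are
hypothetical names for illustration, not existing tooling.

```python
import re
from datetime import datetime

# Assumed line shape: "<date> <time>.<msecs> <pid> <LEVEL> <rest of line>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\d+ (?P<level>[A-Z]+) (?P<rest>.*)$"
)
# Levels treated as failures in this sketch (an assumption).
FAIL_LEVELS = {"ERROR", "CRITICAL", "TRACE"}


def first_failure(logs):
    """Return (timestamp, log_name, line) for the earliest failing line.

    `logs` maps a log name to an iterable of text lines. Returns None
    when no line matches a failure level.
    """
    earliest = None
    for name, lines in logs.items():
        for line in lines:
            m = LINE_RE.match(line)
            if not m or m.group("level") not in FAIL_LEVELS:
                continue
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            if earliest is None or ts < earliest[0]:
                earliest = (ts, name, line.rstrip("\n"))
            # Each log is assumed chronological, so its first failing
            # line is also its earliest one.
            break
    return earliest
```

Comparing only the first failing line of each log keeps the scan cheap,
but it leans on the chronological-order assumption and on all services
sharing a clock.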

	-Sean

-- 
Sean Dague
http://dague.net


