[openstack-dev] Thoughts on the patch test failure rate and moving forward
Anita Kuno
anteaya at anteaya.info
Thu Jul 24 19:08:59 UTC 2014
On 07/24/2014 12:40 PM, Daniel P. Berrange wrote:
> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>
>> ==Future changes==
>
>> ===Fixing Faster===
>>
>> We introduce bugs to OpenStack at some constant rate, which piles up
>> over time. Our systems currently treat all changes as equally risky and
>> important to the health of the system, which makes landing code changes
>> to fix key bugs slow when we're at a high reset rate. We've got a manual
>> process of promoting changes today to get around this, but that's
>> actually quite costly in people time, and takes getting all the right
>> people together at once to promote changes. You can see a number of the
>> changes we promoted during the gate storm in June [3], and it was no
>> small number of fixes to get us back to a reasonably passing gate. We
>> think that optimizing this system will help us land fixes to critical
>> bugs faster.
>>
>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>
>> The basic idea is to use the data from elastic recheck to identify that
>> a patch is fixing a critical gate related bug. When one of these is
>> found in the queues it will be given higher priority, including bubbling
>> up to the top of the gate queue automatically. The manual promote
>> process should no longer be needed, and instead changes fixing elastic
>> recheck tracked issues will be promoted automatically.
>>
>> At the same time we'll also promote review on critical gate bugs through
>> making them visible in a number of different channels (like on elastic
>> recheck pages, review day, and in the gerrit dashboards). The idea here
>> again is to make the reviews that fix key bugs pop to the top of
>> everyone's views.
>
> In some of the harder gate bugs I've looked at (especially the infamous
> 'live snapshot' timeout bug), it has been damn hard to actually figure
> out what's wrong. AFAIK, no one has ever been able to reproduce it
> outside of the gate infrastructure. I've even gone as far as setting up
> identical Ubuntu VMs to the ones used in the gate on a local cloud, and
> running the tempest tests multiple times, but still can't reproduce what
> happens on the gate machines themselves :-( As such we're relying on
> code inspection and the collected log messages to try and figure out
> what might be wrong.
>
> The gate collects a lot of info and publishes it, but in this case I
> have found the published logs to be insufficient - I needed to get
> the more verbose libvirtd.log file. devstack has the ability to turn
> this on via an environment variable, but it is disabled by default
> because it would add 3% to the total size of logs collected per gate
> job.
>
> There's no way for me to get that environment variable for devstack
> turned on for a specific review I want to test with. In the end I
> uploaded a change to nova which abused rootwrap to elevate privileges,
> install extra deb packages, reconfigure libvirtd logging and restart
> the libvirtd daemon.
>
> https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
> https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py
>
> This let me get further, but still not resolve it. My next attack is
> to build a custom QEMU binary and hack nova further so that it can
> download my custom QEMU binary from a website onto the gate machine
> and run the test with it. Failing that I'm going to be hacking things
> to try to attach to QEMU in the gate with GDB and get stack traces.
> Anything is doable thanks to rootwrap giving us a way to elevate
> privileges from Nova, but it is a somewhat tedious approach.
>
> I'd like us to think about whether there is anything we can do to make
> life easier in these kinds of hard debugging scenarios where the regular
> logs are not sufficient.
>
> Regards,
> Daniel
>
For really difficult bugs that can't be reproduced outside the
gate, we do have the ability to hold VMs that we know are displaying
the bug, provided they are caught before the VM in question is
scheduled for deletion. In this case, make your intentions known in a
discussion with a member of infra-root. A conversation will ensue
about what to do to get you what you need to continue debugging.
It doesn't work in all cases, but some have found it helpful. Keep in
mind you will be asked to demonstrate you have tried all other avenues
before this one is exercised.
Thanks,
Anita.