[openstack-dev] Thoughts on the patch test failure rate and moving forward
Joshua Harlow
harlowja at outlook.com
Thu Jul 24 19:38:19 UTC 2014
On Jul 24, 2014, at 12:08 PM, Anita Kuno <anteaya at anteaya.info> wrote:
> On 07/24/2014 12:40 PM, Daniel P. Berrange wrote:
>> On Wed, Jul 23, 2014 at 02:39:47PM -0700, James E. Blair wrote:
>>
>>> ==Future changes==
>>
>>> ===Fixing Faster===
>>>
>>> We introduce bugs to OpenStack at some constant rate, and they pile up
>>> over time. Our systems currently treat all changes as equally risky and
>>> equally important to the health of the system, which makes landing code
>>> changes to fix key bugs slow when we're at a high reset rate. We have a
>>> manual process of promoting changes today to get around this, but it is
>>> actually quite costly in people's time and requires getting all the
>>> right people together at once to promote changes. You can see the
>>> changes we promoted during the gate storm in June [3]; it took no small
>>> number of fixes to get us back to a reasonably passing gate. We think
>>> that optimizing this system will help us land fixes to critical bugs
>>> faster.
>>>
>>> [3] https://etherpad.openstack.org/p/gatetriage-june2014
>>>
>>> The basic idea is to use the data from elastic recheck to identify that
>>> a patch is fixing a critical gate-related bug. When one of these is
>>> found in the queues it will be given higher priority, including bubbling
>>> up to the top of the gate queue automatically. The manual promote
>>> process should no longer be needed; instead, changes fixing
>>> elastic-recheck-tracked issues will be promoted automatically.
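
(For illustration only, a rough sketch of what automatic promotion based
on elastic recheck data could look like. The names below are
hypothetical, not actual Zuul or elastic recheck internals; the only real
convention assumed is the "Closes-Bug: #NNNNNN" commit message footer.)

    # Hypothetical sketch; not real Zuul/elastic-recheck code.
    import re

    # Bug numbers that elastic recheck currently tracks as gate bugs
    # (placeholder values).
    ER_TRACKED_BUGS = {1111111, 2222222}

    def bugs_closed_by(commit_message):
        """Extract 'Closes-Bug: #NNNNNN' references from a commit message."""
        return {int(n) for n in re.findall(r'Closes-Bug:\s*#?(\d+)',
                                           commit_message)}

    def should_promote(change):
        """Promote a change that claims to fix a tracked gate bug."""
        return bool(bugs_closed_by(change.commit_message) & ER_TRACKED_BUGS)

    # A queue manager could then sort promoted changes ahead of the rest:
    #   queue.sort(key=lambda c: (not should_promote(c), c.enqueue_time))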
>>>
>>> At the same time we'll also promote review of fixes for critical gate
>>> bugs by making them visible in a number of different channels (like on
>>> elastic recheck pages, review day, and in the gerrit dashboards). The
>>> idea here again is to make the reviews that fix key bugs pop to the top
>>> of everyone's views.
>>
>> In some of the harder gate bugs I've looked at (especially the infamous
>> 'live snapshot' timeout bug), it has been damn hard to actually figure
>> out what's wrong. AFAIK, no one has ever been able to reproduce it
>> outside of the gate infrastructure. I've even gone as far as setting up
>> Ubuntu VMs identical to the ones used in the gate on a local cloud and
>> running the tempest tests multiple times, but I still can't reproduce
>> what happens on the gate machines themselves :-( As such we're relying
>> on code inspection and the collected log messages to try to figure out
>> what might be wrong.
>>
>> The gate collects a lot of info and publishes it, but in this case I
>> have found the published logs to be insufficient: I needed the more
>> verbose libvirtd.log file. devstack has the ability to turn
>> this on via an environment variable, but it is disabled by default
>> because it would add 3% to the total size of logs collected per gate
>> job.
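
(For anyone trying to reproduce this locally: the devstack toggle aside,
the underlying knobs are just the log_filters/log_outputs settings in
/etc/libvirt/libvirtd.conf. A minimal sketch, assuming root access on the
test node and a distro-appropriate service name:)

    # Sketch: enable verbose libvirtd logging on a local test node.
    # Not what devstack itself runs; the service name varies by distro
    # (e.g. libvirt-bin on the Ubuntu releases used in the gate).
    import subprocess

    LIBVIRTD_CONF = "/etc/libvirt/libvirtd.conf"
    VERBOSE_LOGGING = (
        'log_filters="1:libvirt 1:qemu 1:security"\n'
        'log_outputs="1:file:/var/log/libvirt/libvirtd.log"\n'
    )

    def enable_verbose_libvirtd_logging():
        # Append debug-level filters/outputs and restart the daemon.
        with open(LIBVIRTD_CONF, "a") as conf:
            conf.write(VERBOSE_LOGGING)
        subprocess.check_call(["service", "libvirt-bin", "restart"])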
>>
>> There's no way for me to get that devstack environment variable turned
>> on for a specific review I want to test with. In the end I uploaded a
>> change to nova which abused rootwrap to elevate privileges, install
>> extra deb packages, reconfigure libvirtd logging, and restart the
>> libvirtd daemon.
>>
>> https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
>> https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py
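
(The general shape of that trick is an extra rootwrap filter plus a
privileged call from the driver. A rough sketch follows; the helper
script name and filter entry are made up for illustration, and the real
changes are in the review linked above.)

    # Illustrative only; see the linked review for the actual diff.
    #
    # etc/nova/rootwrap.d/compute.filters would gain an entry such as:
    #
    #   [Filters]
    #   debug_libvirt: CommandFilter, /usr/local/bin/debug-libvirt.sh, root
    #
    # after which the libvirt driver can run that helper as root:
    from nova import utils

    def _enable_libvirt_debug_logging():
        # rootwrap elevates this to root on the gate node.
        utils.execute('/usr/local/bin/debug-libvirt.sh', run_as_root=True)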
>>
>> This let me get further, but still not resolve it. My next attack is
>> to build a custom QEMU binary and hack nova further so that it can
>> download that binary from a website onto the gate machine and run the
>> test with it. Failing that, I'm going to hack things to attach GDB to
>> QEMU in the gate and get stack traces. Anything is doable thanks to
>> rootwrap giving us a way to elevate privileges from Nova, but it is a
>> somewhat tedious approach.
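
(The GDB step is mechanically simple once rootwrap lets you run commands
as root on the node. A rough sketch of grabbing backtraces from a running
QEMU; useful output still depends on having QEMU debug symbols
installed:)

    # Sketch: dump stack traces from a running QEMU process via gdb.
    import subprocess

    def qemu_backtraces():
        pid = subprocess.check_output(
            ['pidof', 'qemu-system-x86_64']).decode().split()[0]
        return subprocess.check_output(
            ['gdb', '-p', pid, '-batch', '-ex', 'thread apply all bt'])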
>>
>> I'd like us to think about whether there is anything we can do to make
>> life easier in these kinds of hard debugging scenarios where the
>> regular logs are not sufficient.
>>
>> Regards,
>> Daniel
>>
> For really, really difficult bugs that can't be reproduced outside the
> gate, we do have the ability to hold VMs if we know they are displaying
> the bug and if they are caught before the VM in question is scheduled
> for deletion. In that case, make your intentions known in a discussion
> with a member of infra-root. A conversation will ensue about what to do
> to get you what you need to continue debugging.
>
Why? Is space really that expensive? It boggles my mind a little that we
have a well-financed foundation (afaik, correct me if I am wrong) yet
can't save 'all' the things in a smart manner: saving snapshots of all
the VMs doesn't mean saving hundreds or thousands of gigabytes when you
are using de-duplicating cinder/glance backends. Expire those VMs after
a week if that helps, but it feels like we shouldn't be so conservative
about developers' need to have access to the VMs the gate used and
created. It's not like developers are trying to 'harm' OpenStack by
investigating root issues that raw access to the VM images can help
resolve; in fact, it's quite the contrary.
> It doesn't work in all cases, but some have found it helpful. Keep in
> mind you will be asked to demonstrate you have tried all other avenues
> before this one is exercised.
>
> Thanks,
> Anita.
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev