[openstack-dev] [nova] bug 1334398 and libvirt live snapshot support

Kashyap Chamarthy kchamart at redhat.com
Mon Dec 8 22:35:04 UTC 2014


On Mon, Dec 08, 2014 at 09:12:24PM +0000, Jeremy Stanley wrote:
> On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:
> > As Dan Berrangé noted, it's nearly impossible to reproduce this issue
> > independently, outside of the OpenStack gating environment. I brought
> > this up at the recently concluded KVM Forum earlier this October. To
> > debug this any further, one of the QEMU block layer developers asked
> > whether we can get the QEMU instance running in the gate run under
> > `gdb` (IIRC, danpb suggested this too, previously) to get further
> > tracing details.
> 
> We document thoroughly how to reproduce the environments we use for
> testing OpenStack. 

Yep, documentation is appreciated.

> There's nothing rarified about "a Gate run" that anyone with access to
> a public cloud provider would be unable to reproduce, save being able
> to run it over and over enough times to expose less frequent failures.

Sure. To be fair, this was actually tried. At the risk of over-discussing
the topic, allow me to provide a bit more context by quoting Dan's email
from an old thread[1] ("Thoughts on the patch test failure rate and
moving forward", Jul 23, 2014) here for convenience:

    "In some of the harder gate bugs I've looked at (especially the
    infamous 'live snapshot' timeout bug), it has been damn hard to
    actually figure out what's wrong. AFAIK, no one has ever been able
    to reproduce it outside of the gate infrastructure. I've even gone
    as far as setting up identical Ubuntu VMs to the ones used in the
    gate on a local cloud, and running the tempest tests multiple times,
    but still can't reproduce what happens on the gate machines
    themselves :-( As such we're relying on code inspection and the
    collected log messages to try and figure out what might be wrong.

    The gate collects a lot of info and publishes it, but in this case I
    have found the published logs to be insufficient - I needed to get
    the more verbose libvirtd.log file. devstack has the ability to turn
    this on via an environment variable, but it is disabled by default
    because it would add 3% to the total size of logs collected per gate
    job.
    
    There's no way for me to get that environment variable for devstack
    turned on for a specific review I want to test with. In the end I
    uploaded a change to nova which abused rootwrap to elevate
    privileges, install extra deb packages, reconfigure libvirtd logging
    and restart the libvirtd daemon.

       https://review.openstack.org/#/c/103066/11/etc/nova/rootwrap.d/compute.filters
       https://review.openstack.org/#/c/103066/11/nova/virt/libvirt/driver.py

    My next attack is to build a custom QEMU binary and hack nova
    further so that it can download my custom QEMU binary from a website
    onto the gate machine and run the test with it. Failing that I'm
    going to be hacking things to try to attach to QEMU in the gate with
    GDB and get stack traces.  Anything is doable thanks to rootwrap
    giving us a way to elevate privileges from Nova, but it is a
    somewhat tedious approach."


   [1] http://lists.openstack.org/pipermail/openstack-dev/2014-July/041148.html
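
For reference, the "more verbose libvirtd.log" Dan mentions boils down
to a handful of libvirtd logging settings -- roughly the following in
/etc/libvirt/libvirtd.conf (illustrative values, not necessarily the
exact ones devstack's toggle sets), plus a restart of the libvirtd
daemon:

    # Illustrative debug-logging settings for /etc/libvirt/libvirtd.conf.
    # Level 1 == debug; very chatty, which is why the gate leaves it
    # disabled by default.
    log_level = 1
    # Keep the flood somewhat bounded: debug only the interesting drivers
    log_filters = "1:qemu 1:libvirt 3:event 3:util"
    # Write everything at level 1 (debug) and above to a file
    log_outputs = "1:file:/var/log/libvirt/libvirtd.log"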

To add to the above: as you can find in the bug, in one of the many
invocations the issue _was_ reproduced once, albeit with questionable
likelihood (details in the bug).

So, it's not that what you're suggesting was never tried. But, from the
above, you can clearly see what kind of convoluted methods you need to
resort to.

One concrete point from the above: it'd be very useful to have an env
variable that can be toggled to run libvirt/QEMU under `gdb` for
$REVIEW.

(Sure, it's a patch that needs to be worked on. . .)
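
To make that concrete, the end result would be the ability to capture
something like the following from the QEMU process on a gate node when
the live snapshot job wedges (a hand-run sketch, assuming gdb and the
QEMU debug symbols are installed and a single QEMU instance is running
-- wiring this up per-review in devstack/nova is exactly the missing
piece):

    # Attach to the running QEMU and dump backtraces from every thread;
    # in batch mode gdb detaches on exit, so the guest keeps running.
    gdb -p "$(pidof qemu-system-x86_64)" \
        -batch -ex 'thread apply all bt full'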

[. . .]

> The QA team tries very hard to make our integration testing
> environment as closely as possible mimic real-world deployment
> configurations. If these sorts of bugs emerge more often because of,
> for example, resource constraints in the test environment then it
> should be entirely likely they'd also be seen in production with the
> same frequency if run on similarly constrained equipment. And as we've
> observed in the past, any code path we stop testing quickly
> accumulates new bugs that go unnoticed until they impact someone's
> production environment at 3am.

I realize you're raising the point that this should not be taken lightly
-- I hope the context provided in this email demonstrates that it isn't
being taken lightly.


PS: FWIW, I do enable this codepath in my test environments (sure, it's
not *representative*), but I have yet to reproduce the bug.


-- 
/kashyap


