[openstack-dev] [nova] bug 1334398 and libvirt live snapshot support
Matt Riedemann
mriedem at linux.vnet.ibm.com
Wed Jan 14 22:03:08 UTC 2015
On 12/8/2014 3:12 PM, Jeremy Stanley wrote:
> On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:
>> As Dan Berrangé noted, it's nearly impossible to reproduce this issue
>> independently outside of OpenStack Gating environment. I brought this up
>> at the recently concluded KVM Forum earlier this October. To debug this
>> any further, one of the QEMU block layer developers asked if we can get
>> QEMU instance running on Gate run under `gdb` (IIRC, danpb suggested
>> this too, previously) to get further tracing details.
>
> We document thoroughly how to reproduce the environments we use for
> testing OpenStack. There's nothing rarified about "a Gate run" that
> anyone with access to a public cloud provider would be unable to
> reproduce, save being able to run it over and over enough times to
> expose less frequent failures.
>
>> FWIW, I myself couldn't reproduce it independently via libvirt
>> alone or via QMP (QEMU Machine Protocol) commands.
>>
>> Dan's workaround ("enable it permanently, except for under the
>> gate") sounds sensible to me.
> [...]
>
> I'm dubious of this as it basically says "we know this breaks
> sometimes, so we're going to stop testing that it works at all and
> possibly let it get even more broken, but you should be safe to rely
> on it anyway."
>
> The QA team tries very hard to make our integration testing
> environment as closely as possible mimic real-world deployment
> configurations. If these sorts of bugs emerge more often because of,
> for example, resource constraints in the test environment then it
> should be entirely likely they'd also be seen in production with the
> same frequency if run on similarly constrained equipment. And as
> we've observed in the past, any code path we stop testing quickly
> accumulates new bugs that go unnoticed until they impact someone's
> production environment at 3am.
>
Bringing this back up since Jesse Keating was asking about it on IRC
again today. It sounds like a few people are running this in labs
without problems. Maybe they are patching libvirt/qemu, I don't know.
But we have other things that we know have broken parts, and that's why
they run on the experimental queue, e.g. cells and nova + ceph/rbd. We
also know we're a bit busted in the ec2 API right now with the latest
boto release (2.35.1), so we have a cap on that.

These issues are being worked on, but with the particular way we've
disabled this function (a version cap in the code), someone has to go
in and patch that out, which kind of sucks when they could have just
used a config option to enable it at their own risk.

That's why I'm proposing something like an [experimental] group. We
could put this into the [workarounds] group, but this isn't really a
workaround for anything, so that doesn't make sense to me.

I'd personally be OK with putting it into the [libvirt] group, with a
warning in the config option help and in the code that this isn't
currently tested in the gate, so we aren't sure it's going to work.
We've done the same for cells and some of the virt drivers, e.g.
libvirt on non-x86_64/QEMU systems.
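
To make the idea concrete, here is a rough sketch of what such an opt-in
could look like. It is not nova code: it uses the Python standard
library's configparser instead of nova's real config machinery, and the
option name enable_live_snapshot (along with the helper names) is made
up for illustration. The point is only that the default stays "off" and
that turning it on logs a warning about the lack of gate coverage.

```python
import configparser
import logging

LOG = logging.getLogger("live_snapshot_sketch")

# Hypothetical group and option names; the thread has not settled on either.
OPT_GROUP = "libvirt"
OPT_NAME = "enable_live_snapshot"


def load_conf(text):
    """Parse an INI-style config string (stand-in for reading nova.conf)."""
    conf = configparser.ConfigParser()
    conf.read_string(text)
    return conf


def live_snapshot_enabled(conf):
    """Return True only if the operator explicitly opted in.

    A missing group or option falls back to False, so the feature stays
    disabled unless someone deliberately turns it on.
    """
    enabled = conf.getboolean(OPT_GROUP, OPT_NAME, fallback=False)
    if enabled:
        LOG.warning(
            "Live snapshot is enabled via [%s] %s. This code path is not "
            "currently tested in the gate, so use it at your own risk.",
            OPT_GROUP,
            OPT_NAME,
        )
    return enabled
```

With something like this, an operator who wants the feature would set
enable_live_snapshot = true under [libvirt] instead of patching the
version cap out of the code.
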
--
Thanks,
Matt Riedemann