Open Stack

Wed Jan 14 23:34:14 UTC 2015

On 1/14/2015 4:03 PM, Matt Riedemann wrote:
>
>
> On 12/8/2014 3:12 PM, Jeremy Stanley wrote:
>> On 2014-12-08 11:45:36 +0100 (+0100), Kashyap Chamarthy wrote:
>>> As Dan Berrangé noted, it's nearly impossible to reproduce this issue
>>> independently outside of OpenStack Gating environment. I brought this up
>>> at the recently concluded KVM Forum earlier this October. To debug this
>>> any further, one of the QEMU block layer developers asked if we can get
>>> QEMU instance running on Gate run under `gdb` (IIRC, danpb suggested
>>> this too, previously) to get further tracing details.
>>
>> We document thoroughly how to reproduce the environments we use for
>> testing OpenStack. There's nothing rarified about "a Gate run" that
>> anyone with access to a public cloud provider would be unable to
>> reproduce, save being able to run it over and over enough times to
>> expose less frequent failures.
>>
>>> FWIW, I myself couldn't reproduce it independently via libvirt
>>> alone or via QMP (QEMU Machine Protocol) commands.
>>>
>>> Dan's workaround ("enable it permanently, except for under the
>>> gate") sounds sensible to me.
>> [...]
>>
>> I'm dubious of this as it basically says "we know this breaks
>> sometimes, so we're going to stop testing that it works at all and
>> possibly let it get even more broken, but you should be safe to rely
>> on it anyway."
>>
>> The QA team tries very hard to make our integration testing
>> environment as closely as possible mimic real-world deployment
>> configurations. If these sorts of bugs emerge more often because of,
>> for example, resource constraints in the test environment then it
>> should be entirely likely they'd also be seen in production with the
>> same frequency if run on similarly constrained equipment. And as
>> we've observed in the past, any code path we stop testing quickly
>> accumulates new bugs that go unnoticed until they impact someone's
>> production environment at 3am.
>>
>
> Bringing this back up since Jesse Keating was asking about it on IRC
> again today. We've heard from a few people who are running this in labs
> without problems. Maybe they are patching libvirt/qemu, I don't know.
> But we have other things with known-broken parts, which is why they
> run on the experimental queue, e.g. cells and nova + ceph/rbd. We also
> know the ec2 API is a bit busted right now with the latest boto release
> (2.35.1), so we have a cap on that.
>
> These issues are being worked on. But with the particular way that
> we've disabled this function (a version cap in the code), someone
> has to go in and patch that out, which kind of sucks if they could have
> just used a config option to enable it at their own risk.
>
> That's why I'm proposing something like an [experimental] group. We
> could put this into the [workarounds] group but this isn't really a
> workaround for anything so that doesn't really make sense to me.
>
> I'd personally be OK with putting it into the [libvirt] group with a
> warning in the config option help and code that this isn't currently
> tested in the gate so we aren't sure it's going to work, which we've
> done for cells and some of the virt drivers, e.g. libvirt on
> non-x86_64/QEMU systems.
>
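
To make the config-option idea quoted above concrete, here is a rough sketch of an opt-in flag that defaults to off and warns when enabled. This is not nova code: the class LibvirtOptions, the option name enable_live_snapshot, and the helper can_live_snapshot are all hypothetical names made up for illustration, and a real patch would presumably register the option through nova's usual config machinery rather than a dataclass.

```python
import logging
from dataclasses import dataclass

LOG = logging.getLogger(__name__)


@dataclass
class LibvirtOptions:
    """Stand-in for a [libvirt] config group (hypothetical, not nova's)."""

    # Hypothetical option name. Off by default because the code path is
    # not currently tested in the gate.
    enable_live_snapshot: bool = False


def can_live_snapshot(opts: LibvirtOptions, host_supports_it: bool) -> bool:
    """Return True only if the host can do it AND the operator opted in."""
    if not host_supports_it:
        # No override can help if libvirt/qemu lack the capability.
        return False
    if not opts.enable_live_snapshot:
        # Default: keep the feature disabled, as it is today.
        return False
    # The operator enabled it at their own risk; say so in the logs.
    LOG.warning("Live snapshot is enabled but is not tested in the gate; "
                "use at your own risk.")
    return True
```

With something along these lines, an operator would flip one option instead of patching the version cap out of the code, and the default behaviour would stay as it is now.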

I'm going to play with this revert [1] on the Fedora 21 experimental
queue, which is running libvirt 1.2.9 and qemu 2.1.2.

Join me, won't you? :)

[1] https://review.openstack.org/#/c/147332/

-- 

Thanks,

Matt Riedemann


[openstack-dev] [nova] bug 1334398 and libvirt live snapshot support
