[openstack-dev] [nova] top gate bug is libvirt snapshot
Michael Still
mikal at stillhq.com
Tue Jul 8 21:56:21 UTC 2014
The associated bug says this is probably a qemu bug, so I think we
should rephrase that to "we need to start thinking about how to make
sure upstream changes don't break nova".
Michael
On Wed, Jul 9, 2014 at 7:50 AM, Joe Gordon <joe.gordon0 at gmail.com> wrote:
>
> On Thu, Jun 26, 2014 at 4:12 AM, Daniel P. Berrange <berrange at redhat.com>
> wrote:
>>
>> On Thu, Jun 26, 2014 at 07:00:32AM -0400, Sean Dague wrote:
>> > While the Trusty transition was mostly uneventful, it has exposed a
>> > particular issue in libvirt, which is generating ~ 25% failure rate now
>> > on most tempest jobs.
>> >
>> > As can be seen here -
>> >
>> > https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L294-L297
>> >
>> >
>> > ... the libvirt live_snapshot code is something that our test pipeline
>> > has never tested before, because it wasn't a new enough libvirt for us
>> > to take that path.
>> >
>> > Right now it's exploding, a lot -
>> > https://bugs.launchpad.net/nova/+bug/1334398
>> >
>> > Snapshotting gets used in Tempest to create images for testing, so image
>> > setup tests are doing a decent number of snapshots. If I had to take a
>> > completely *wild guess*, it's that libvirt can't do 2 live_snapshots at
>> > the same time. It's probably something that most people haven't hit. The
>> > wild guess is based on other libvirt issues we've hit that other people
>> > haven't, and they are basically always a parallel ops triggered problem.
>> >
>> > My 'stop the bleeding' suggested fix is this -
>> > https://review.openstack.org/#/c/102643/ which just effectively disables
>> > this code path for now. Then we can get some libvirt experts engaged to
>> > help figure out the right long term fix.
>>
>> Yes, this is a sensible pragmatic workaround for the short term until
>> we diagnose the root cause & fix it.
>>
>> > I think there are a couple:
>> >
>> > 1) see if newer libvirt fixes this (1.2.5 just came out), and if so
>> > mandate at some known working version. This would actually take a bunch
>> > of work to be able to test a non packaged libvirt in our pipeline. We'd
>> > need volunteers for that.
>> >
>> > 2) lock snapshot operations in nova-compute, so that we can only do 1 at
>> > a time. Hopefully it's just 2 snapshot operations that is the issue, not
>> > any other libvirt op during a snapshot, so serializing snapshot ops in
>> > n-compute could put the kid gloves on libvirt and make it not break
>> > here. This also needs some volunteers as we're going to be playing a
>> > game of progressive serialization until we get to a point where it looks
>> > like the failures go away.
>> >
>> > 3) Roll back to precise. I put this idea here for completeness, but I
>> > think it's a terrible choice. This is one isolated, previously untested
>> > (by us), code path. We can't stay on libvirt 0.9.6 forever, so we
>> > actually need to fix this for real (be it in nova's use of libvirt, or
>> > libvirt itself).
>>
>> Yep, since we *never* tested this code path in the gate before, rolling
>> back to precise would not even really be a fix for the problem. It would
>> merely mean we're not testing the code path again, which is really akin
>> to sticking our head in the sand.
>>
>> > But for right now, we should stop the bleeding, so that nova/libvirt
>> > isn't blocking everyone else from merging code.
>>
>> Agreed, we should merge the hack and treat the bug as a release blocker
>> to be resolved prior to Juno GA.
>
>
>
> How can we prevent libvirt issues like this from landing in trunk in the
> first place? If we don't figure out a way to do that, I fear we will keep
> repeating this same pattern of failure.
>
>>
>>
>> Regards,
>> Daniel
>> --
>> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
>> |: http://libvirt.org -o- http://virt-manager.org :|
>> |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
>> |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
>>
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
>
>
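For reference, option 2 in Sean's quoted mail (serializing snapshot
operations in nova-compute so that only one runs at a time) might look
roughly like the sketch below. This is not nova code: the decorator name,
the FakeSnapshotter class and the sleep that stands in for the libvirt call
are all hypothetical, and it uses plain Python threads rather than whatever
concurrency model nova-compute actually runs under. It only shows the
locking idea.

```python
import functools
import threading
import time

# Hypothetical process-wide lock; any decorated call must hold it to run.
_snapshot_lock = threading.Lock()


def serialized(func):
    """Run decorated callables one at a time within this process."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _snapshot_lock:
            return func(*args, **kwargs)
    return wrapper


class FakeSnapshotter(object):
    """Hypothetical stand-in for the driver; records peak concurrency."""

    def __init__(self):
        self._count_lock = threading.Lock()
        self.active = 0
        self.peak = 0

    @serialized
    def live_snapshot(self, instance_name):
        with self._count_lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # pretend to talk to libvirt here
        with self._count_lock:
            self.active -= 1
        return instance_name


def run_parallel_snapshots(n=4):
    """Start n snapshot requests at once and report the peak overlap."""
    snap = FakeSnapshotter()
    threads = [
        threading.Thread(target=snap.live_snapshot, args=("vm-%d" % i,))
        for i in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return snap.peak
```

With the decorator in place the peak overlap stays at 1 however many
requests arrive together. Note that a lock like this only serializes
snapshots within a single process, and it would not help if the real
problem is some other libvirt operation running during a snapshot, which
is the "progressive serialization" concern Sean raises.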
--
Rackspace Australia