[openstack-dev] [nova] top gate bug is libvirt snapshot

Joe Gordon joe.gordon0 at gmail.com
Tue Jul 8 22:12:27 UTC 2014


On Tue, Jul 8, 2014 at 2:56 PM, Michael Still <mikal at stillhq.com> wrote:

> The associated bug says this is probably a qemu bug, so I think we
> should rephrase that to "we need to start thinking about how to make
> sure upstream changes don't break nova".
>

Good point.


Would running devstack-tempest on the latest upstream release of ? help?
Not as a voting job but as a periodic (third party?) job, so that we can
hopefully identify these issues early on. I think the big question here is
who would volunteer to help run a job like this.


>
> Michael
>
> On Wed, Jul 9, 2014 at 7:50 AM, Joe Gordon <joe.gordon0 at gmail.com> wrote:
> >
> > On Thu, Jun 26, 2014 at 4:12 AM, Daniel P. Berrange <berrange at redhat.com>
> > wrote:
> >>
> >> On Thu, Jun 26, 2014 at 07:00:32AM -0400, Sean Dague wrote:
> >> > While the Trusty transition was mostly uneventful, it has exposed a
> >> > particular issue in libvirt, which is generating ~25% failure rate now
> >> > on most tempest jobs.
> >> >
> >> > As can be seen here -
> >> >
> >> >
> >> > https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L294-L297
> >> >
> >> >
> >> > ... the libvirt live_snapshot code is something that our test pipeline
> >> > has never tested before, because it wasn't a new enough libvirt for us
> >> > to take that path.
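
For anyone not familiar with that check, here is a rough sketch of the kind
of version gate being described. The threshold value and all function names
are made up for illustration and are not nova's actual code; the two version
strings in the usage note are the ones mentioned in this thread.

```python
# Hypothetical minimum version for the live snapshot path (an assumption,
# not the value nova actually uses).
MIN_LIVE_SNAPSHOT_VERSION = (1, 0, 0)


def parse_version(text):
    """Turn a dotted version string such as '1.2.5' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def can_live_snapshot(libvirt_version, minimum=MIN_LIVE_SNAPSHOT_VERSION):
    """Return True when the host libvirt is new enough for the live path."""
    return parse_version(libvirt_version) >= minimum


def choose_snapshot_path(libvirt_version):
    """Pick the snapshot code path based on the libvirt version string."""
    return "live" if can_live_snapshot(libvirt_version) else "cold"
```

With this gate, choose_snapshot_path("0.9.6") never reaches the live branch,
while choose_snapshot_path("1.2.5") does, which is why the live path only
started being exercised after the upgrade.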
> >> >
> >> > Right now it's exploding, a lot -
> >> > https://bugs.launchpad.net/nova/+bug/1334398
> >> >
> >> > Snapshotting gets used in Tempest to create images for testing, so image
> >> > setup tests are doing a decent number of snapshots. If I had to take a
> >> > completely *wild guess*, it's that libvirt can't do 2 live_snapshots at
> >> > the same time. It's probably something that most people haven't hit. The
> >> > wild guess is based on other libvirt issues we've hit that other people
> >> > haven't, and they are basically always a parallel ops triggered problem.
> >> >
> >> > My 'stop the bleeding' suggested fix is this -
> >> > https://review.openstack.org/#/c/102643/ which just effectively disables
> >> > this code path for now. Then we can get some libvirt experts engaged to
> >> > help figure out the right long term fix.
> >>
> >> Yes, this is a sensible pragmatic workaround for the short term until
> >> we diagnose the root cause & fix it.
> >>
> >> > I think there are a couple:
> >> >
> >> > 1) see if newer libvirt fixes this (1.2.5 just came out), and if so
> >> > mandate at some known working version. This would actually take a bunch
> >> > of work to be able to test a non packaged libvirt in our pipeline. We'd
> >> > need volunteers for that.
> >> >
> >> > 2) lock snapshot operations in nova-compute, so that we can only do 1 at
> >> > a time. Hopefully it's just 2 snapshot operations that is the issue, not
> >> > any other libvirt op during a snapshot, so serializing snapshot ops in
> >> > n-compute could put the kid gloves on libvirt and make it not break
> >> > here. This also needs some volunteers as we're going to be playing a
> >> > game of progressive serialization until we get to a point where it looks
> >> > like the failures go away.
> >> >
> >> > 3) Roll back to precise. I put this idea here for completeness, but I
> >> > think it's a terrible choice. This is one isolated, previously untested
> >> > (by us), code path. We can't stay on libvirt 0.9.6 forever, so actually
> >> > need to fix this for real (be it in nova's use of libvirt, or libvirt
> >> > itself).
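
To make option 2 concrete, here is a minimal sketch of what serializing
snapshots behind a single lock could look like. It uses plain Python threads
purely for illustration; serialized_snapshot, ConcurrencyProbe and
run_parallel are hypothetical names, the fake snapshot just sleeps, and a
real change in nova-compute would presumably use nova's own locking helpers
rather than threading directly.

```python
import threading
import time

# One process-wide lock: only one snapshot may run at a time.
_snapshot_lock = threading.Lock()


def serialized_snapshot(do_snapshot, *args, **kwargs):
    """Run do_snapshot while holding the global snapshot lock."""
    with _snapshot_lock:
        return do_snapshot(*args, **kwargs)


class ConcurrencyProbe:
    """Stand-in for a snapshot call that records peak concurrency."""

    def __init__(self):
        self._guard = threading.Lock()
        self.active = 0
        self.peak = 0

    def fake_snapshot(self, instance_id):
        with self._guard:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(0.01)  # pretend the snapshot takes a while
        with self._guard:
            self.active -= 1
        return instance_id


def run_parallel(probe, count=4):
    """Start several snapshot requests at once and report peak concurrency."""
    threads = [
        threading.Thread(target=serialized_snapshot,
                         args=(probe.fake_snapshot, i))
        for i in range(count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return probe.peak
```

Calling run_parallel(ConcurrencyProbe()) shows the requests queueing up
behind the lock instead of overlapping. If failures persisted with something
like this in place, the next step in the "progressive serialization" game
would be to widen what the lock covers.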
> >>
> >> Yep, since we *never* tested this code path in the gate before, rolling
> >> back to precise would not even really be a fix for the problem. It would
> >> merely mean we're not testing the code path again, which is really akin
> >> to sticking our head in the sand.
> >>
> >> > But for right now, we should stop the bleeding, so that nova/libvirt
> >> > isn't blocking everyone else from merging code.
> >>
> >> Agreed, we should merge the hack and treat the bug as a release blocker
> >> to be resolved prior to Juno GA.
> >
> >
> >
> > How can we prevent libvirt issues like this from landing in trunk in the
> > first place? If we don't figure out a way to prevent this from landing in
> > the first place I fear we will keep repeating this same pattern of failure.
> >
> >>
> >>
> >> Regards,
> >> Daniel
> >> --
> >> |: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
> >> |: http://libvirt.org              -o-             http://virt-manager.org :|
> >> |: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> >> |: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
> >>
> >> _______________________________________________
> >> OpenStack-dev mailing list
> >> OpenStack-dev at lists.openstack.org
> >> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
> >
> >
>
>
>
> --
> Rackspace Australia
>