[openstack-dev] [nova] top gate bug is libvirt snapshot

Sean Dague sean at dague.net
Thu Jun 26 11:00:32 UTC 2014


While the Trusty transition was mostly uneventful, it has exposed a
particular issue in libvirt, which is now causing a ~25% failure rate
on most tempest jobs.

As can be seen here -
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L294-L297


... the libvirt live_snapshot code is something that our test pipeline
has never tested before, because the libvirt we were running wasn't new
enough for us to take that path.
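
To illustrate the kind of version gating being described, here is a
minimal sketch. It is not nova's actual code: the function names and the
ASSUMED_MIN_LIVE_SNAPSHOT threshold are made up for the example, and the
real values are in the driver linked above. The only thing taken as
given is libvirt's convention of packing a version into one integer as
major * 1000000 + minor * 1000 + micro.

```python
# Illustrative sketch only; names and the threshold are assumptions,
# not nova's actual implementation.


def version_to_int(version):
    """Pack (major, minor, micro) the way libvirt reports versions."""
    major, minor, micro = version
    return major * 1000000 + minor * 1000 + micro


# Hypothetical minimum; the real threshold lives in the linked driver.
ASSUMED_MIN_LIVE_SNAPSHOT = (1, 0, 0)


def can_live_snapshot(running, minimum=ASSUMED_MIN_LIVE_SNAPSHOT):
    """Return True if the running libvirt is new enough for the live path."""
    return version_to_int(running) >= version_to_int(minimum)


if __name__ == "__main__":
    # An old libvirt never reaches the live_snapshot code at all.
    print(can_live_snapshot((0, 9, 6)))
    # A newer libvirt takes the path, which is how it suddenly gets tested.
    print(can_live_snapshot((1, 2, 5)))
```

With a gate like this, a code path can sit untested for a long time and
then become active the moment the platform's libvirt is upgraded.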

Right now it's exploding, a lot -
https://bugs.launchpad.net/nova/+bug/1334398

Snapshotting gets used in Tempest to create images for testing, so image
setup tests are doing a decent number of snapshots. If I had to take a
completely *wild guess*, it's that libvirt can't do 2 live_snapshots at
the same time. It's probably something that most people haven't hit. The
wild guess is based on other libvirt issues we've hit that other people
haven't; those have basically always been triggered by parallel ops.

My 'stop the bleeding' suggested fix is this -
https://review.openstack.org/#/c/102643/ which just effectively disables
this code path for now. Then we can get some libvirt experts engaged to
help figure out the right long term fix.

I think there are a few options:

1) see if newer libvirt fixes this (1.2.5 just came out), and if so
mandate at some known working version. This would actually take a bunch
of work to be able to test a non packaged libvirt in our pipeline. We'd
need volunteers for that.

2) lock snapshot operations in nova-compute, so that we can only do 1 at
a time. Hopefully the issue is just 2 snapshot operations running at
once, not any other libvirt op during a snapshot, so serializing
snapshot ops in n-compute could put the kid gloves on libvirt and make
it not break here. This also needs some volunteers as we're going to be
playing a game of progressive serialization until we get to a point
where it looks like the failures go away.

3) Roll back to Precise. I put this idea here for completeness, but I
think it's a terrible choice. This is one isolated, previously untested
(by us), code path. We can't stay on libvirt 0.9.6 forever, so we
actually need to fix this for real (be it in nova's use of libvirt, or
libvirt itself).
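
As a rough sketch of what option 2 could look like, the snippet below
wraps a snapshot call in a single process-wide lock so that only one
runs at a time. Everything in it is hypothetical: `serialized`,
`snapshot`, and the `do_snapshot` callable are stand-ins for the
example, and plain threads are used only to keep it self-contained
(nova-compute's real concurrency model and locking helpers differ).

```python
import functools
import threading
import time

# One process-wide lock: at most one snapshot runs at a time.
_snapshot_lock = threading.Lock()


def serialized(lock):
    """Decorator that runs the wrapped function while holding `lock`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with lock:
                return func(*args, **kwargs)
        return wrapper
    return decorator


@serialized(_snapshot_lock)
def snapshot(instance_name, do_snapshot):
    # `do_snapshot` stands in for the real libvirt call (hypothetical).
    return do_snapshot(instance_name)


def demo(workers=4):
    """Start several snapshots at once; return the peak number in flight."""
    state = {"active": 0, "peak": 0}
    guard = threading.Lock()  # protects the counters only

    def fake_snapshot(name):
        with guard:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)  # pretend the snapshot takes a while
        with guard:
            state["active"] -= 1
        return name + "-snap"

    threads = [
        threading.Thread(target=snapshot, args=("vm%d" % i, fake_snapshot))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


if __name__ == "__main__":
    print(demo())
```

Widening the same lock to cover other libvirt calls would be the
"progressive serialization" step if locking snapshots alone turns out
not to be enough.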

There might be other options as well, ideas welcomed.

But for right now, we should stop the bleeding, so that nova/libvirt
isn't blocking everyone else from merging code.

	-Sean

-- 
Sean Dague
http://dague.net


