[openstack-dev] [nova] fair standards for all hypervisor drivers

Daniel P. Berrange berrange at redhat.com
Wed Jul 16 15:28:54 UTC 2014


On Wed, Jul 16, 2014 at 08:12:47AM -0700, Clark Boylan wrote:
> On Wed, Jul 16, 2014 at 7:50 AM, Daniel P. Berrange <berrange at redhat.com> wrote:
> > On Wed, Jul 16, 2014 at 04:15:40PM +0200, Sean Dague wrote:
> >> Recently the main gate updated from Ubuntu 12.04 to 14.04, and in doing
> >> so we started executing the livesnapshot code in the nova libvirt
> >> driver. Which fails about 20% of the time in the gate, as we're bringing
> >> computes up and down while doing a snapshot. Dan Berrange did a bunch of
> >> debug on that and thinks it might be a qemu bug. We disabled these code
> >> paths, so live snapshot has now been ripped out.
> >>
> >> In January we also triggered a libvirt bug, and had to carry a private
> >> build of libvirt for 6 weeks in order to let people merge code in OpenStack.
> >>
> >> We never were able to switch to libvirt 1.1.1 in the gate using the
> >> Ubuntu Cloud Archive during Icehouse development, because it has a
> >> different set of failures that would have prevented people from merging
> >> code.
> >>
> >> Based on these experiences, libvirt version differences seem to be as
> >> substantial as major hypervisor differences.
> >
> > I think that is a pretty dubious conclusion to draw from just a
> > couple of bugs. The reason they really caused pain is that
> > the CI test system was based on an old version for too long. If it
> > were tracking current upstream version of libvirt/KVM we'd have
> > seen the problem much sooner & been able to resolve it during
> > review of the change introducing the feature, as we do with any
> > other bugs we encounter in software such as the breakage we see
> > with any stuff off pypi.
> 
> How do you suggest we do this effectively with libvirt? In the past we
> have tried to use newer versions of libvirt and they completely broke.
> And the time to fix that was non-trivial. For most of our pypi
> stuff we attempt to fix upstream and if that does not happen quickly
> we pin (arguably we don't do this well either, see the sqlalchemy<=0.7
> issues of the past).

The real big problem we had was the firewall deadlock problem. When
I was made aware of that problem I worked on fixing it in upstream
libvirt immediately. IIRC we had a solution in a week or two, which
was added to a libvirt stable release update. Much of the further
delay was in waiting for the fixes to make their way into the
Ubuntu repositories. If the gate were ignoring Ubuntu repos and
pulling the latest upstream libvirt, then we could have just pinned
to an older libvirt until the fix was pushed out to a stable
libvirt release. The libvirt community release process is flexible
enough to push out priority bug fix releases in a matter of days,
or less, if needed. So temporarily pinning isn't the end of the
world in that respect.

> I am worried that we would just regress to the current process because
> we have tried something similar to this previously and were forced to
> regress to the current process.

IMHO the longer we wait between updating the gate to new versions,
the bigger the problems we create for ourselves. E.g. we were switching
from 0.9.8, released Dec 2011, to 1.1.1, released Jun 2013, so we
were exposed to over 1 1/2 years' worth of code churn in a single
event. The fact that we only hit a couple of bugs in that is actually
remarkable given the amount of feature development that had gone into
libvirt in that time. If we had been tracking each intervening libvirt
release I expect the majority of updates would have had no ill effect
on us at all. For the couple of releases where there was a problem we
would not be forced to roll back to a version years older again, we'd
just drop back to the previous release, at most 1 month older.
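
To make the "track every release, drop back one on trouble" idea concrete,
here is a rough sketch in Python. It is not anything the gate actually
runs; the function names choose_pin and parse_version and the release
list are made up purely for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as "1.2.2" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def choose_pin(releases, known_bad):
    """Return the newest release that is not known to be broken.

    `releases` is a list of version strings the gate could install and
    `known_bad` is a set of versions that broke the gate. Both are
    hypothetical inputs for this sketch. Returns None if every release
    is marked bad.
    """
    for version in sorted(releases, key=parse_version, reverse=True):
        if version not in known_bad:
            return version
    return None


# Illustrative release list only.
releases = ["1.2.0", "1.2.1", "1.2.2"]

print(choose_pin(releases, known_bad=set()))      # prints 1.2.2
print(choose_pin(releases, known_bad={"1.2.2"}))  # prints 1.2.1
```

With frequent updates the fallback is only ever one release behind, and the
pin can be lifted as soon as a fixed release appears in the list.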

Ultimately, thanks to us identifying & fixing those previously seen
bugs, we did just switch from 0.9.8 to 1.2.2, which is a 2 1/2 year
jump, and the only problem we've hit is the live snapshot problem
which appears to be a QEMU bug.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


