[openstack-dev] [nova] top gate bug is libvirt snapshot

Daniel P. Berrange berrange at redhat.com
Wed Jul 9 12:43:26 UTC 2014


On Wed, Jul 09, 2014 at 08:34:06AM -0400, Sean Dague wrote:
> On 07/09/2014 03:58 AM, Daniel P. Berrange wrote:
> > On Tue, Jul 08, 2014 at 02:50:40PM -0700, Joe Gordon wrote:
> >>>> But for right now, we should stop the bleeding, so that nova/libvirt
> >>>> isn't blocking everyone else from merging code.
> >>>
> >>> Agreed, we should merge the hack and treat the bug as a release blocker
> >>> to be resolved prior to Juno GA.
> >>>
> >>
> >>
> >> How can we prevent libvirt issues like this from landing in trunk in the
> >> first place? If we don't figure out a way to prevent this from landing in
> >> the first place, I fear we will keep repeating this same pattern of failure.
> 
> Right, this is where math is against us. If a race shows up 1% of the
> time, you need 69 runs to have a 50% chance of seeing it. I still haven't
> calibrated the bugs to an absolute scale, but I think based on what I
> remember this live snapshot bug was probably a 3-4% bug (per Tempest
> run). So you'd need about 50 Tempest runs to have an 80% chance of seeing
> it show up again.
> 
> (Absolute calibration of the bugs is on my todo list for Elastic
> Recheck, maybe it's time to put that in front of fixing the bugs)
> 
> > Realistically I don't think there was much/any chance of avoiding this
> > problem. Despite many days of work trying to reproduce it by multiple
> > people, no one has managed even 1 single failure outside of the gate.
> > Even inside the gate it is hard to reproduce. I still have absolutely
> > no clue what is failing after days of investigation & debugging with
> > all the tricks I can think of, because as I say, it works perfectly
> > every time I try it, except in the gate where it is impossible to
> > debug it.
> 
> Out of curiosity, is your reproducer using eventlet? My expectation is
> that eventlet's concurrency actually exacerbates this, because when the
> snapshot starts we're now doing IO, and that means it's exactly the time
> that other compute work will be triggered.

I've tried both running the tempest suite itself, and also running
a dedicated stress test written against libvirt snapshot APIs in C.
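
For reference, the run-count arithmetic in the quoted text above can be
sketched in a few lines of Python. This assumes every Tempest run is an
independent trial with a fixed failure probability, which is a
simplification of how the gate behaves. The function names are
illustrative only and are not part of Elastic Recheck or any other tool.

```python
import math


def detection_probability(failure_rate: float, runs: int) -> float:
    """Chance that at least one of `runs` independent runs hits the race."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of independent runs that reaches `confidence`.

    Solves 1 - (1 - p)^n >= c for n, i.e. n >= ln(1 - c) / ln(1 - p).
    """
    if not 0.0 < failure_rate < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("failure_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    # (failure rate per run, desired chance of seeing at least one failure)
    for rate, conf in [(0.01, 0.5), (0.03, 0.8), (0.04, 0.8)]:
        print(f"rate={rate:.0%} confidence={conf:.0%} "
              f"runs={runs_needed(rate, conf)}")
```

Under that independence assumption, a 3-4% bug needs somewhere between 40
and 53 runs for an 80% chance of at least one failure, which matches the
rough figure of 50 runs.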

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


