[openstack-dev] [nova] top gate bug is libvirt snapshot

Kashyap Chamarthy kchamart at redhat.com
Wed Jul 9 12:17:47 UTC 2014


On Tue, Jul 08, 2014 at 06:21:31PM -0400, Sean Dague wrote:
> On 07/08/2014 06:12 PM, Joe Gordon wrote:
> > 
> > 
> > 
> > On Tue, Jul 8, 2014 at 2:56 PM, Michael Still <mikal at stillhq.com
> > <mailto:mikal at stillhq.com>> wrote:
> > 
> >     The associated bug says this is probably a qemu bug, so I think we
> >     should rephrase that to "we need to start thinking about how to make
> >     sure upstream changes don't break nova".
> > 
> > 
> > Good point.
> >  
> > 
> > Would running devstack-tempest on the latest upstream release of ? help.
> > Not as a voting job but as a periodic (third party?) job, that we can
> > hopefully identify these issues early on. I think the big question here
> > is who would volunteer to help run a job like this.

Although I'm not familiar with the Gate and infra in depth, I can
volunteer to help debug such issues (as I test libvirt/QEMU upstream
releases and builds from git quite frequently).

> The running of the job really isn't the issue.
> 
> It's the debugging of the jobs when they go wrong. Creating a new test
> job and getting it lit is really < 10% of the work, sifting through the
> fails and getting to the bottom of things is the hard and time consuming
> part.

Very true. For instance, take the live snapshot issue[1]: I wish we
could get to the bottom of it (rather than letting it languish) and
re-enable live snapshots in Nova soon. But as of now, we can't pinpoint
the root cause, and going by Dan Berrange's detailed analysis it's no
longer reproducible: after a week of tests outside the Gate, and of
tests with some debugging enabled[2] while the Gate was under light
load, he didn't hit the issue in either case across multiple test runs.

Dan asked on #openstack-nova whether some weird I/O issue in HP cloud
might be causing these timeouts, but Sean said a timeout would be the
explanation only if the test in question sometimes took 2 minutes and
still succeeded.

FWIW, in my local tests of the exact Nova invocation of the libvirt
blockRebase API to do parallel blockcopy operations, followed by an
explicit abort (to gracefully end the block operation), I couldn't
reproduce it over multiple runs either.
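
For anyone who wants to try the same kind of test, here is a minimal
sketch of a blockRebase copy followed by an explicit abort, run in
parallel across guests. It is not the script used for the tests above;
the helper names (copy_then_abort, run_parallel), the polling interval
and the timeout are made up for illustration. It assumes `dom` is a
libvirt-python virDomain object (for example from
conn.lookupByName(...)) and that the destination file already exists,
since REUSE_EXT is set.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Numeric values of libvirt's virDomainBlockRebaseFlags; with libvirt-python
# installed, prefer the libvirt.VIR_DOMAIN_BLOCK_REBASE_* constants.
REBASE_SHALLOW = 1
REBASE_REUSE_EXT = 2
REBASE_COPY = 8


def copy_then_abort(dom, disk, dest, poll=0.5, timeout=120.0):
    """Start a shallow block copy of `disk` into `dest`, then abort the job.

    Returns True if the copy converged (cur == end) before `timeout`,
    False otherwise. The job is aborted in both cases.
    """
    flags = REBASE_COPY | REBASE_REUSE_EXT | REBASE_SHALLOW
    dom.blockRebase(disk, dest, 0, flags)  # bandwidth 0 = unlimited
    deadline = time.monotonic() + timeout
    try:
        while time.monotonic() < deadline:
            info = dom.blockJobInfo(disk, 0)
            # Treat an empty result as "no progress reported yet".
            if info and info.get("end", 0) > 0 and info["cur"] == info["end"]:
                return True
            time.sleep(poll)
        return False
    finally:
        # Explicit abort, so the block operation ends gracefully.
        dom.blockJobAbort(disk, 0)


def run_parallel(jobs, **kwargs):
    """Run copy_then_abort concurrently; `jobs` is a list of (dom, disk, dest)."""
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [pool.submit(copy_then_abort, *job, **kwargs) for job in jobs]
        return [f.result() for f in futures]
```

A False result here would be the interesting case, as it would mean
the copy never converged within the timeout.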

 
  [1] https://bugs.launchpad.net/nova/+bug/1334398 -- libvirt
      live_snapshot periodically explodes on libvirt 1.2.2 in the gate
  [2] https://review.openstack.org/#/c/103066/
 
> 
> The other option is to remove more concurrency from nova-compute. It's
> pretty clear that this problem only seems to happen when the
> snapshotting is going on at the same time guests are being created or
> destroyed (possibly also a second snapshot going on).
> 
> This is also why I find it unlikely to be a qemu bug, because that's not
> shared state between guests. If qemu just randomly wedges itself, that
> would be detectable much easier outside of the gate. And there have been
> attempts by danpb to sniff that out, and they haven't worked.
> 
> 	-Sean
> 


-- 
/kashyap


