Open Stack

Thu Jun 30 20:42:33 UTC 2016

On 6/30/2016 8:44 AM, Daniel P. Berrange wrote:
> A bunch of people in Nova and upstream QEMU teams are trying to investigate
> a long-standing bug in live migration[1]. Unfortunately the bug is rather
> non-deterministic - e.g. on the multinode-live-migration tempest job it has
> hit 4 times in 7 days, while on multinode-full tempest job it has hit
> ~70 times in 7 days.

For those who don't know, the multinode-live-migration job only runs
the live migration tests in Tempest, and it runs on Ubuntu 16.04 nodes,
while the multinode-full job runs the live migration tests plus the
normal Tempest full job run, but on Ubuntu 14.04 nodes. So the
mn-live-migration job *may* be a bit more stable because it's running
with newer libvirt/qemu.

We've at least noticed that another live migration bug isn't showing up
on the dedicated Xenial (16.04) live migration job:

https://bugs.launchpad.net/nova/+bug/1539271

>
> I have a test patch which hacks nova to download & install a special QEMU
> build with extra debugging output[2]. Because of the non-determinism I need
> to then run the multinode-live-migration & multinode-full tempest jobs
> many times to try and catch the bug.  Doing this by just entering 'recheck'
> is rather tedious because you have to wait for the 1+ hour turnaround time
> between each recheck.
>
> To get around this limitation I created a chain of 10 commits [3] which just
> toggled some whitespace and uploaded them all, so I can get 10 CI runs
> going in parallel. This worked remarkably well - at least enough to
> reproduce the more common failure of multinode-full, but not enough for
> the much rarer multinode-live-migration job.

The ascii art is a real treat.

>
> I could expand this hack and upload 100 dummy changes to get more jobs
> running to increase chances of hitting the multinode-live-migration
> failure. Out of the 16 jobs run on every Nova change, I only care about
> running 2 of them. So to get 100 runs of the 2 live migration jobs I want,
> I'd be creating 1600 CI jobs in total which is not too nice for our CI
> resource pool :-(
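
To put numbers on the cost described above, here is a minimal sketch of
the arithmetic. The 16 jobs per change and the 2 wanted jobs come from
the message; the function name and its parameters are hypothetical,
chosen only for illustration.

```python
def ci_job_cost(dummy_changes, jobs_per_change=16, wanted_jobs=2):
    """Return (total, useful, wasted) CI job counts for a stack of dummy changes.

    jobs_per_change and wanted_jobs default to the figures quoted in the
    message (16 jobs per Nova change, 2 live migration jobs of interest).
    """
    total = dummy_changes * jobs_per_change
    useful = dummy_changes * wanted_jobs
    return total, useful, total - useful


if __name__ == "__main__":
    total, useful, wasted = ci_job_cost(100)
    print(f"total={total} useful={useful} wasted={wasted}")
```

With 100 dummy changes only 200 of the 1600 jobs are the ones of
interest, so 1400 runs (87.5%) would be spent on jobs nobody is looking
at.
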
>
> I'd really love it if there was
>
>  1. the ability to request checking of just specific jobs, e.g.
>
>       "recheck gate-tempest-dsvm-multinode-full"

FWIW people have asked for this before. I think it would be OK as long
as there were some way to leave the overall verification score
unchanged, because otherwise it could potentially invalidate earlier
runs where multiple jobs failed but you're only rechecking one of them.

>
>  2. the ability to request this recheck to run multiple
>     times in parallel, e.g. if I just repeat the 'recheck'
>     command many times on the same patchset # without
>     waiting for results
>
> Anyone got any other tips for debugging highly non-deterministic
> bugs like this which only hit perhaps 1 time in 100, without wasting
> huge amounts of CI resource as I'm doing right now?
>
> No one has ever been able to reproduce these failures outside of
> the gate CI infra. Indeed, certain CI hosting providers seem worse
> affected by the bug than others, so running tempest locally is not
> an option.

Good point on the node providers, I hadn't noticed that before, but it 
definitely looks to be hitting OVH and OSIC nodes more than any others:

http://goo.gl/f0coZb
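
As a rough guide to how many parallel runs a "1 time in 100" failure
needs, here is a minimal sketch. It assumes every run fails
independently with the same probability, which is a simplification given
that some node providers appear to hit the bug more often than others.
The 1% rate is the rough estimate quoted above; the function names are
hypothetical.

```python
import math


def p_at_least_one_hit(p_fail, runs):
    """Probability that at least one of `runs` independent runs hits the bug."""
    return 1.0 - (1.0 - p_fail) ** runs


def runs_needed(p_fail, confidence):
    """Smallest number of independent runs giving at least `confidence`
    probability of seeing the bug once or more."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    for n in (10, 100, 300):
        print(n, round(p_at_least_one_hit(0.01, n), 3))
    print("runs for 95% confidence:", runs_needed(0.01, 0.95))
```

Under these assumptions 10 runs catch a 1-in-100 failure less than 10%
of the time, and even 100 runs catch it only about 63% of the time;
roughly 300 runs are needed for a 95% chance.
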

>
> Regards,
> Daniel
>
> [1] https://bugs.launchpad.net/nova/+bug/1524898
> [2] https://review.openstack.org/#/c/335549/5
> [3] https://review.openstack.org/#/q/topic:mig-debug
>

-- 

Thanks,

Matt Riedemann


[openstack-dev] [nova][infra][ci] bulk repeating a test job on a single review in parallel ?
