[openstack-dev] [nova][infra][ci] bulk repeating a test job on a single review in parallel ?

Daniel P. Berrange berrange at redhat.com
Thu Jun 30 13:44:36 UTC 2016

A bunch of people in Nova and upstream QEMU teams are trying to investigate
a long-standing bug in live migration[1]. Unfortunately the bug is rather
non-deterministic - e.g. on the multinode-live-migration tempest job it has
hit 4 times in 7 days, while on multinode-full tempest job it has hit
~70 times in 7 days.

I have a test patch which hacks nova to download & install a special QEMU
build with extra debugging output[2]. Because of the non-determinism I need
to then run the multinode-live-migration & multinode-full tempest jobs
many times to try and catch the bug.  Doing this by just entering 'recheck'
is rather tedious because you have to wait for the 1+ hour turnaround time
between each recheck.

To get around this limitation I created a chain of 10 commits [3] which just
toggled some whitespace and uploaded them all, so I can get 10 CI runs
going in parallel. This worked remarkably well - at least enough to
reproduce the more common failure of multinode-full, but not enough for
the much rarer multinode-live-migration job.

I could expand this hack and upload 100 dummy changes to get more jobs
running to increase chances of hitting the multinode-live-migration
failure. Out of the 16 jobs run on every Nova change, I only care about
running 2 of them. So to get 100 runs of the 2 live migration jobs I want,
I'd be creating 1600 CI jobs in total, which is not too nice for our CI
resource pool :-(
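
As a rough sketch of the arithmetic above (the function and constant names
are purely illustrative; only the figures 16, 2 and 100 come from this
message), the cost of the dummy-change approach can be worked out like so:

```python
# Figures from the message: 16 jobs run per Nova change, 2 of them are
# wanted, and 100 dummy changes would be uploaded.
# All names here are hypothetical, chosen only for this illustration.
JOBS_PER_CHANGE = 16
JOBS_WANTED = 2


def ci_cost(changes, jobs_per_change=JOBS_PER_CHANGE, jobs_wanted=JOBS_WANTED):
    """Return (total, useful, wasted) CI job counts for `changes` dummy changes."""
    total = changes * jobs_per_change   # every job runs on every change
    useful = changes * jobs_wanted      # only the live migration jobs matter
    return total, useful, total - useful


if __name__ == "__main__":
    total, useful, wasted = ci_cost(100)
    print(f"total={total} useful={useful} wasted={wasted}")
```

With 100 dummy changes, only 200 of the 1600 jobs would be the two live
migration jobs of interest; the other 1400 would be run for nothing.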

I'd really love it if there was

 1. the ability to request checking of just specific jobs, e.g.

      "recheck gate-tempest-dsvm-multinode-full"

 2. the ability to request this recheck to run multiple
    times in parallel, e.g. if I just repeat the 'recheck'
    command many times on the same patchset # without
    waiting for results

Anyone got any other tips for debugging highly non-deterministic
bugs like this which only hit perhaps 1 time in 100, without wasting
huge amounts of CI resource as I'm doing right now?
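
To put a number on how many runs that implies, here is a minimal sketch.
It assumes each CI run hits the bug independently with the same probability,
and it takes the "1 time in 100" figure above at face value; both are
simplifying assumptions, and the function names are made up for illustration.

```python
import math


def p_hit(p, runs):
    """Probability of seeing the bug at least once in `runs` independent runs."""
    return 1 - (1 - p) ** runs


def runs_needed(p, confidence):
    """Smallest number of independent runs giving at least `confidence`
    probability of hitting the bug once, for per-run hit probability `p`."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


if __name__ == "__main__":
    p = 1 / 100  # assumed per-run hit rate, from the rough estimate above
    print(f"chance of a hit in 100 runs: {p_hit(p, 100):.2f}")
    print(f"runs for a 95% chance of a hit: {runs_needed(p, 0.95)}")
```

Under those assumptions 100 runs would catch the bug only about 63% of the
time, and roughly 300 runs would be needed for a 95% chance of one hit.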

No one has ever been able to reproduce these failures outside of
the gate CI infra; indeed, certain CI hosting providers seem worse
affected by the bug than others, so running tempest locally is not
an option.


[1] https://bugs.launchpad.net/nova/+bug/1524898
[2] https://review.openstack.org/#/c/335549/5
[3] https://review.openstack.org/#/q/topic:mig-debug
