<div dir="ltr">On Tue, Jan 17, 2017 at 6:41 PM, Jay Faulkner <span dir="ltr">&lt;<a href="mailto:jay@jvf.cc" target="_blank">jay@jvf.cc</a>&gt;</span> wrote:<br><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>

<br>

Back in late October, Vasyl added support to devstack for auto-detecting KVM and, when possible, using it to power Ironic gate jobs (<wbr>0036d83b330d98e64d656b156001dd<wbr>2209ab1903). This has reduced job run time when it works, but it has also caused failures. How many is hard to quantify, because the log messages that show the error don’t appear to be indexed by Elasticsearch. The issue is seen often enough that it has become a permanent staple on our gate whiteboard, and it doesn’t appear to be getting less frequent.<br>

<br>

I pushed up a patch, <a href="https://review.openstack.org/#/c/421581" rel="noreferrer" target="_blank">https://review.openstack.org/#<wbr>/c/421581</a>, which keeps the auto-detection behavior but makes devstack default to QEMU emulation instead of KVM.<br>

<br>

I have two questions:<br>

1) Is there a way I’m not aware of to quantify the number of failures this is causing? The key log message, "KVM: entry failed, hardware error 0x0", shows up in logs/libvirt/qemu/node-*.txt.<wbr>gz.<br>

2) Are these failures avoidable or visible in any way?<br>

<br>

IMO, if we can’t fix these failures, we have to make a change to avoid using nested KVM altogether. Lower reliability for our jobs is not worth a small decrease in job run time.<br></blockquote><div><br></div><div>+2. Especially this late in the cycle, we need our CI to be rock solid.<br><br></div><div>// jim</div><br></div><br></div></div>
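<div>On question 1, here is a minimal sketch of one way to count occurrences by hand. It assumes the job logs have already been downloaded to a local directory that keeps the logs/libvirt/qemu/node-*.txt.gz layout mentioned in the thread. The <code>log_root</code> argument and the function name <code>count_kvm_failures</code> are hypothetical, and the error string is matched as a plain substring. It is not a replacement for proper indexing.</div>
<pre>
```python
import glob
import gzip
import os

# The message quoted in the thread, matched as a plain substring.
KVM_ERROR = "KVM: entry failed, hardware error 0x0"


def count_kvm_failures(log_root):
    """Return {path: hit_count} for node logs under log_root containing KVM_ERROR."""
    # Assumed layout: log_root/logs/libvirt/qemu/node-*.txt.gz
    pattern = os.path.join(log_root, "logs", "libvirt", "qemu", "node-*.txt.gz")
    hits = {}
    for path in sorted(glob.glob(pattern)):
        # Read as text; replace undecodable bytes instead of failing on them.
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
            count = sum(1 for line in fh if KVM_ERROR in line)
        if count:
            hits[path] = count
    return hits
```
</pre>
<div>Calling <code>count_kvm_failures("/path/to/downloaded/job")</code> on each downloaded job directory and counting the non-empty results would give a rough number of affected runs, assuming one directory per job.</div>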