[openstack-dev] [ironic] [infra] Nested KVM + the gate

Clark Boylan cboylan at sapwetik.org
Tue Jan 17 23:51:49 UTC 2017


On Tue, Jan 17, 2017, at 03:41 PM, Jay Faulkner wrote:
> Hi all,
> 
> Back in late October, Vasyl wrote support for devstack to auto-detect,
> and when possible use, kvm to power Ironic gate jobs
> (0036d83b330d98e64d656b156001dd2209ab1903). This has lowered job run
> time when it works, but it has also caused failures. How many is hard
> to quantify, as the log messages that show the error don't appear to
> be indexed by Elasticsearch. It's seen often enough that the issue has
> become a permanent staple on our gate whiteboard, and it doesn't
> appear to be decreasing in frequency.
> 
> I pushed up a patch, https://review.openstack.org/#/c/421581, which
> keeps the auto-detection behavior but defaults devstack to qemu
> emulation instead of kvm.
> 
> I have two questions:
> 1) Is there any way I'm not aware of to quantify the number of
> failures this is causing? The key log message, "KVM: entry failed,
> hardware error 0x0", shows up in logs/libvirt/qemu/node-*.txt.gz.
> 2) Are these failures avoidable or visible in any way?
> 
> IMO, if we can't fix these failures, we have to make a change to
> avoid using nested KVM altogether. Lower reliability for our jobs is
> not worth a small decrease in job run time.
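
For reference, the auto-detection you describe boils down to checking
whether /dev/kvm is present and usable on the test node and picking the
libvirt domain type from that. A rough Python sketch of the idea (the
real devstack code is shell; this is illustrative only, not the actual
implementation):

    import os

    def kvm_usable(dev="/dev/kvm"):
        # /dev/kvm only exists when the kvm kernel module is loaded,
        # i.e. when the CPU visible at this level of the stack exposes
        # VMX/SVM.
        return os.path.exists(dev) and os.access(dev, os.R_OK | os.W_OK)

    # kvm when /dev/kvm is usable, otherwise fall back to qemu emulation.
    libvirt_type = "kvm" if kvm_usable() else "qemu"
    print("would use LIBVIRT_TYPE=%s" % libvirt_type)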

Part of the problem with nested KVM failures is that in many cases they
destroy the test nodes in unrecoverable ways. In those cases you don't
get any logs, and zuul will restart the job for you. I think graphite
will record these as jobs that resulted in a Null/None status (rather
than SUCCESS/FAILURE), though.
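
If someone wants to try putting numbers on that, graphite's render API
can be queried over HTTP. This is only a rough sketch: the metric path
below is a hypothetical placeholder and would need to be checked
against what zuul actually reports to graphite.openstack.org:

    import json
    from urllib.request import urlopen

    GRAPHITE = "http://graphite.openstack.org/render"
    # Hypothetical metric path; the real zuul job metric names need to
    # be confirmed before relying on this.
    TARGET = "stats_counts.zuul.pipeline.check.job.gate-tempest-dsvm-ironic.*"

    url = "%s?target=%s&from=-7days&format=json" % (GRAPHITE, TARGET)
    for series in json.load(urlopen(url)):
        # datapoints are [value, timestamp] pairs; value is None for gaps
        total = sum(v for v, _ts in series["datapoints"] if v)
        print("%s: %d" % (series["target"], total))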

As for collecting info when you do get logs, we don't currently index
the libvirt instance logs, and I am not sure we want to. We already
struggle to keep up with the existing set of logs when we are busy.
Instead we might have job cleanup do a quick grep for known nested-virt
problem indicators and log what it finds to the console log, which is
indexed.
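
Something along those lines might look like the following sketch. The
file pattern matches the logs Jay mentioned; the second signature is
just an example of the kind of indicator we might flag, not an
agreed-on list:

    import glob
    import gzip

    SIGNATURES = [
        "KVM: entry failed, hardware error 0x0",
        "Kernel panic - not syncing",  # example of another indicator
    ]

    for path in glob.glob("logs/libvirt/qemu/node-*.txt.gz"):
        with gzip.open(path, "rt", errors="replace") as fh:
            for line in fh:
                if any(sig in line for sig in SIGNATURES):
                    # anything printed here ends up in the indexed
                    # console log
                    print("NESTED-VIRT-INDICATOR %s: %s"
                          % (path, line.strip()))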

I think trove has also seen kernel-panic-type errors in syslog that we
hypothesized were a result of using nested virt.

The infra team explicitly attempts to force qemu instead of kvm on jobs
using devstack-gate for these reasons: we know nested KVM doesn't work
reliably, and not all clouds support it. Unfortunately, my
understanding of the situation is that the base hypervisor's CPU and
kernel, the second-level hypervisor's kernel, and the nested guest's
kernel all come into play here, and there can be nasty interactions
between them that cause a variety of problems.
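
One small piece that is easy to check is whether nested virtualization
is even enabled at the first level. A short sketch using the standard
kvm module parameters (nothing gate-specific about it):

    import os

    def nested_enabled():
        for param in ("/sys/module/kvm_intel/parameters/nested",
                      "/sys/module/kvm_amd/parameters/nested"):
            if os.path.exists(param):
                with open(param) as fh:
                    # older kernels report Y/N, newer ones 1/0
                    return fh.read().strip() in ("Y", "1")
        return False  # no kvm module loaded at all

    print("nested virtualization enabled: %s" % nested_enabled())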

Put another way:

2017-01-14T00:42:00  <mnaser> if we're talking nested kvm
2017-01-14T00:42:04  <mnaser> it's kindof a nightmare
from
http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-01-14.log

Clark


