[openstack-dev] [ironic] [infra] Nested KVM + the gate

Amrith Kumar amrith.kumar at gmail.com
Wed Jan 18 02:24:26 UTC 2017


Clark is right, trove does detect and try to use kvm where possible. The
performance improvement has been well worth the change (IMHO).

-amrith

On Jan 17, 2017 6:53 PM, "Clark Boylan" <cboylan at sapwetik.org> wrote:

> On Tue, Jan 17, 2017, at 03:41 PM, Jay Faulkner wrote:
> > Hi all,
> >
> > Back in late October, Vasyl wrote support for devstack to auto-detect,
> > and when possible, use kvm to power Ironic gate jobs
> > (0036d83b330d98e64d656b156001dd2209ab1903). This has lowered job run
> > times when it works, but has caused failures — how many? It’s hard to
> > quantify, as the log messages that show the error don’t appear to be
> > indexed by Elasticsearch. It’s something seen often enough that the
> > issue has become a permanent staple on our gate whiteboard, and doesn’t
> > appear to be decreasing in quantity.
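
A rough sketch of that kind of auto-detection (not devstack's actual code;
the variable name is illustrative): prefer kvm only when /dev/kvm exists and
is writable, otherwise fall back to plain qemu emulation.

```shell
# Illustrative sketch only, not devstack's real detection logic.
# Prefer hardware-accelerated kvm when /dev/kvm is present and usable;
# otherwise fall back to software emulation with qemu.
if [ -e /dev/kvm ] && [ -w /dev/kvm ]; then
    LIBVIRT_TYPE=kvm
else
    LIBVIRT_TYPE=qemu
fi
echo "selected virt type: $LIBVIRT_TYPE"
```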
> >
> > I pushed up a patch, https://review.openstack.org/#/c/421581, which
> > keeps the auto detection behavior, but defaults devstack to use qemu
> > emulation instead of kvm.
> >
> > I have two questions:
> > 1) Is there any way I’m not aware of to quantify the number of
> > failures this is causing? The key log message, "KVM: entry failed,
> > hardware error 0x0”, shows up in logs/libvirt/qemu/node-*.txt.gz.
> > 2) Are these failures avoidable or visible in any way?
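
One rough way to quantify from a single job's collected logs (the path
matches the logs/libvirt/qemu/node-*.txt.gz location mentioned above;
everything else here is an illustrative assumption, not gate tooling):

```shell
# Count occurrences of the nested-KVM failure signature across the
# compressed libvirt instance logs of one job run. Sketch only.
matches=0
for f in logs/libvirt/qemu/node-*.txt.gz; do
    [ -e "$f" ] || continue            # the glob may match nothing
    n=$(zgrep -c 'KVM: entry failed, hardware error 0x0' "$f") || n=0
    matches=$((matches + n))
done
echo "nested-KVM failure lines found: $matches"
```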
> >
> > IMO, if we can’t fix these failures, we have to change to avoid
> > using nested KVM altogether. Lower reliability for our jobs is not
> > worth a small decrease in job run time.
>
> Part of the problem with nested KVM failures is that in many cases they
> destroy the test nodes in unrecoverable ways. In which case you don't
> get any logs, and zuul will restart the job for you. I think that
> graphite will capture this as a job that resulted in a Null/None status
> though (rather than SUCCESS/FAILURE).
>
> As for collecting info when you do get logs, we don't index the libvirt
> instance logs currently and I am not sure we want to. We already
> struggle to keep up with the existing set of logs when we are busy.
> Instead we might have job cleanup do a quick grep for known nested virt
> problem indicators and then log that to the console log which will be
> indexed.
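
A hypothetical cleanup-phase snippet along those lines: grep the
(unindexed) libvirt logs for the one signature quoted in this thread and
echo any hit to stdout, so it lands in the indexed console log. The log
path is an assumption, and more known signatures could be appended to the
list as they are identified.

```shell
# Sketch of a job-cleanup grep for known nested-virt failure indicators.
# Matching lines are surfaced on the console log, which is indexed.
found_any=0
for sig in 'KVM: entry failed, hardware error 0x0'; do
    if zgrep -q "$sig" logs/libvirt/qemu/node-*.txt.gz 2>/dev/null; then
        echo "NESTED_VIRT_SIGNATURE: $sig"
        found_any=1
    fi
done
```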
>
> I think trove has also seen kernel panic type errors in syslog that we
> hypothesized were a result of using nested virt.
>
> The infra team explicitly attempts to force qemu instead of kvm on jobs
> using devstack-gate for these reasons. We know it doesn't work reliably
> and not all clouds support it. Unfortunately my understanding of the
> situation is that base hypervisor cpu and kernel, second level
> hypervisor kernel, and nested guest kernel all come into play here. And
> there can be nasty interactions between them causing a variety of
> problems.
>
> Put another way:
>
> 2017-01-14T00:42:00  <mnaser> if we're talking nested kvm
> 2017-01-14T00:42:04  <mnaser> it's kindof a nightmare
> from
> http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-01-14.log
>
> Clark
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>

