[openstack-dev] [nova] Intel NFV CI failing all shelve/unshelve tests
Chris Friesen
chris.friesen at windriver.com
Wed May 25 16:39:37 UTC 2016
On 05/22/2016 05:41 PM, Jay Pipes wrote:
> Hello Novaites,
>
> I've noticed that the Intel NFV CI has been failing all test runs for quite some
> time (at least a few days), always failing the same tests around shelve/unshelve
> operations.
<snip>
> I looked through the conductor and compute logs to see if I could find any
> possible reasons for the errors and found a number of the following errors in
> the compute logs:
>
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52] Traceback (most recent call last):
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52] File
> "/opt/stack/new/nova/nova/compute/manager.py", line 4230, in _unshelve_instance
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52] with rt.instance_claim(context,
> instance, limits):
<snip>
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52] newcell.unpin_cpus(pinned_cpus)
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52] File
> "/opt/stack/new/nova/nova/objects/numa.py", line 94, in unpin_cpus
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52] pinned=list(self.pinned_cpus))
> 2016-05-22 22:18:59.403 8145 ERROR nova.compute.manager [instance:
> cae6fd47-0968-4922-a03e-3f2872e4eb52] CPUPinningInvalid: Cannot pin/unpin cpus
> [6] from the following pinned set [0, 2, 4]
>
> on or around the time of the failures in Tempest.
>
> Perhaps tomorrow morning we can look into handling the above exception properly
> from the compute manager, since clearly we shouldn't be allowing
> CPUPinningInvalid to be raised in the resource tracker's _update_usage() call....

First, it seems wrong to me that an _unshelve_instance() call would result in
unpinning any CPUs. If the instance was using pinned CPUs then I would expect
the CPUs to be unpinned when doing the "shelve" operation. When we do an
instance claim as part of the "unshelve" operation we should be pinning CPUs,
not unpinning them.

Second, CPUPinningInvalid gets raised in _update_usage() because the resource
tracker has discovered an inconsistency in its view of resources. In this case,
it's trying to unpin CPU 6 from a set of pinned CPUs that doesn't include CPU 6.
I think this is a valid concern and should result in an error log. Whether it
should cause the unshelve operation to fail is a separate question, but it's
definitely a symptom that something is wrong with resource tracking on this
compute node.
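
To make the inconsistency concrete, here is a minimal sketch of the kind of
bookkeeping check involved. This is not nova's actual code: ToyCell and
PinningError are hypothetical stand-ins for the NUMA cell object and
CPUPinningInvalid, and the 8-CPU cell is an assumption. Only the pinned set
[0, 2, 4] and CPU 6 come from the log above.

```python
class PinningError(Exception):
    """Hypothetical stand-in for nova's CPUPinningInvalid."""


class ToyCell:
    """Toy model of a cell's pinned-CPU bookkeeping (not nova's real object)."""

    def __init__(self, cpuset, pinned=()):
        self.cpuset = set(cpuset)
        self.pinned_cpus = set(pinned)

    def pin_cpus(self, cpus):
        cpus = set(cpus)
        # Pinning a CPU that is already pinned means our view is stale.
        if cpus & self.pinned_cpus:
            raise PinningError(
                "Cannot pin/unpin cpus %s from the following pinned set %s"
                % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus |= cpus

    def unpin_cpus(self, cpus):
        cpus = set(cpus)
        # Unpinning a CPU we never recorded as pinned is the same kind of
        # inconsistency, which is what the traceback is reporting.
        if not cpus <= self.pinned_cpus:
            raise PinningError(
                "Cannot pin/unpin cpus %s from the following pinned set %s"
                % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus -= cpus


# The tracker believes CPUs 0, 2 and 4 are pinned, then is asked to unpin 6.
cell = ToyCell(cpuset=range(8), pinned=[0, 2, 4])
try:
    cell.unpin_cpus([6])
except PinningError as exc:
    message = str(exc)
```

In this toy model the failed unpin leaves the pinned set untouched, so the
error tells us the caller's idea of what is pinned has diverged from the
tracker's. Catching the exception would hide the symptom without fixing the
divergence.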
Chris