[Openstack-operators] PCI passthrough trying to use busy resource?

Jonathan D. Proulx jon at csail.mit.edu
Tue Oct 18 19:05:46 UTC 2016


Answering my own questions a bit faster than I thought I could.

nova DB has a pci_devices table.

What happened was there was an intermediate state where the
pci_passthrough_whitelist value on the hypervisor was
missing. Apparently during that time the rows for this hypervisor in the
pci_devices table got marked as deleted.  Then when the nova.conf got
fixed they got recreated (even though the old 'deleted' resources were
really actively in use).
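For reference, the whitelist entry that went missing lives in nova.conf
on the hypervisor and (Mitaka-era syntax) looks something like the
below -- the vendor/product IDs here are just examples (10de is NVIDIA;
check lspci -nn for your own), not my actual config:

```ini
# /etc/nova/nova.conf on the hypervisor (example values, not mine)
[DEFAULT]
pci_passthrough_whitelist = {"vendor_id": "10de", "product_id": "1b38"}
```

If this line disappears, nova-compute stops reporting those PCI devices
and their pci_devices rows get soft-deleted, which is what bit me here.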

So I ended up with this colliding state:


> SELECT created_at,deleted_at,deleted,id,compute_node_id,address,status,instance_uuid FROM pci_devices WHERE address='0000:09:00.0';
+---------------------+---------------------+---------+----+-----------------+--------------+-----------+--------------------------------------+
| created_at          | deleted_at          | deleted | id | compute_node_id | address      | status    | instance_uuid                        |
+---------------------+---------------------+---------+----+-----------------+--------------+-----------+--------------------------------------+
| 2016-07-06 00:12:30 | 2016-10-13 21:04:53 |       4 |  4 |              90 | 0000:09:00.0 | allocated | 9269391a-4ce4-4c8d-993d-5ad7a9c3879b |
| 2016-10-18 18:01:35 | NULL                |       0 | 12 |              90 | 0000:09:00.0 | available | NULL                                 |
+---------------------+---------------------+---------+----+-----------------+--------------+-----------+--------------------------------------+

Since it's really only 3 entries I can fix this by hand, then head over
to bug report land.
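The hand fix would go roughly like this -- a sketch only, not verified
beyond the columns shown in the SELECT above, with the ids taken from
that output (adjust to match your own rows): the instance really does
own the device, so the spurious new 'available' row is the one to drop,
and the original 'allocated' row is the one to un-delete.

```sql
-- Drop the duplicate row recreated while the whitelist was missing:
DELETE FROM pci_devices WHERE id = 12;
-- Un-delete the original row that still reflects the real allocation:
UPDATE pci_devices
   SET deleted = 0, deleted_at = NULL
 WHERE id = 4;
```

With the duplicates gone the scheduler's view of the pool should match
what libvirt actually has allocated.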

-Jon


On Tue, Oct 18, 2016 at 02:50:11PM -0400, Jonathan D. Proulx wrote:
:Hi all,
:
:I have a test GPU system that seemed to be working properly under Kilo
:running 1 and 2 GPU instance types on an 8-GPU server.
:
:After the Mitaka upgrade it seems to always try and assign the same
:device which is already in use, rather than pick one of the 5 currently
:available.
:
:
: Build of instance 9542cc63-793c-440e-9a57-cc06eb401839 was
: re-scheduled: Requested operation is not valid: PCI device
: 0000:09:00.0 is in use by driver QEMU, domain instance-000abefa
: _do_build_and_run_instance
: /usr/lib/python2.7/dist-packages/nova/compute/manager.py:1945
:
:it tries to schedule 5 times, but each time uses the same busy
:device.  Since there are currently only 3 in use, it would have
:succeeded if it had just picked a new one each time.
:
:In trying to debug this I realize I have no idea how devices are
:selected. Does OpenStack track which PCI devices are claimed, or is
:that a libvirt function? And in either case, where would I look to find
:out what it thinks the current state is?
:
:Thanks,
:-Jon
:-- 

-- 



More information about the OpenStack-operators mailing list