[Openstack] [nova] Database does not delete PCI info after device is removed from host and nova.conf

Eddie Yen missile0407 at gmail.com
Fri Jul 7 02:37:21 UTC 2017


Hi Jay,

Below are few logs and information you may want to check.



I wrote the GPU information into nova.conf like this:

pci_passthrough_whitelist = [{ "product_id":"0ff3", "vendor_id":"10de" }, {
"product_id":"68c8", "vendor_id":"1002" }]

pci_alias = [{ "product_id":"0ff3", "vendor_id":"10de", "device_type":
"type-PCI", "name":"k420" }, { "product_id":"68c8", "vendor_id":"1002",
"device_type":"type-PCI", "name":"v4800" }]

Then I restarted the services.

nova-compute log after adding the new GPU device info to nova.conf and
restarting the service:
http://paste.openstack.org/show/z015rYGXaxYhVoafKdbx/

Strangely, the log shows that the resource tracker only collected information
for the newly added GPU, not the old one.


But if I perform some action on the instance that has the old GPU attached,
the tracker picks up both GPUs:
http://paste.openstack.org/show/614658/

The Nova database shows correct information for both GPUs:
http://paste.openstack.org/show/8JS0i6BMitjeBVRJTkRo/



Now remove ID "1002:68c8" from nova.conf and compute node, and restart
services.

pci_passthrough_whitelist and pci_alias now keep only the "10de:0ff3" GPU info:

pci_passthrough_whitelist = { "product_id":"0ff3", "vendor_id":"10de" }

pci_alias = { "product_id":"0ff3", "vendor_id":"10de", "device_type":
"type-PCI", "name":"k420" }

The nova-compute log shows the resource tracker reporting that the node only
has the "10de:0ff3" PCI resource:
http://paste.openstack.org/show/VjLinsipne5nM8o0TYcJ/

But in the Nova database, "1002:68c8" still exists and stays in "available"
status, even though its "deleted" value is non-zero:
http://paste.openstack.org/show/SnJ8AzJYD6wCo7jslIc2/
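
In case it helps anyone hitting the same thing, here is a minimal sketch to
inspect that leftover row directly. It assumes a MySQL-backed Nova database,
the PyMySQL driver, placeholder connection settings, and the standard
pci_devices columns (vendor_id, product_id, address, status, deleted):

# Sketch: read the stale pci_devices row for the removed 1002:68c8 card.
# Host, user, and password below are placeholders, not real settings.
import pymysql

conn = pymysql.connect(host="controller", user="nova",
                       password="NOVA_DB_PASSWORD", db="nova")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, address, status, deleted FROM pci_devices "
            "WHERE vendor_id = %s AND product_id = %s",
            ("1002", "68c8"))
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()

If that row really has to go away by hand (as Jay suggests below), the same
connection could run a DELETE or a soft-delete UPDATE on it, but that is a
manual workaround, so back up the nova database first.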


Many thanks,
Eddie.

2017-07-07 9:05 GMT+08:00 Eddie Yen <missile0407 at gmail.com>:

> Uh wait,
>
> Is it possible that it still shows as available because a PCI device still
> exists at the same address?
>
> Because when I removed the GPU card, I replaced it with an SFP+ network card
> in the same slot.
> So when I run lspci, the SFP+ card sits at the same address.
>
> But it still doesn't make any sense, because these two cards are definitely
> not the same VID:PID,
> and I set the information by VID:PID in nova.conf.
>
>
> I'll try to reproduce this issue and post the logs to this list.
>
> Thanks,
>
> 2017-07-07 9:01 GMT+08:00 Jay Pipes <jaypipes at gmail.com>:
>
>> Hmm, very odd indeed. Any way you can save the nova-compute logs from
>> when you removed the GPU and restarted the nova-compute service and paste
>> those logs to paste.openstack.org? Would be useful in tracking down this
>> buggy behaviour...
>>
>> Best,
>> -jay
>>
>> On 07/06/2017 08:54 PM, Eddie Yen wrote:
>>
>>> Hi Jay,
>>>
>>> The status of the "removed" GPU still shows as "Available" in the
>>> pci_devices table.
>>>
>>> 2017-07-07 8:34 GMT+08:00 Jay Pipes <jaypipes at gmail.com>:
>>>
>>>
>>>     Hi again, Eddie :) Answer inline...
>>>
>>>     On 07/06/2017 08:14 PM, Eddie Yen wrote:
>>>
>>>         Hi everyone,
>>>
>>>         I'm using OpenStack Mitaka version (deployed from Fuel 9.2)
>>>
>>>         At present, I have two GPU cards of different models installed.
>>>
>>>         I wrote their information into pci_alias and
>>>         pci_passthrough_whitelist in nova.conf on the Controller and the
>>>         Compute node (the node with the GPUs installed),
>>>         then restarted nova-api, nova-scheduler, and nova-compute.
>>>
>>>         When I checked the database, both GPUs were registered in the
>>>         pci_devices table.
>>>
>>>         Now I removed one of the GPUs from the compute node, removed its
>>>         information from nova.conf, and restarted the services.
>>>
>>>         But when I check the database again, the information for the
>>>         removed card still exists in the pci_devices table.
>>>
>>>         What can I do to fix this problem?
>>>
>>>
>>>     So, when you removed the GPU from the compute node and restarted the
>>>     nova-compute service, it *should* have noticed you had removed the
>>>     GPU and marked that PCI device as deleted. At least, according to
>>>     this code in the PCI manager:
>>>
>>>     https://github.com/openstack/nova/blob/master/nova/pci/manager.py#L168-L183
>>>
>>>     Question for you: what is the value of the status field in the
>>>     pci_devices table for the GPU that you removed?
>>>
>>>     Best,
>>>     -jay
>>>
>>>     p.s. If you really want to get rid of that device, simply remove
>>>     that record from the pci_devices table. But, again, it *should* be
>>>     removed automatically...
>>>
>>>
>>>
>>>
>

