[Openstack] [nova] Database does not delete PCI info after device is removed from host and nova.conf

Eddie Yen missile0407 at gmail.com
Tue Jul 11 01:22:48 UTC 2017


Oops, I just reported this issue on Launchpad a moment ago.

Thanks Moshe, I'll try this commit.

2017-07-11 9:13 GMT+08:00 Moshe Levi <moshele at mellanox.com>:

> Hi Eddie,
>
> Looking at your nova database after the delete, it looks correct to me.
>
> | created_at          | updated_at          | deleted_at          | deleted | id |
> | 2017-06-21 00:56:06 | 2017-07-07 02:27:16 | NULL                |       0 |  2 |
> | 2017-07-07 01:42:48 | 2017-07-07 02:13:14 | 2017-07-07 02:13:42 |       9 |  9 |
>
> Note that the second row has a deleted_at timestamp and a non-zero deleted
> value (set to the id of the row). Nova does a soft delete, which merely marks
> the row as deleted; it does not actually remove it from the pci_devices table.
>
> See [1] and [2]
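>
> For illustration, the soft-delete mixin referenced in [1] behaves roughly
> like the following (a paraphrased sketch, not the exact upstream code):
>
>     from oslo_utils import timeutils
>     from sqlalchemy import Column, DateTime, Integer
>
>     class SoftDeleteMixin(object):
>         # "deleted" stays 0 while the row is live; soft_delete() sets it
>         # to the row's own id and stamps deleted_at. The row itself is
>         # never removed from the table.
>         deleted_at = Column(DateTime)
>         deleted = Column(Integer, default=0)
>
>         def soft_delete(self, session):
>             self.deleted = self.id
>             self.deleted_at = timeutils.utcnow()
>             self.save(session=session)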
>
>
>
> There is a bug with pci_devices in the scenario where an allocated PCI
> device can be deleted, e.g. when pci.passthrough_whitelist is changed;
> commit [3] tries to resolve it.
>
> [1] - https://github.com/openstack/oslo.db/blob/master/oslo_db/sqlalchemy/models.py#L142-L150
>
> [2] - https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/models.py#L1411
>
> [3] - https://review.openstack.org/#/c/426243/
>
>
>
> From: Eddie Yen [mailto:missile0407 at gmail.com]
> Sent: Tuesday, July 11, 2017 3:18 AM
> To: Jay Pipes <jaypipes at gmail.com>
> Cc: openstack at lists.openstack.org
> Subject: Re: [Openstack] [nova] Database does not delete PCI info after
> device is removed from host and nova.conf
>
> Roger that,
>
> I'm going to report this bug on the OpenStack Compute (Nova) Launchpad
> to see what happens.
>
> Anyway, thanks for your help, I really appreciate it.
>
>
> Eddie.
>
> 2017-07-11 8:12 GMT+08:00 Jay Pipes <jaypipes at gmail.com>:
>
> Unfortunately, Eddie, I'm not entirely sure what is going on with your
> situation. According to the code, the non-existing PCI device should be
> removed from the pci_devices table when the PCI manager notices the PCI
> device is no longer on the local host...
>
> On 07/09/2017 08:36 PM, Eddie Yen wrote:
>
> Hi there,
>
> Is the information already enough, or do you need additional items?
>
> Thanks,
> Eddie.
>
> 2017-07-07 10:49 GMT+08:00 Eddie Yen <missile0407 at gmail.com>:
>
>     Sorry,
>
>     Here is a new nova-compute log, captured after removing "1002:68c8"
>     and restarting nova-compute:
>     http://paste.openstack.org/show/qUCOX09jyeMydoYHc8Oz/
>
>     2017-07-07 10:37 GMT+08:00 Eddie Yen <missile0407 at gmail.com>:
>
>         Hi Jay,
>
>         Below are a few logs and some information you may want to check.
>
>         I wrote the GPU information into nova.conf like this:
>
>         pci_passthrough_whitelist = [{ "product_id":"0ff3",
>         "vendor_id":"10de"}, { "product_id":"68c8", "vendor_id":"1002"}]
>
>         pci_alias = [{ "product_id":"0ff3", "vendor_id":"10de",
>         "device_type":"type-PCI", "name":"k420"}, { "product_id":"68c8",
>         "vendor_id":"1002", "device_type":"type-PCI", "name":"v4800"}]
>
>
>         Then restart the services.
>
>         nova-compute log after inserting the new GPU device info into
>         nova.conf and restarting the service:
>         http://paste.openstack.org/show/z015rYGXaxYhVoafKdbx/
>
>         Strangely, the log shows that the resource tracker only collects
>         information for the newly added GPU, not the old one.
>
>
>         But if I perform some actions on the instance containing the old
>         GPU, the tracker will pick up both GPUs:
>         http://paste.openstack.org/show/614658/
>
>         The Nova database shows correct information for both GPUs:
>         http://paste.openstack.org/show/8JS0i6BMitjeBVRJTkRo/
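>
>         For reference, that check can be reproduced with something like
>         the following (a sketch; the connection URL and credentials are
>         assumptions for your own deployment):
>
>             from sqlalchemy import create_engine, text
>
>             # Hypothetical connection string; substitute your own.
>             engine = create_engine("mysql+pymysql://nova:secret@controller/nova")
>             with engine.connect() as conn:
>                 rows = conn.execute(text(
>                     "SELECT address, vendor_id, product_id, status, deleted "
>                     "FROM pci_devices"))
>                 for row in rows:
>                     print(row)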
>
>         Now I remove the "1002:68c8" device from the compute node and its
>         entry from nova.conf, and restart the services.
>
>         The pci_passthrough_whitelist and pci_alias now keep only the
>         "10de:0ff3" GPU info:
>
>         pci_passthrough_whitelist = { "product_id":"0ff3",
>         "vendor_id":"10de" }
>
>         pci_alias = { "product_id":"0ff3", "vendor_id":"10de",
>         "device_type":"type-PCI", "name":"k420" }
>
>
>         The nova-compute log shows the resource tracker reporting that
>         the node has only the "10de:0ff3" PCI resource:
>         http://paste.openstack.org/show/VjLinsipne5nM8o0TYcJ/
>
>         But in the Nova database, "1002:68c8" still exists and stays in
>         "Available" status, even though the "deleted" value is non-zero:
>         http://paste.openstack.org/show/SnJ8AzJYD6wCo7jslIc2/
>
>
>         Many thanks,
>         Eddie.
>
>         2017-07-07 9:05 GMT+08:00 Eddie Yen <missile0407 at gmail.com>:
>
>             Uh, wait.
>
>             Is it possible that it still shows available because a PCI
>             device still exists at the same address?
>
>             When I removed the GPU card, I replaced it with an SFP+
>             network card in the same slot.
>             So when I run lspci, the SFP+ card sits at the same address.
>
>             But that still doesn't make sense, because the two cards
>             definitely do not have the same VID:PID,
>             and I set the information as VID:PID in nova.conf.
>
>
>             I'll try to reproduce this issue and post a log to this list.
>
>             Thanks,
>
>             2017-07-07 9:01 GMT+08:00 Jay Pipes <jaypipes at gmail.com>:
>
>                 Hmm, very odd indeed. Any way you can save the
>                 nova-compute logs from when you removed the GPU and
>                 restarted the nova-compute service and paste those logs
>                 to paste.openstack.org?
>                 Would be useful in tracking down this buggy behaviour...
>
>                 Best,
>                 -jay
>
>                 On 07/06/2017 08:54 PM, Eddie Yen wrote:
>
>                     Hi Jay,
>
>                     The status of the "removed" GPU still shows as
>                     "Available" in the pci_devices table.
>
>                     2017-07-07 8:34 GMT+08:00 Jay Pipes <jaypipes at gmail.com>:
>
>
>                          Hi again, Eddie :) Answer inline...
>
>                          On 07/06/2017 08:14 PM, Eddie Yen wrote:
>
>                              Hi everyone,
>
>                              I'm using the OpenStack Mitaka version
>                              (deployed from Fuel 9.2).
>
>                              At present, I have two different models of
>                              GPU card installed.
>
>                              I wrote their information into pci_alias and
>                              pci_passthrough_whitelist in nova.conf on the
>                              Controller and the Compute node (the one with
>                              the GPUs installed), then restarted nova-api,
>                              nova-scheduler, and nova-compute.
>
>                              When I checked the database, both GPUs were
>                              registered in the pci_devices table.
>
>                              Then I removed one of the GPUs from the
>                              compute node, removed its information from
>                              nova.conf, and restarted the services.
>
>                              But when I checked the database again, the
>                              information for the removed card still
>                              existed in the pci_devices table.
>
>                              What can I do to fix this problem?
>
>
>                          So, when you removed the GPU from the compute
>                          node and restarted the nova-compute service, it
>                          *should* have noticed you had removed the GPU and
>                          marked that PCI device as deleted. At least,
>                          according to this code in the PCI manager:
>
>                          https://github.com/openstack/nova/blob/master/nova/pci/manager.py#L168-L183
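>
>                          For context, the logic at that link does roughly
>                          the following (a simplified sketch of the
>                          Mitaka-era code, not a verbatim copy):
>
>                              # Reconcile the tracked PCI devices against
>                              # what the hypervisor currently reports. Any
>                              # device whose address has disappeared gets
>                              # removed, i.e. its DB row is soft-deleted,
>                              # unless its status forbids removal.
>                              exist_addrs = set(dev.address for dev in self.pci_devs)
>                              new_addrs = set(dev['address'] for dev in devices)
>
>                              for existed in self.pci_devs:
>                                  if existed.address in exist_addrs - new_addrs:
>                                      try:
>                                          existed.remove()
>                                      except exception.PciDeviceInvalidStatus:
>                                          # still claimed or allocated;
>                                          # keep tracking it for now
>                                          pass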
>
>                          Question for you: what is the value of the status
>                          field in the pci_devices table for the GPU that
>                          you removed?
>
>                          Best,
>                          -jay
>
>                          p.s. If you really want to get rid of that
>                          device, simply remove that record from the
>                          pci_devices table. But, again, it *should* be
>                          removed automatically...
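>
>                          A sketch of that manual cleanup (assuming direct
>                          access to the nova MySQL database; double-check
>                          which row matches in pci_devices before deleting):
>
>                              from sqlalchemy import create_engine, text
>
>                              # Hypothetical connection string; adjust it
>                              # for your deployment.
>                              engine = create_engine(
>                                  "mysql+pymysql://nova:secret@controller/nova")
>                              with engine.begin() as conn:
>                                  conn.execute(text(
>                                      "DELETE FROM pci_devices "
>                                      "WHERE vendor_id = '1002' "
>                                      "AND product_id = '68c8'"))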
>
>                          _______________________________________________
>                          Mailing list:
>                          http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>                          Post to     : openstack at lists.openstack.org
>                          Unsubscribe :
>                          http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>

