GPU PCI passthrough woes.
Hey Folks,

I could use a little assistance getting GPU passthrough working. I had this working already for one flavor of nvidia gpu... and I've added some hosts with a much newer gpu...

I've updated the pci_alias and pci_passthrough variables and those seem to be getting set properly in nova.conf:

    passthrough_whitelist = [{"vendor_id":"10de", "product_id":"1b06"},{"vendor_id":"10de", "product_id":"26b9"}]
    alias = {"name": "gpu", "product_id": "1b06", "vendor_id": "10de"}
    alias = {"name": "gpu-l40s", "product_id": "26b9", "vendor_id": "10de"}

I believe I have all of the iommu stuff configured and have the pci-stub module entries... dmesg output shows that the GPUs are being claimed by the stub module.

    $ openstack flavor show t1.small_gpu_l40s
    +----------------------------+--------------------------------------+
    | Field                      | Value                                |
    +----------------------------+--------------------------------------+
    | OS-FLV-DISABLED:disabled   | False                                |
    | OS-FLV-EXT-DATA:ephemeral  | 0                                    |
    | access_project_ids         | None                                 |
    | description                | None                                 |
    | disk                       | 0                                    |
    | id                         | af70c94e-0026-4a39-bc1e-dfb93b286a54 |
    | name                       | t1.small_gpu_l40s                    |
    | os-flavor-access:is_public | True                                 |
    | properties                 | pci_passthrough:alias='gpu-l40s:1'   |
    | ram                        | 2048                                 |
    | rxtx_factor                | 1.0                                  |
    | swap                       | 0                                    |
    | vcpus                      | 1                                    |
    +----------------------------+--------------------------------------+

Yet... I can't seem to get an instance to run using that new flavor... keeps complaining about there not being enough hosts available.

    Fault:
      code: 500
      created: 2024-10-30T02:04:33Z
      message: "No valid host was found. There are not enough hosts available."
      details: |
        Traceback (most recent call last):
          File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 1580, in schedule_and_build_instances
            host_lists = self._schedule_instances(context, request_specs[0],
          File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 940, in _schedule_instances
            host_lists = self.query_client.select_destinations(
          File "/usr/lib/python3/dist-packages/nova/scheduler/client/query.py", line 41, in select_destinations
            return self.scheduler_rpcapi.select_destinations(context, spec_obj,
          File "/usr/lib/python3/dist-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations
            return cctxt.call(ctxt, 'select_destinations', **msg_args)
          File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py", line 189, in call
            result = self.transport._send(
          File "/usr/lib/python3/dist-packages/oslo_messaging/transport.py", line 123, in _send
            return self._driver.send(target, ctxt, message,
          File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
            return self._send(target, ctxt, message, wait_for_reply, timeout,
          File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 681, in _send
            raise result
        nova.exception_Remote.NoValidHost_Remote: No valid host was found. There are not enough hosts available.
        Traceback (most recent call last):
          File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
            return func(*args, **kwargs)
          File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 223, in select_destinations
            selections = self._select_destinations(
          File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 250, in _select_destinations
            selections = self._schedule(
          File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 416, in _schedule
            self._ensure_sufficient_hosts(
          File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 455, in _ensure_sufficient_hosts
            raise exception.NoValidHost(reason=reason)
        nova.exception.NoValidHost: No valid host was found. There are not enough hosts available.

Any clues on how/where to dig into this more to see what might be missing?

Thanks.

--
Andy Speagle
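One quick sanity check on the values quoted in the message above is to confirm that every alias points at a vendor/product pair that is also whitelisted, and that the flavor property names an alias that exists. The following is only a minimal sketch: the helper names `unmatched_aliases` and `flavor_alias_requests` are hypothetical, the strings are copied from the message rather than read from a live nova.conf, and whitelist entries that match by address or other fields are not handled.

```python
import json

# Values copied from the nova.conf excerpt and flavor in the message above.
WHITELIST = ('[{"vendor_id":"10de", "product_id":"1b06"},'
             '{"vendor_id":"10de", "product_id":"26b9"}]')
ALIASES = [
    '{"name": "gpu", "product_id": "1b06", "vendor_id": "10de"}',
    '{"name": "gpu-l40s", "product_id": "26b9", "vendor_id": "10de"}',
]
FLAVOR_PROPERTY = "gpu-l40s:1"  # value of pci_passthrough:alias


def unmatched_aliases(whitelist_raw, alias_raws):
    """Return names of aliases whose vendor/product pair is not whitelisted."""
    allowed = {(e["vendor_id"], e["product_id"])
               for e in json.loads(whitelist_raw)}
    missing = []
    for raw in alias_raws:
        alias = json.loads(raw)
        if (alias["vendor_id"], alias["product_id"]) not in allowed:
            missing.append(alias["name"])
    return missing


def flavor_alias_requests(prop):
    """Split 'name:count[,name:count]' into (name, count) tuples."""
    requests = []
    for item in prop.split(","):
        name, _, count = item.strip().rpartition(":")
        requests.append((name, int(count)))
    return requests


if __name__ == "__main__":
    defined = {json.loads(raw)["name"] for raw in ALIASES}
    print("aliases missing from whitelist:",
          unmatched_aliases(WHITELIST, ALIASES))
    for name, count in flavor_alias_requests(FLAVOR_PROPERTY):
        print(name, count, "defined" if name in defined else "NOT defined")
```

For the configuration as posted, both checks come back clean, which suggests the mismatch lies somewhere other than the vendor and product IDs.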
Hi Andy,

We've seen many issues with PCI passthrough, and I suspect the L40S will hit issues similar to the ones we saw with the A100s.

Here is an alias string that works for our A100s:

`alias = {"vendor_id": "10de", "product_id": "20f1", "device_type": "type-PF", "name": "nvidia-tesla-a100-pcie-vga"}`

I believe it is the `device_type` that will be your issue.

Regards,
Alexander Dibbo – Cloud Architect / Cloud Operations Group Leader
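To follow up on the `device_type` suggestion above: as far as I know, a Nova alias without `device_type` is treated as `type-PCI`, so it would not match a card that the compute node reports as an SR-IOV physical function (`type-PF`). Whether the L40S hosts in this thread actually report `type-PF` is not confirmed here. The sketch below approximates the classification from sysfs (Nova itself relies on libvirt's node device data, so this is an assumption-laden shortcut) and prints a matching alias line. The function names and the PCI address in the usage block are hypothetical.

```python
import json
from pathlib import Path


def classify_pci_device(dev_dir):
    """Guess the Nova device_type for a sysfs PCI device directory.

    Assumption: a 'physfn' link marks a virtual function and an
    'sriov_totalvfs' attribute marks an SR-IOV capable physical function.
    """
    dev_dir = Path(dev_dir)
    if (dev_dir / "physfn").exists():
        return "type-VF"
    if (dev_dir / "sriov_totalvfs").exists():
        return "type-PF"
    return "type-PCI"


def alias_line(name, vendor_id, product_id, device_type):
    """Format a nova.conf alias entry in the same shape as the A100 example."""
    alias = {
        "vendor_id": vendor_id,
        "product_id": product_id,
        "device_type": device_type,
        "name": name,
    }
    return "alias = " + json.dumps(alias)


if __name__ == "__main__":
    # Hypothetical PCI address; substitute the GPU's address from lspci.
    device = "/sys/bus/pci/devices/0000:41:00.0"
    dev_type = classify_pci_device(device)
    print(alias_line("gpu-l40s", "10de", "26b9", dev_type))
```

If the card does turn out to be `type-PF`, the resulting alias would need to be set wherever aliases are already configured for the deployment, with the affected Nova services restarted afterwards.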