Aye, that seems to have done it. Strange that this wasn't necessary for my old 1080 cards. Thanks!

On Wed, 2024-10-30 at 08:12 +0000, Alexander Dibbo - STFC UKRI wrote:
Hi Andy,
We've seen many issues with PCI passthrough, and I suspect that the L40S will have similar issues to the ones we have seen with the A100s.
Here is an alias string that works for our a100s: `alias = {"vendor_id": "10de", "product_id": "20f1", "device_type": "type-PF", "name": "nvidia-tesla-a100-pcie-vga"}`
I believe it is the `device_type` that will be your issue
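To illustrate the `device_type` point: as far as I know, Nova treats an alias that omits `device_type` as a request for `type-PCI`, while a card that exposes SR-IOV capability is reported as `type-PF`, so the alias and the device never match. Below is a minimal Python sketch of that default, assuming the alias values are single-line JSON as in the examples in this thread. The helper name `effective_device_type` is hypothetical and not part of Nova.

```python
import json

# Assumption: the documented Nova default when an alias omits device_type.
DEFAULT_DEVICE_TYPE = "type-PCI"


def effective_device_type(alias_json: str) -> str:
    """Return the device_type that an alias definition will request."""
    alias = json.loads(alias_json)
    return alias.get("device_type", DEFAULT_DEVICE_TYPE)


# The L40S alias as originally posted (no device_type).
old = '{"name": "gpu-l40s", "product_id": "26b9", "vendor_id": "10de"}'
# The same alias with device_type added, mirroring the A100 example.
new = ('{"name": "gpu-l40s", "product_id": "26b9", "vendor_id": "10de", '
       '"device_type": "type-PF"}')

print(effective_device_type(old))
print(effective_device_type(new))
```

If the first result is `type-PCI` and the host reports the card as `type-PF`, the scheduler has nothing to match against, which would be consistent with the "No valid host" fault.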
Regards
Alexander Dibbo – Cloud Architect / Cloud Operations Group Leader
For STFC Cloud Documentation visit https://urldefense.us/v3/__https://stfc.atlassian.net/wiki/spaces/CLOUDKB/ov...
To raise a support ticket with the cloud team please email cloud-support@stfc.ac.uk
To receive notifications about the service please subscribe to our mailing list at: https://urldefense.us/v3/__https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=ST...
To receive fast notifications or to discuss usage of the cloud please join our Slack: https://urldefense.us/v3/__https://stfc-cloud.slack.com/__;!!A4KLn-w15EM4BRM...
-----Original Message-----
From: Andy Speagle <aspeagle@toyon.com>
Sent: 30 October 2024 02:15
To: openstack-discuss@lists.openstack.org
Subject: GPU PCI passthrough woes.
Hey Folks,
I could use a little assistance getting GPU passthrough working. I had this working already for one flavor of nvidia gpu... and I've added some hosts with a much newer gpu...
I've updated the pci_alias and pci_passthrough variables and those seem to be getting set properly in nova.conf
passthrough_whitelist = [{"vendor_id":"10de", "product_id":"1b06"},{"vendor_id":"10de", "product_id":"26b9"}]
alias = {"name": "gpu", "product_id": "1b06", "vendor_id": "10de"}
alias = {"name": "gpu-l40s", "product_id": "26b9", "vendor_id": "10de"}
I believe I have all of the IOMMU stuff configured and have the pci-stub module entries... dmesg output shows that the GPUs are being claimed by the stub module.
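One way to double-check the driver binding, and to see whether the card advertises SR-IOV (which, as I understand it, is what makes it show up as `type-PF` rather than `type-PCI`), is to look in sysfs. This is only a sketch, assuming the standard Linux layout under `/sys/bus/pci/devices`. The function name `describe_pci_devices` and its `sysfs_root` parameter are hypothetical conveniences, and the presence of `sriov_totalvfs` is a hint rather than the exact test Nova applies.

```python
import os


def describe_pci_devices(vendor_id, product_id, sysfs_root="/sys/bus/pci/devices"):
    """List matching PCI devices with their bound driver and SR-IOV hint."""
    results = []
    if not os.path.isdir(sysfs_root):
        return results  # not a Linux host, or sysfs is not mounted
    for addr in sorted(os.listdir(sysfs_root)):
        dev = os.path.join(sysfs_root, addr)
        try:
            with open(os.path.join(dev, "vendor")) as f:
                vendor = f.read().strip().lower()
            with open(os.path.join(dev, "device")) as f:
                device = f.read().strip().lower()
        except OSError:
            continue  # entry without readable IDs, skip it
        if vendor != "0x" + vendor_id.lower() or device != "0x" + product_id.lower():
            continue
        # The "driver" symlink points at the kernel driver bound to the device.
        driver_link = os.path.join(dev, "driver")
        driver = None
        if os.path.islink(driver_link):
            driver = os.path.basename(os.path.realpath(driver_link))
        # sriov_totalvfs is only present on SR-IOV capable physical functions.
        sriov = os.path.exists(os.path.join(dev, "sriov_totalvfs"))
        results.append({"address": addr, "driver": driver, "sriov_capable": sriov})
    return results


if __name__ == "__main__":
    # IDs taken from the whitelist in this thread (NVIDIA vendor, L40S product).
    for info in describe_pci_devices("10de", "26b9"):
        print(info)
```

A device listed with `sriov_capable` set to `True` is a candidate for needing `"device_type": "type-PF"` in its alias.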
$ openstack flavor show t1.small_gpu_l40s
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| access_project_ids         | None                                 |
| description                | None                                 |
| disk                       | 0                                    |
| id                         | af70c94e-0026-4a39-bc1e-dfb93b286a54 |
| name                       | t1.small_gpu_l40s                    |
| os-flavor-access:is_public | True                                 |
| properties                 | pci_passthrough:alias='gpu-l40s:1'   |
| ram                        | 2048                                 |
| rxtx_factor                | 1.0                                  |
| swap                       | 0                                    |
| vcpus                      | 1                                    |
+----------------------------+--------------------------------------+
Yet... I can't seem to get an instance to run using that new flavor... keeps complaining about there not being enough hosts available.
Fault:
  code: 500
  created: 2024-10-30T02:04:33Z
  message: "No valid host was found. There are not enough hosts available."
  details: |
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 1580, in schedule_and_build_instances
        host_lists = self._schedule_instances(context, request_specs[0],
      File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 940, in _schedule_instances
        host_lists = self.query_client.select_destinations(
      File "/usr/lib/python3/dist-packages/nova/scheduler/client/query.py", line 41, in select_destinations
        return self.scheduler_rpcapi.select_destinations(context, spec_obj,
      File "/usr/lib/python3/dist-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations
        return cctxt.call(ctxt, 'select_destinations', **msg_args)
      File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py", line 189, in call
        result = self.transport._send(
      File "/usr/lib/python3/dist-packages/oslo_messaging/transport.py", line 123, in _send
        return self._driver.send(target, ctxt, message,
      File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
        return self._send(target, ctxt, message, wait_for_reply, timeout,
      File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 681, in _send
        raise result
    nova.exception_Remote.NoValidHost_Remote: No valid host was found. There are not enough hosts available.

    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
        return func(*args, **kwargs)
      File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 223, in select_destinations
        selections = self._select_destinations(
      File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 250, in _select_destinations
        selections = self._schedule(
      File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 416, in _schedule
        self._ensure_sufficient_hosts(
      File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 455, in _ensure_sufficient_hosts
        raise exception.NoValidHost(reason=reason)
    nova.exception.NoValidHost: No valid host was found. There are not enough hosts available.
Any clues on how/where to dig into this more to see what might be missing? Thanks.
--
Andy Speagle
Sr. Site Reliability Engineer
Toyon Research Corporation
316.617.2431