Re: GPU PCI passthrough woes.
Aye, that seems to have done it. Strange that this wasn't necessary for my old 1080 cards.
Thanks!
On Wed, 2024-10-30 at 08:12 +0000, Alexander Dibbo - STFC UKRI wrote:
Hi Andy,
We've seen many issues with PCI passthrough, and I suspect the L40S will have similar issues to the ones we saw with the A100s.
Here is an alias string that works for our a100s: `alias = {"vendor_id": "10de", "product_id": "20f1", "device_type": "type-PF", "name": "nvidia-tesla-a100-pcie-vga"}`
I believe it is the `device_type` that will be your issue
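Applied to the L40S alias from the original message, the suggestion above might look like the output of the sketch below. The thread never shows the alias that was finally used, so this is an assumption modeled on the A100 example; the helper name `alias_line` is hypothetical and only renders text for nova.conf.

```python
import json

# Device types accepted in a nova PCI alias.
VALID_DEVICE_TYPES = ("type-PCI", "type-PF", "type-VF")


def alias_line(name, vendor_id, product_id, device_type=None):
    """Render one `alias = {...}` line for nova.conf (hypothetical helper)."""
    alias = {"name": name, "vendor_id": vendor_id, "product_id": product_id}
    if device_type is not None:
        if device_type not in VALID_DEVICE_TYPES:
            raise ValueError(f"unknown device_type: {device_type!r}")
        alias["device_type"] = device_type
    return "alias = " + json.dumps(alias)


# The older card keeps its alias without a device_type.
print(alias_line("gpu", "10de", "1b06"))
# The L40S alias with device_type added (assumed fix, mirroring the A100 alias).
print(alias_line("gpu-l40s", "10de", "26b9", device_type="type-PF"))
```

The flavor property `pci_passthrough:alias='gpu-l40s:1'` refers to the alias by name, so it would not need to change.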
Regards
Alexander Dibbo – Cloud Architect / Cloud Operations Group Leader For STFC Cloud Documentation visit https://urldefense.us/v3/__https://stfc.atlassian.net/wiki/spaces/CLOUDKB/ov... To raise a support ticket with the cloud team please email cloud-support@stfc.ac.uk To receive notifications about the service please subscribe to our mailing list at: https://urldefense.us/v3/__https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=ST... To receive fast notifications or to discuss usage of the cloud please join our Slack: https://urldefense.us/v3/__https://stfc-cloud.slack.com/__;!!A4KLn-w15EM4BRM...
-----Original Message-----
From: Andy Speagle <aspeagle@toyon.com>
Sent: 30 October 2024 02:15
To: openstack-discuss@lists.openstack.org
Subject: GPU PCI passthrough woes.
Hey Folks,
I could use a little assistance getting GPU passthrough working. I had this working already for one flavor of nvidia gpu... and I've added some hosts with a much newer gpu...
I've updated the pci_alias and pci_passthrough variables and those seem to be getting set properly in nova.conf
passthrough_whitelist = [{"vendor_id":"10de", "product_id":"1b06"},{"vendor_id":"10de", "product_id":"26b9"}]
alias = {"name": "gpu", "product_id": "1b06", "vendor_id": "10de"}
alias = {"name": "gpu-l40s", "product_id": "26b9", "vendor_id": "10de"}
I believe I have all of the IOMMU stuff configured and have the pci-stub module entries... dmesg output shows that the GPUs are being claimed by the stub module.
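Besides dmesg, the driver binding can be read from sysfs, where each PCI device directory has a `driver` symlink when a driver is bound. A minimal sketch, assuming a Linux host; the function name and the PCI address in the usage line are hypothetical placeholders.

```python
import os


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None.

    sysfs exposes the binding as a `driver` symlink inside the device
    directory; the last path component of its target is the driver name.
    """
    link = os.path.join(sysfs_root, pci_address, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or no such device
    return os.path.basename(os.readlink(link))


if __name__ == "__main__":
    # Hypothetical address; substitute one reported by lspci on the host.
    print(bound_driver("0000:41:00.0"))
```

On a host set up as described, the expectation would be `pci-stub` (or `vfio-pci` where that module is used instead) for each GPU address.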
$ openstack flavor show t1.small_gpu_l40s
+----------------------------+--------------------------------------+
| Field                      | Value                                |
+----------------------------+--------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                |
| OS-FLV-EXT-DATA:ephemeral  | 0                                    |
| access_project_ids         | None                                 |
| description                | None                                 |
| disk                       | 0                                    |
| id                         | af70c94e-0026-4a39-bc1e-dfb93b286a54 |
| name                       | t1.small_gpu_l40s                    |
| os-flavor-access:is_public | True                                 |
| properties                 | pci_passthrough:alias='gpu-l40s:1'   |
| ram                        | 2048                                 |
| rxtx_factor                | 1.0                                  |
| swap                       | 0                                    |
| vcpus                      | 1                                    |
+----------------------------+--------------------------------------+
Yet... I can't seem to get an instance to run using that new flavor... keeps complaining about there not being enough hosts available.
Fault:
  code: 500
  created: 2024-10-30T02:04:33Z
  message: "No valid host was found. There are not enough hosts available."
  details: |
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 1580, in schedule_and_build_instances
        host_lists = self._schedule_instances(context, request_specs[0],
      File "/usr/lib/python3/dist-packages/nova/conductor/manager.py", line 940, in _schedule_instances
        host_lists = self.query_client.select_destinations(
      File "/usr/lib/python3/dist-packages/nova/scheduler/client/query.py", line 41, in select_destinations
        return self.scheduler_rpcapi.select_destinations(context, spec_obj,
      File "/usr/lib/python3/dist-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations
        return cctxt.call(ctxt, 'select_destinations', **msg_args)
      File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/client.py", line 189, in call
        result = self.transport._send(
      File "/usr/lib/python3/dist-packages/oslo_messaging/transport.py", line 123, in _send
        return self._driver.send(target, ctxt, message,
      File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
        return self._send(target, ctxt, message, wait_for_reply, timeout,
      File "/usr/lib/python3/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 681, in _send
        raise result
    nova.exception_Remote.NoValidHost_Remote: No valid host was found. There are not enough hosts available.
    Traceback (most recent call last):
      File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 241, in inner
        return func(*args, **kwargs)
      File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 223, in select_destinations
        selections = self._select_destinations(
      File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 250, in _select_destinations
        selections = self._schedule(
      File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 416, in _schedule
        self._ensure_sufficient_hosts(
      File "/usr/lib/python3/dist-packages/nova/scheduler/manager.py", line 455, in _ensure_sufficient_hosts
        raise exception.NoValidHost(reason=reason)
    nova.exception.NoValidHost: No valid host was found. There are not enough hosts available.
Any clues on how/where to dig into this more to see what might be missing? Thanks.
-- Andy Speagle
-- Andy Speagle Sr. Site Reliability Engineer Toyon Research Corporation 316.617.2431
Your 1080 cards wouldn’t have been SRIOV unlike L40 and A100.
On 30/10/2024 14:33, Mike Lowe wrote:
Your 1080 cards wouldn’t have been SRIOV unlike L40 and A100.
For some context on this: nova decides the device_type based on the PCIe capabilities the device reports. If the card does not report the ability to create virtual functions as a PCIe capability, we track it as type-PCI; if it does support VFs, we report it as type-PF. By default, type-PF devices are filtered out unless expressly asked for in the alias, so unlike the GTX 1080, when passing through an L40S you need to explicitly request type-PF in the alias.
You can check this with `sudo lspci -nvvs <address>` or with `virsh nodedev-dumpxml <name>`. Nova parses the latter, but libvirt basically gets the same info shown by lspci and turns it into XML.
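The classification described above can be approximated against the output of `virsh nodedev-dumpxml`. What follows is a simplified sketch and not nova's actual code: it assumes the libvirt node-device XML layout, in which the `pci` capability carries a nested `virt_functions` capability on a physical function and a nested `phys_function` capability on a virtual function. The function names and the toy filter rule are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def classify_device(nodedev_xml):
    """Map libvirt node-device XML to a nova-style device type string."""
    root = ET.fromstring(nodedev_xml)
    pci_cap = root.find("capability[@type='pci']")
    if pci_cap is None:
        raise ValueError("not a PCI node device")
    nested = {cap.get("type") for cap in pci_cap.findall("capability")}
    if "virt_functions" in nested:
        return "type-PF"   # device can create VFs (SR-IOV capable)
    if "phys_function" in nested:
        return "type-VF"   # device is itself a virtual function
    return "type-PCI"      # plain PCI device, e.g. an older GPU


def device_allowed(alias, device_type):
    """Toy model of the filter: PFs match only when the alias asks for them."""
    requested = alias.get("device_type")
    if device_type == "type-PF":
        return requested == "type-PF"
    return requested in (None, device_type)
```

Under this model, an alias without `device_type` matches a device classified as type-PCI but not one classified as type-PF, which is consistent with the "No valid host was found" failure seen for the L40S flavor.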
participants (3)
- Andy Speagle
- Mike Lowe
- Sean Mooney