Open Stack

Wed Jul 6 14:12:47 UTC 2016

Hi Jon,

We were also running into issues with the K80s.

For our GPU nodes, we've gone with a 4.2 or 4.4 kernel. PCI passthrough
works much better in those releases. One caveat: I ran into odd issues with
4.4 and NFS, downgraded to 4.2 after a few hours of banging my head, and the
problems went away. Not a scientific solution :)

After that, make sure vfio is loaded:

$ lsmod | grep vfio
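
A slightly more explicit version of that check, as a rough sketch only. The
module list (vfio, vfio_pci, vfio_iommu_type1) is an assumption about what a
typical Intel/KVM passthrough host loads, and check_vfio_modules is a made-up
helper name, not something from any package:

```shell
#!/bin/sh
# Sketch: report which vfio modules appear in lsmod-style text on stdin.
# The module list is an assumption (typical Intel/KVM passthrough host);
# check_vfio_modules is a hypothetical helper name.
check_vfio_modules() {
    # Column 1 of lsmod output is the module name; line 1 is the header.
    loaded=$(awk 'NR > 1 { print $1 }')
    missing=0
    for mod in vfio vfio_pci vfio_iommu_type1; do
        if printf '%s\n' "$loaded" | grep -qx "$mod"; then
            echo "ok: $mod"
        else
            echo "missing: $mod"
            missing=1
        fi
    done
    return "$missing"
}

# On the hypervisor:  lsmod | check_vfio_modules
```

Note that a module compiled into the kernel does not show up in lsmod, so a
"missing" line here is a hint to look closer rather than proof of a problem.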

Then start with the "deviceQuery" CUDA sample. We've found deviceQuery to
be a great check to see if the instance has full/correct access to the
card. If deviceQuery prints a report within 1-2 seconds, all is well. If
there is a lag, something is off.
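
If you want to script that timing check, something along these lines might
do. It is only a sketch: timed_check is a hypothetical helper name, the
2-second limit in the usage comment just mirrors the 1-2 second rule of thumb
above, and the ./deviceQuery path assumes you run it from wherever the sample
was built.

```shell
#!/bin/sh
# Sketch: run a command, discard its output, and flag it if it takes longer
# than a whole-second limit. timed_check is a hypothetical helper name.
timed_check() {
    limit=$1
    shift
    start=$(date +%s)
    if "$@" >/dev/null 2>&1; then status=0; else status=$?; fi
    elapsed=$(( $(date +%s) - start ))
    if [ "$status" -ne 0 ]; then
        echo "failed: exit status $status after ${elapsed}s"
        return 1
    fi
    if [ "$elapsed" -gt "$limit" ]; then
        echo "slow: ${elapsed}s (limit ${limit}s)"
        return 2
    fi
    echo "ok: ${elapsed}s"
}

# Inside the instance, from the directory holding the built sample:
#   timed_check 2 ./deviceQuery
```

The timing uses date +%s, so it only has one-second resolution. That is coarse,
but enough to tell "instant" from "noticeable lag". The command's own output is
thrown away, so run deviceQuery by hand as well to read the report.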

In our case for the K80s, that final "something" was qemu. We came across
this[1] wiki page (search for K80) and started digging into qemu. tl;dr:
upgrading to the qemu packages found in the Ubuntu Mitaka cloud archive
solved our issues.

Hope that helps,
Joe

1: https://pve.proxmox.com/wiki/Pci_passthrough

On Wed, Jul 6, 2016 at 7:27 AM, Jonathan D. Proulx <jon at csail.mit.edu>
wrote:

> Hi All,
>
> Trying to pass through some Nvidia K80 GPUs to some instances and have
> gotten to the place where Nova seems to be doing the right thing: GPU
> instances are scheduled on the one GPU hypervisor I have, and inside the
> VM I see:
>
> root@gpu-x1:~# lspci | grep -i k80
> 00:06.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
>
> And I can install the nvidia-361 driver and get
>
> # ls /dev/nvidia*
> /dev/nvidia0  /dev/nvidiactl  /dev/nvidia-uvm  /dev/nvidia-uvm-tools
>
> Once I load up cuda-7.5 and build the examples, none of them run,
> claiming there's no CUDA device.
>
> # ./matrixMul
> [Matrix Multiply Using CUDA] - Starting...
> cudaGetDevice returned error no CUDA-capable device is detected (code 38),
> line(396)
> cudaGetDeviceProperties returned error no CUDA-capable device is detected
> (code 38), line(409)
> MatrixA(160,160), MatrixB(320,160)
> cudaMalloc d_A returned error no CUDA-capable device is detected (code
> 38), line(164)
>
> I'm not really familiar with CUDA, but I did get some example code
> running on the physical system for burn-in over the weekend (since
> reinstalled, so there is no nvidia driver on the hypervisor).
>
> Following various online examples for setting up passthrough, I set
> the kernel boot line on the hypervisor to:
>
> # cat /proc/cmdline
> BOOT_IMAGE=/boot/vmlinuz-3.13.0-87-generic
> root=UUID=d9bc9159-fedf-475b-b379-f65490c71860 ro console=tty0
> console=ttyS1,115200 intel_iommu=on iommu=pt rd.modules-load=vfio-pci
> nosplash nomodeset intel_iommu=on iommu=pt rd.modules-load=vfio-pci
> nomdmonddf nomdmonisw
>
> Puzzled that I apparently have the device but it is apparently
> nonfunctional. Where do I even look from here?
>
> -Jon

[Openstack-operators] PCI Passthrough issues
