[Openstack-operators] PCI Passthrough issues
Jonathan D. Proulx
jon at csail.mit.edu
Wed Jul 6 16:32:26 UTC 2016
Joe, it seems to have been mostly solved with the qemu upgrade. Since I
plan on being on Mitaka before blessing the GPU instances with the
'production' label, I'm OK with that.
Blair, I reflexively blacklist the nouveau driver about 5 ways in my
installer and six in puppet :)
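
For anyone following along, a minimal sketch of how one might confirm
that nouveau really is absent on the hypervisor. The helper name
nouveau_loaded and the blacklist file name are assumptions for
illustration, not something taken from this thread:

```shell
#!/bin/sh
# Hypothetical helper: succeed if "nouveau" appears as a module name in
# lsmod-style output read from stdin. Reading stdin keeps it checkable
# on a machine without a GPU.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit found ? 0 : 1 }'
}

# On a real hypervisor:
#   lsmod | nouveau_loaded && echo "nouveau is loaded, blacklist and reboot"
#
# A commonly used blacklist file (the file name is an assumption):
#   /etc/modprobe.d/blacklist-nouveau.conf
#     blacklist nouveau
#     options nouveau modeset=0
# followed by "update-initramfs -u" on Ubuntu and a reboot.
```

Matching on the first column only avoids false positives from modules
that merely list nouveau in their "Used by" column.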
I do have an odd remaining issue where I can run CUDA jobs in the VM
but snapshots fail, and after the pause (for snapshotting) the PCI device
can't be reattached (which is where I think it deletes the snapshot
it took). I get the same issue with 3.16 and 4.4 kernels.
Not very well categorized yet, but I'm hoping it's because the VM I
was hacking on had its libvirt.xml written out with the older qemu,
maybe? It had been through a couple of reboots of the physical system,
though.
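
One thing that may help narrow down the reattach failure is checking
which host driver owns the GPU before and after the snapshot attempt.
This is only a sketch: the function name pci_driver and the PCI address
in the usage comment are hypothetical, and expecting vfio-pci while the
guest holds the device is an assumption about this setup.

```shell
#!/bin/sh
# Print the kernel driver bound to a PCI device, or "none" if unbound.
# $1: full PCI address (hypothetical example: 0000:04:00.0, see lspci -D)
# $2: sysfs root, defaults to /sys (overridable so it can be tested)
pci_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink whose last component is the name.
        basename "$(readlink "$link")"
    else
        echo none
    fi
}

# On the hypervisor, before and after the snapshot:
#   pci_driver 0000:04:00.0
```

If the answer changes across the pause, or comes back as "none", that
at least says the problem is on the host side rather than in the guest.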
Currently building a fresh instance and bashing more keys...
Thanks all,
-Jon
On Thu, Jul 07, 2016 at 12:35:33AM +1000, Blair Bethwaite wrote:
:Hi Jon,
:
:Do you have the nouveau driver/module loaded in the host by any
:chance? If so, blacklist, reboot, repeat.
:
:Whilst we're talking about this, has anyone had any luck doing this
:with hosts having a PCI-e switch across multiple GPUs?
:
:Cheers,
:
:On 6 July 2016 at 23:27, Jonathan D. Proulx <jon at csail.mit.edu> wrote:
:> Hi All,
:>
:> Trying to pass through some Nvidia K80 GPUs to some instances and have
:> gotten to the place where Nova seems to be doing the right thing: GPU
:> instances are scheduled on the one GPU hypervisor I have, and inside the
:> VM I see:
:>
:> root@gpu-x1:~# lspci | grep -i k80
:> 00:06.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
:>
:> And I can install the nvidia-361 driver and get
:>
:> # ls /dev/nvidia*
:> /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools
:>
:> Once I load up cuda-7.5 and build the examples, none of them run,
:> claiming there's no CUDA device.
:>
:> # ./matrixMul
:> [Matrix Multiply Using CUDA] - Starting...
:> cudaGetDevice returned error no CUDA-capable device is detected (code 38), line(396)
:> cudaGetDeviceProperties returned error no CUDA-capable device is detected (code 38), line(409)
:> MatrixA(160,160), MatrixB(320,160)
:> cudaMalloc d_A returned error no CUDA-capable device is detected (code 38), line(164)
:>
:> I'm not really familiar with CUDA, but I did get some example code
:> running on the physical system for burn-in over the weekend (since
:> reinstalled, so no nvidia driver on the hypervisor).
:>
:> Following various online examples for setting up pass through I set
:> the kernel boot line on the hypervisor to:
:>
:> # cat /proc/cmdline
:> BOOT_IMAGE=/boot/vmlinuz-3.13.0-87-generic root=UUID=d9bc9159-fedf-475b-b379-f65490c71860 ro console=tty0 console=ttyS1,115200 intel_iommu=on iommu=pt rd.modules-load=vfio-pci nosplash nomodeset intel_iommu=on iommu=pt rd.modules-load=vfio-pci nomdmonddf nomdmonisw
:>
:> Puzzled that I apparently have the device but it is apparently
:> nonfunctional, where do I even look from here?
:>
:> -Jon
:>
:>
:> _______________________________________________
:> OpenStack-operators mailing list
:> OpenStack-operators at lists.openstack.org
:> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
:
:
:
:--
:Cheers,
:~Blairo
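
As a follow-up to the kernel command line quoted above, here is a small
sketch of how the passthrough-related boot arguments could be checked
from a script. The function name has_karg is hypothetical, and the
arguments tested are simply the ones that appear in the quoted
/proc/cmdline, not a statement of what is required.

```shell
#!/bin/sh
# Succeed if the exact argument $1 is present in the kernel command
# line string $2 (whole-word match, so "iommu=pt" does not match
# "intel_iommu=pt").
has_karg() {
    for arg in $2; do
        [ "$arg" = "$1" ] && return 0
    done
    return 1
}

# On the hypervisor:
#   cmdline=$(cat /proc/cmdline)
#   has_karg intel_iommu=on "$cmdline" || echo "intel_iommu=on missing"
#   has_karg iommu=pt "$cmdline" || echo "iommu=pt missing"
#   ls /sys/kernel/iommu_groups   # non-empty once the IOMMU is active
```

The boot arguments only show what was requested; the iommu_groups
directory is the better indication that the IOMMU actually came up.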