Open Stack

Wed Jul 6 16:32:26 UTC 2016

Joe, it seems to have been mostly solved by the qemu upgrade.  Since I
plan on being on Mitaka before blessing the GPU instances with the
'production' label, I'm OK with that.

Blair, I reflexively blacklist the nouveau drivers about 5 ways in my
installer and six in puppet :)
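
A minimal sketch of double-checking that on a host, assuming the usual
/proc/modules layout (module name in the first field) and a modprobe.d
file with "blacklist <module>" lines. The function names are made up
for illustration and are not part of any tool mentioned in this thread.

```python
def nouveau_loaded(proc_modules_text):
    """Return True if a module named 'nouveau' is listed in /proc/modules-style text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0] == "nouveau":
            return True
    return False


def nouveau_blacklisted(modprobe_conf_text):
    """Return True if the text has an uncommented 'blacklist nouveau' line."""
    for line in modprobe_conf_text.splitlines():
        # Drop anything after '#', then compare the remaining tokens.
        tokens = line.split("#", 1)[0].split()
        if tokens == ["blacklist", "nouveau"]:
            return True
    return False
```

Feeding the first function the contents of /proc/modules and the second
the contents of a file under /etc/modprobe.d/ gives a quick yes/no for
each; a blacklist entry only takes effect once the module is unloaded or
the host is rebooted.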

I do have an odd remaining issue: I can run CUDA jobs in the VM, but
snapshots fail, and after the pause (for snapshotting) the PCI device
can't be reattached (which is where I think it deletes the snapshot
it took).  I get the same issue with the 3.16 and 4.4 kernels.

Not very well categorized yet, but I'm hoping it's because the VM I
was hacking on had its libvirt.xml written out with the older qemu.
It had been through a couple of reboots of the physical system,
though.
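
A rough sketch of how to see what a saved domain definition records,
assuming the standard libvirt domain XML layout (machine type on
<os><type machine=...>, passed-through PCI devices as <hostdev
type='pci'> entries). The function name is invented for illustration,
and whether a stale machine type is actually behind the snapshot
failure is unconfirmed.

```python
import xml.etree.ElementTree as ET


def summarize_domain(xml_text):
    """Return (machine_type, [pci_address, ...]) from libvirt domain XML text."""
    root = ET.fromstring(xml_text)

    # <os><type machine='...'> holds the machine type the domain was defined with.
    os_type = root.find("./os/type")
    machine = os_type.get("machine") if os_type is not None else None

    # Collect the host-side address of every passed-through PCI device.
    addresses = []
    for hostdev in root.findall("./devices/hostdev"):
        if hostdev.get("type") != "pci":
            continue
        addr = hostdev.find("./source/address")
        if addr is not None:
            addresses.append("{}:{}:{}.{}".format(
                addr.get("domain"), addr.get("bus"),
                addr.get("slot"), addr.get("function")))
    return machine, addresses
```

Running this over the old instance's libvirt.xml and over the one for a
freshly built instance, then comparing the two results, would show
whether the machine type or the hostdev entries differ.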

Currently building a fresh instance and bashing more keys...

Thanks all,

-Jon

On Thu, Jul 07, 2016 at 12:35:33AM +1000, Blair Bethwaite wrote:
:Hi Jon,
:
:Do you have the nouveau driver/module loaded in the host by any
:chance? If so, blacklist, reboot, repeat.
:
:Whilst we're talking about this. Has anyone had any luck doing this
:with hosts having a PCI-e switch across multiple GPUs?
:
:Cheers,
:
:On 6 July 2016 at 23:27, Jonathan D. Proulx <jon at csail.mit.edu> wrote:
:> Hi All,
:>
:> Trying to pass through some Nvidia K80 GPUs to some instances and have
:> gotten to the place where Nova seems to be doing the right thing: GPU
:> instances are scheduled on the 1 GPU hypervisor I have, and inside the
:> VM I see:
:>
:> root@gpu-x1:~# lspci | grep -i k80
:> 00:06.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
:>
:> And I can install the nvidia-361 driver and get
:>
:> # ls /dev/nvidia*
:> /dev/nvidia0  /dev/nvidiactl  /dev/nvidia-uvm  /dev/nvidia-uvm-tools
:>
:> Once I load up cuda-7.5 and build the examples, none of them run,
:> claiming there's no CUDA device.
:>
:> # ./matrixMul
:> [Matrix Multiply Using CUDA] - Starting...
:> cudaGetDevice returned error no CUDA-capable device is detected (code 38), line(396)
:> cudaGetDeviceProperties returned error no CUDA-capable device is detected (code 38), line(409)
:> MatrixA(160,160), MatrixB(320,160)
:> cudaMalloc d_A returned error no CUDA-capable device is detected (code 38), line(164)
:>
:> I'm not really familiar with CUDA, but I did get some example code
:> running on the physical system for burn-in over the weekend (since
:> reinstalled, so no nvidia driver on the hypervisor).
:>
:> Following various online examples for setting up passthrough, I set
:> the kernel boot line on the hypervisor to:
:>
:> # cat /proc/cmdline
:> BOOT_IMAGE=/boot/vmlinuz-3.13.0-87-generic root=UUID=d9bc9159-fedf-475b-b379-f65490c71860 ro console=tty0 console=ttyS1,115200 intel_iommu=on iommu=pt rd.modules-load=vfio-pci nosplash nomodeset intel_iommu=on iommu=pt rd.modules-load=vfio-pci nomdmonddf nomdmonisw
:>
:> Puzzled that I apparently have the device but it is apparently
:> nonfunctional, where do I even look from here?
:>
:> -Jon
:>
:>
:> _______________________________________________
:> OpenStack-operators mailing list
:> OpenStack-operators at lists.openstack.org
:> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
:
:
:
:-- 
:Cheers,
:~Blairo

[Openstack-operators] PCI Passthrough issues