[Openstack-operators] PCI Passthrough issues

Blair Bethwaite blair.bethwaite at gmail.com
Tue Jul 19 23:06:15 UTC 2016


Hilariously (or not!) we finally hit the same issue last week once
folks actually started trying to do something (other than build and
load drivers) with the K80s we're passing through. This
https://devtalk.nvidia.com/default/topic/850833/pci-passthrough-kvm-for-cuda-usage/
is the best discussion of the issue I've found so far, though I haven't
tracked down an actual bug yet. I wonder whether it has something to do
with the memory size of the device, as we've been happy for a long
time with other NVIDIA GPUs (GRID K1, K2, M2070, ...).

Jon, when you grabbed Mitaka Qemu, did you also update libvirt? We're
just working through this and have tried upgrading both but are
hitting some issues with Nova and Neutron on the compute nodes. We think
it may be libvirt related, but debug isn't helping much yet.

Cheers,

On 8 July 2016 at 00:54, Jonathan Proulx <jon at csail.mit.edu> wrote:
> On Thu, Jul 07, 2016 at 11:13:29AM +1000, Blair Bethwaite wrote:
> :Jon,
> :
> :Awesome, thanks for sharing. We've just run into an issue with SRIOV
> :VF passthrough that sounds like it might be the same problem (device
> :disappearing after a reboot), but haven't yet investigated deeply -
> :this will help with somewhere to start!
>
> :By the way, the nouveau mention was because we had missed it on some
> :K80 hypervisors recently and seen passthrough apparently work, but
> :then the NVIDIA drivers would not build in the guest as they claimed
> :they could not find a supported device (despite the GPU being visible
> :on the PCI bus).
>
> Definitely sage advice!
>
> :I have also heard passing mention of requiring qemu
> :2.3+ but don't have any specific details of the related issue.
>
> I didn't do a bisection but with qemu 2.2 (from ubuntu cloudarchive
> kilo) I was sad and with 2.5 (from ubuntu cloudarchive mitaka but
> installed on a kilo hypervisor) I am working.
>
> Thanks,
> -Jon
>
>
> :Cheers,
> :
> :On 7 July 2016 at 08:13, Jonathan Proulx <jon at csail.mit.edu> wrote:
> :> On Wed, Jul 06, 2016 at 12:32:26PM -0400, Jonathan D. Proulx wrote:
> :> :
> :> :I do have an odd remaining issue where I can run cuda jobs in the vm
> :> :but snapshots fail and after pause (for snapshotting) the pci device
> :> :can't be reattached (which is where i think it deletes the snapshot
> :> :it took).  Got same issue with 3.16 and 4.4 kernels.
> :> :
> :> :Not very well categorized yet, but I'm hoping it's because the VM I
> :> :was hacking on had its libvirt.xml written out with the older qemu
> :> :maybe?  It had been through a couple reboots of the physical system
> :> :though.
> :> :
> :> :Currently building a fresh instance and bashing more keys...
> :>
> :> After an ugly bout of bashing I've solved my failing snapshot issue,
> :> which I'll post here in hopes of saving someone else the trouble.
> :>
> :> Short version:
> :>
> :> add "/dev/vfio/vfio rw," to  /etc/apparmor.d/abstractions/libvirt-qemu
> :> add "ulimit -l unlimited" to /etc/init/libvirt-bin.conf
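Before editing the abstraction it can help to check whether the rule is already there. A minimal sketch, assuming a POSIX shell with grep; the function name has_vfio_rule is hypothetical and not part of AppArmor or libvirt:

```shell
#!/bin/sh
# Sketch: report whether an AppArmor profile file already grants rw access
# to /dev/vfio/vfio. The function name has_vfio_rule is hypothetical.
has_vfio_rule() {
    # $1: path to the profile or abstraction file to inspect.
    # Leading whitespace is allowed; commented-out lines do not match.
    grep -Eq '^[[:space:]]*/dev/vfio/vfio[[:space:]]+rw,' "$1"
}

# Example usage (path taken from the quoted message):
#   has_vfio_rule /etc/apparmor.d/abstractions/libvirt-qemu && echo present
```

The function only inspects the file on disk; it says nothing about whether the profile has been reloaded since the file was edited.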
> :>
> :> Longer version:
> :>
> :> What was happening.
> :>
> :> * send snapshot request
> :> * instance pauses while snapshot is pending
> :> * instance attempts to resume
> :> * fails to reattach pci device
> :>   * nova-compute.log
> :>     Exception during message handling: internal error: unable to execute QEMU command 'device_add': Device initialization failed
> :>
> :>   * qemu/<id>.log
> :>     vfio: failed to open /dev/vfio/vfio: Permission denied
> :>     vfio: failed to setup container for group 48
> :>     vfio: failed to get group 48
> :> * snapshot disappears
> :> * instance resumes but without passed through device (hard reboot
> :>     reattaches)
> :>
> :> Seeing permission denied, I thought this would be an easy fix, but:
> :>
> :> # ls -l /dev/vfio/vfio
> :> crw-rw-rw- 1 root root 10, 196 Jul  6 14:05 /dev/vfio/vfio
> :>
> :> so I'm guessing I'm in apparmor hell. I tried adding "/dev/vfio/vfio
> :> rw," to /etc/apparmor.d/abstractions/libvirt-qemu, rebooting the
> :> hypervisor, and trying again, which got me a different set of libvirt
> :> errors:
> :>
> :> VFIO_MAP_DMA: -12
> :> vfio_dma_map(0x5633a5fa69b0, 0x0, 0xa0000, 0x7f4e7be00000) = -12 (Cannot allocate memory)
> :>
> :> kern.log (and thus dmesg) showing:
> :> vfio_pin_pages: RLIMIT_MEMLOCK (65536) exceeded
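The two log excerpts can be read together with a little arithmetic. This is only a sketch: it assumes the third argument printed for vfio_dma_map (0xa0000) is the mapping size in bytes, which is an assumption about QEMU's message format and is not stated in the thread. The helper name exceeds_limit is hypothetical.

```shell
#!/bin/sh
# Values copied from the log lines quoted above.
map_size=$((0xa0000))   # assumed: size argument of the failing vfio_dma_map call
memlock_limit=65536     # RLIMIT_MEMLOCK reported by vfio_pin_pages, in bytes

# exceeds_limit SIZE LIMIT: succeed when SIZE is larger than LIMIT.
exceeds_limit() {
    [ "$1" -gt "$2" ]
}

if exceeds_limit "$map_size" "$memlock_limit"; then
    echo "mapping of $((map_size / 1024)) KiB exceeds limit of $((memlock_limit / 1024)) KiB"
fi
```

Under that assumption even this first 640 KiB mapping is ten times the 64 KiB limit, which would be consistent with the -12 (Cannot allocate memory) result.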
> :>
> :> Getting rid of this one required inserting 'ulimit -l unlimited' into
> :> /etc/init/libvirt-bin.conf in the 'script' section:
> :>
> :> <previous bits excluded>
> :> script
> :>         [ -r /etc/default/libvirt-bin ] && . /etc/default/libvirt-bin
> :>         ulimit -l unlimited
> :>         exec /usr/sbin/libvirtd $libvirtd_opts
> :> end script
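One way to confirm that a change like this took effect is to read the limit the running process actually has, rather than trusting the config file. A minimal sketch, assuming a Linux /proc filesystem and awk; the function name memlock_soft_limit is hypothetical:

```shell
#!/bin/sh
# memlock_soft_limit FILE: print the soft "Max locked memory" value from a
# /proc/<pid>/limits style file. The function name is hypothetical.
memlock_soft_limit() {
    # Columns are: Max locked memory <soft> <hard> <units>
    awk '/^Max locked memory/ { print $4 }' "$1"
}

# Example usage against a running libvirtd (assumes pidof is available):
#   memlock_soft_limit "/proc/$(pidof libvirtd)/limits"
```

A value of "unlimited" would suggest the ulimit line was picked up. Child processes normally inherit limits from their parent, so the same check can be pointed at a QEMU process ID to see what the guest is really running with.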
> :>
> :>
> :> -Jon
> :>
> :
> :
> :
> :--
> :Cheers,
> :~Blairo
>
> --



-- 
Cheers,
~Blairo


