<div dir="ltr"><div><div><div>Hi Blair,<br><br></div>We only updated qemu. We're running the version of libvirt from the Kilo cloudarchive.<br><br></div>We've been in production with our K80s for around two weeks now and have had several users report success.<br><br></div><div>Thanks,<br></div>Joe<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jul 19, 2016 at 5:06 PM, Blair Bethwaite <span dir="ltr"><<a href="mailto:blair.bethwaite@gmail.com" target="_blank">blair.bethwaite@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hilariously (or not!) we finally hit the same issue last week once<br>

folks actually started trying to do something (other than build and<br>

load drivers) with the K80s we're passing through. This<br>

<a href="https://devtalk.nvidia.com/default/topic/850833/pci-passthrough-kvm-for-cuda-usage/" rel="noreferrer" target="_blank">https://devtalk.nvidia.com/default/topic/850833/pci-passthrough-kvm-for-cuda-usage/</a><br>

is the best discussion of the issue I've found so far, haven't tracked<br>

down an actual bug yet though. I wonder whether it has something to do<br>

with the memory size of the device, as we've been happy for a long<br>

time with other NVIDIA GPUs (GRID K1, K2, M2070, ...).<br>

<br>

Jon, when you grabbed Mitaka Qemu, did you also update libvirt? We're<br>

just working through this and have tried upgrading both but are<br>

hitting some issues with Nova and Neutron on the compute nodes,<br>

thinking it may be libvirt related, but debug isn't helping much yet.<br>

<br>

Cheers,<br>

<div class="HOEnZb"><div class="h5"><br>

On 8 July 2016 at 00:54, Jonathan Proulx <<a href="mailto:jon@csail.mit.edu">jon@csail.mit.edu</a>> wrote:<br>

> On Thu, Jul 07, 2016 at 11:13:29AM +1000, Blair Bethwaite wrote:<br>

> :Jon,<br>

> :<br>

> :Awesome, thanks for sharing. We've just run into an issue with SRIOV<br>

> :VF passthrough that sounds like it might be the same problem (device<br>

> :disappearing after a reboot), but haven't yet investigated deeply -<br>

> :this will help with somewhere to start!<br>

><br>

> :By the way, the nouveau mention was because we had missed it on some<br>

> :K80 hypervisors recently and seen passthrough apparently work, but<br>

> :then the NVIDIA drivers would not build in the guest as they claimed<br>

> :they could not find a supported device (despite the GPU being visible<br>

> :on the PCI bus).<br>

><br>

> Definitely sage advice!<br>

><br>

> :I have also heard passing mention of requiring qemu<br>

> :2.3+ but don't have any specific details of the related issue.<br>

><br>

> I didn't do a bisection but with qemu 2.2 (from ubuntu cloudarchive<br>

> kilo) I was sad and with 2.5 (from ubuntu cloudarchive mitaka but<br>

> installed on a kilo hypervisor) I am working.<br>

><br>

> Thanks,<br>

> -Jon<br>

><br>

><br>

> :Cheers,<br>

> :<br>

> :On 7 July 2016 at 08:13, Jonathan Proulx <<a href="mailto:jon@csail.mit.edu">jon@csail.mit.edu</a>> wrote:<br>

> :> On Wed, Jul 06, 2016 at 12:32:26PM -0400, Jonathan D. Proulx wrote:<br>

> :> :<br>

> :> :I do have an odd remaining issue where I can run cuda jobs in the vm<br>

> :> :but snapshots fail and after pause (for snapshotting) the pci device<br>

> :> :can't be reattached (which is where i think it deletes the snapshot<br>

> :> :it took).  Got same issue with 3.16 and 4.4 kernels.<br>

> :> :<br>

> :> :Not very well categorized yet, but I'm hoping it's because the VM I<br>

> :> :was hacking on had its libvirt.xml written out with the older qemu<br>

> :> :maybe?  It had been through a couple reboots of the physical system<br>

> :> :though.<br>

> :> :<br>

> :> :Currently building a fresh instance and bashing more keys...<br>

> :><br>

> :> After an ugly bout of bashing I've solved my failing snapshot issue,<br>

> :> which I'll post here in hopes of saving someone else the trouble.<br>

> :><br>

> :> Short version:<br>

> :><br>

> :> add "/dev/vfio/vfio rw," to  /etc/apparmor.d/abstractions/libvirt-qemu<br>

> :> add "ulimit -l unlimited" to /etc/init/libvirt-bin.conf<br>
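<br>
A minimal sketch, assuming the two file paths quoted in the short version above, of how one might check whether both edits are already present on a hypervisor. The helper names `has_active_line` and `check_host` are hypothetical and not part of libvirt or AppArmor; lines starting with `#` are treated as comments.<br>

```python
def has_active_line(text, wanted):
    """True if `wanted` appears as a non-comment line, ignoring whitespace differences."""
    target = wanted.split()
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # commented-out rules do not count
        if line.split() == target:
            return True
    return False


def check_host(apparmor_path="/etc/apparmor.d/abstractions/libvirt-qemu",
               upstart_path="/etc/init/libvirt-bin.conf"):
    """Report which of the two edits from the thread are present (paths as quoted above)."""
    with open(apparmor_path) as fh:
        vfio_rule = has_active_line(fh.read(), "/dev/vfio/vfio rw,")
    with open(upstart_path) as fh:
        memlock = has_active_line(fh.read(), "ulimit -l unlimited")
    return {"apparmor_vfio_rule": vfio_rule, "ulimit_memlock": memlock}
```

Calling `check_host()` on a hypervisor returns a small dict of booleans. It only confirms the text is in the files, not that the services were restarted afterwards.<br>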

> :><br>

> :> Longer version:<br>

> :><br>

> :> What was happening.<br>

> :><br>

> :> * send snapshot request<br>

> :> * instance pauses while snapshot is pending<br>

> :> * instance attempts to resume<br>

> :> * fails to reattach pci device<br>

> :>   * nova-compute.log<br>

> :>     Exception during message handling: internal error: unable to execute QEMU command 'device_add': Device initialization failed<br>

> :><br>

> :>   * qemu/<id>.log<br>

> :>     vfio: failed to open /dev/vfio/vfio: Permission denied<br>

> :>     vfio: failed to setup container for group 48<br>

> :>     vfio: failed to get group 48<br>

> :> * snapshot disappears<br>

> :> * instance resumes but without passed through device (hard reboot<br>

> :>     reattaches)<br>
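<br>
As a rough illustration of the log symptoms listed above, the following sketch pulls the VFIO group numbers out of a qemu instance log so the affected devices can be looked up. The regular expression is written against the three log lines quoted in this thread only, and the function names are hypothetical.<br>

```python
import re

# Matches the two group-related failure lines quoted in the thread.
GROUP_RE = re.compile(r"vfio: failed to (?:get|setup container for) group (\d+)")


def failed_vfio_groups(log_text):
    """Return the sorted, de-duplicated group numbers named in vfio failure lines."""
    return sorted({int(m.group(1)) for m in GROUP_RE.finditer(log_text)})


def group_devices_path(group):
    """sysfs directory listing the PCI devices that belong to an IOMMU group."""
    return "/sys/kernel/iommu_groups/%d/devices" % group
```

For the log excerpt above this yields group 48, and listing `/sys/kernel/iommu_groups/48/devices` on the hypervisor would show which PCI devices belong to it.<br>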

> :><br>

> :> Seeing permission denied, I thought this would be an easy fix, but:<br>

> :><br>

> :> # ls -l /dev/vfio/vfio<br>

> :> crw-rw-rw- 1 root root 10, 196 Jul  6 14:05 /dev/vfio/vfio<br>

> :><br>

> :> So I'm guessing I'm in apparmor hell. I tried adding "/dev/vfio/vfio<br>

> :> rw," to /etc/apparmor.d/abstractions/libvirt-qemu, rebooting the<br>

> :> hypervisor, and trying again, which got me a different set of<br>

> :> libvirt errors:<br>

> :><br>

> :> VFIO_MAP_DMA: -12<br>

> :> vfio_dma_map(0x5633a5fa69b0, 0x0, 0xa0000, 0x7f4e7be00000) = -12 (Cannot allocate memory)<br>

> :><br>

> :> kern.log (and thus dmesg) showing:<br>

> :> vfio_pin_pages: RLIMIT_MEMLOCK (65536) exceeded<br>

> :><br>

> :> Getting rid of this one required inserting 'ulimit -l unlimited' into<br>

> :> /etc/init/libvirt-bin.conf in the 'script' section:<br>

> :><br>

> :> <previous bits excluded><br>

> :> script<br>

> :>         [ -r /etc/default/libvirt-bin ] && . /etc/default/libvirt-bin<br>

> :>         ulimit -l unlimited<br>

> :>         exec /usr/sbin/libvirtd $libvirtd_opts<br>

> :> end script<br>
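<br>
To confirm that a change like the one above actually took effect, one option is to read the locked-memory limit of the running process from `/proc/PID/limits`. This is a sketch under the assumption of a Linux host with the usual procfs layout; `parse_memlock` and `read_memlock` are hypothetical helper names.<br>

```python
def parse_memlock(limits_text):
    """Return (soft, hard) from the 'Max locked memory' row; None means unlimited."""
    for line in limits_text.splitlines():
        if line.startswith("Max locked memory"):
            # Row layout: Max locked memory  SOFT  HARD  bytes
            fields = line.split()
            soft, hard = fields[3], fields[4]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max locked memory' row found")


def read_memlock(pid):
    """Read the locked-memory limits of a running process, e.g. libvirtd or a qemu guest."""
    with open("/proc/%d/limits" % pid) as fh:
        return parse_memlock(fh.read())
```

A result of `(65536, 65536)` would match the RLIMIT_MEMLOCK value in the kern.log message above, while `(None, None)` would indicate that the process is running with an unlimited locked-memory limit.<br>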

> :><br>

> :><br>

> :> -Jon<br>

> :><br>

> :> _______________________________________________<br>

> :> OpenStack-operators mailing list<br>

> :> <a href="mailto:OpenStack-operators@lists.openstack.org">OpenStack-operators@lists.openstack.org</a><br>

> :> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators" rel="noreferrer" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators</a><br>

> :<br>

> :<br>

> :<br>

> :--<br>

> :Cheers,<br>

> :~Blairo<br>

><br>

> --<br>

<br>

<br>

<br>

</div></div><span class="HOEnZb"><font color="#888888">--<br>

Cheers,<br>

~Blairo<br>

</font></span><div class="HOEnZb"><div class="h5"><br>


</div></div></blockquote></div><br></div>