Hi Andrew,
Just wanted to quickly say that I really appreciate your prompt reply and hope you'll be happy to assist further if possible. I've just gotten slightly sidetracked by some other issues but will come back to this in the next week and provide more background info and results of workaround attempts.
Cheers, Blair
On 28 Sep 2016 2:13 AM, "Andrew J Younge" ajyounge@indiana.edu wrote:
Hi Blair,
I'm very interested to hear more about your project using virtualized GPUs, and hopefully JP and/or I can be of help here.
So in the past we've struggled with using PCI bridges as a connector between multiple GPUs. We first saw this with Xen and S2070 servers (which have 4 older GPUs across Nvidia PCI bridges), where we found that ACS was preventing successful passthrough of the GPU. While we decided to use discrete, independent adapters moving forward, we've never gone back and tried this with KVM. I'd expect the same issues there, since ACS cannot guarantee proper isolation of the device. Looking at the K80 GPUs, I see 3 PLX bridges for each GPU pair (see my output below for a native system without KVM), and I'd guess these would likely end up in the same IOMMU group. This could be the problem.
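One way to check the IOMMU grouping described above is to walk the sysfs tree the Linux kernel exposes under /sys/kernel/iommu_groups on the host. The following is a minimal sketch, assuming a Linux host booted with the IOMMU enabled; the function names iommu_groups and shared_groups are hypothetical helpers made up for illustration, not part of any tool mentioned in this thread.

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number (as a string) to its sorted PCI addresses.

    Returns an empty dict when the directory is missing, which usually
    means the IOMMU is disabled or not exposed by the running kernel.
    """
    groups = {}
    if not os.path.isdir(root):
        return groups
    for group in os.listdir(root):
        dev_dir = os.path.join(root, group, "devices")
        if os.path.isdir(dev_dir):
            groups[group] = sorted(os.listdir(dev_dir))
    return groups


def shared_groups(groups):
    """Keep only the groups that contain more than one device."""
    return {g: devs for g, devs in groups.items() if len(devs) > 1}


if __name__ == "__main__":
    # Print every group; GPUs and PLX bridges that appear on the same
    # line cannot be isolated from each other for passthrough.
    for g, devs in sorted(iommu_groups().items(), key=lambda kv: int(kv[0])):
        print(f"group {g}: {' '.join(devs)}")
```

If the two GPUs of a K80 and their PLX bridge ports show up in one group, that would be consistent with the isolation concern raised here.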
I have heard that a patch exists for KVM that lets you override the IOMMU groups and ACS protections; however, I don't have any direct experience with it [1]. In our experiments we used an updated SeaBIOS, whereas the link below details a UEFI BIOS. This may have different implications that I don't have experience with. Furthermore, I assume this patch will likely just ignore all of ACS, which is an obvious and potentially severe security risk. In a purely academic environment such a risk may not matter, but it should be noted nonetheless.
So, let's take a few steps back to confirm things. Are you able to actually pass both K80 GPUs through to a running KVM instance and have the Nvidia drivers loaded? Any errors in the dmesg output here may go a long way. Are you also passing through the PCI bridge device (lspci should show one)? If you're actually making it that far, it may next be worth simply running a regular set of CUDA applications before trying any GPUDirect methods. For our GPUDirect usage, we were specifically leveraging the RDMA support with an InfiniBand adapter rather than CUDA P2P, so your mileage may vary there as well.
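The "are the Nvidia drivers loaded" check can be approximated from inside the guest (or on the host) by reading sysfs: each PCI device directory has a vendor file and, when a driver is bound, a driver symlink. This is a minimal sketch under those assumptions; nvidia_driver_report and the other function names are hypothetical, and it complements rather than replaces reading dmesg.

```python
import os

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID assigned to NVIDIA


def read_vendor(dev_path):
    """Return the contents of the device's sysfs 'vendor' file, or None."""
    try:
        with open(os.path.join(dev_path, "vendor")) as fh:
            return fh.read().strip()
    except OSError:
        return None


def bound_driver(dev_path):
    """Return the name of the bound kernel driver, or None if unbound."""
    link = os.path.join(dev_path, "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.readlink(link))


def nvidia_driver_report(root="/sys/bus/pci/devices"):
    """Map each NVIDIA PCI address under root to its bound driver (or None)."""
    report = {}
    if not os.path.isdir(root):
        return report
    for addr in sorted(os.listdir(root)):
        dev_path = os.path.join(root, addr)
        if read_vendor(dev_path) == NVIDIA_VENDOR_ID:
            report[addr] = bound_driver(dev_path)
    return report


if __name__ == "__main__":
    for addr, driver in nvidia_driver_report().items():
        print(f"{addr}: {driver or 'no driver bound'}")
```

On a passthrough host one would typically expect the GPUs to be bound to vfio-pci, and inside the guest to nvidia; a device with no driver bound is a hint to look at dmesg for the reason.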
Hopefully this is helpful in finding your problem. I'd also be interested to hear whether the ACS override mechanism, or any other option, works for enabling passthrough with K80 GPUs (we have a few dozen non-virtualized ones for another project). If you have any other non-bridged GPU cards (like a K20 or C2075) lying around, it may be worth trying one first to rule out other potential issues.
[1] https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#Bypassing_the_IOMMU_groups_.28ACS_override_patch.29
[root@r-001 ~]# lspci | grep -i -e PLX -e nvidia
02:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
03:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
03:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
05:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
06:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
07:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
07:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
08:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
09:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
82:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
84:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
85:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
86:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
88:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

[root@r-001 ~]# nvidia-smi topo --matrix
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  mlx4_0  CPU Affinity
GPU0     X    PIX   PHB   PHB   SOC   SOC   SOC   SOC   SOC     0-11,24-35
GPU1    PIX    X    PHB   PHB   SOC   SOC   SOC   SOC   SOC     0-11,24-35
GPU2    PHB   PHB    X    PIX   SOC   SOC   SOC   SOC   SOC     0-11,24-35
GPU3    PHB   PHB   PIX    X    SOC   SOC   SOC   SOC   SOC     0-11,24-35
GPU4    SOC   SOC   SOC   SOC    X    PIX   PHB   PHB   PHB     12-23,36-47
GPU5    SOC   SOC   SOC   SOC   PIX    X    PHB   PHB   PHB     12-23,36-47
GPU6    SOC   SOC   SOC   SOC   PHB   PHB    X    PIX   PHB     12-23,36-47
GPU7    SOC   SOC   SOC   SOC   PHB   PHB   PIX    X    PHB     12-23,36-47
mlx4_0  SOC   SOC   SOC   SOC   PHB   PHB   PHB   PHB    X
Legend:
  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch
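To read the topology matrix programmatically, for example to list which GPU pairs sit behind the same PCIe switch (PIX), the text output can be parsed directly. This is a minimal sketch that assumes the whitespace-separated layout printed by the nvidia-smi topo --matrix run in this email; other driver versions may format it differently. SAMPLE is copied from that output, and parse_topo and pairs_with are hypothetical helper names.

```python
SAMPLE = """
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx4_0 CPU Affinity
GPU0 X PIX PHB PHB SOC SOC SOC SOC SOC 0-11,24-35
GPU1 PIX X PHB PHB SOC SOC SOC SOC SOC 0-11,24-35
GPU2 PHB PHB X PIX SOC SOC SOC SOC SOC 0-11,24-35
GPU3 PHB PHB PIX X SOC SOC SOC SOC SOC 0-11,24-35
GPU4 SOC SOC SOC SOC X PIX PHB PHB PHB 12-23,36-47
GPU5 SOC SOC SOC SOC PIX X PHB PHB PHB 12-23,36-47
GPU6 SOC SOC SOC SOC PHB PHB X PIX PHB 12-23,36-47
GPU7 SOC SOC SOC SOC PHB PHB PIX X PHB 12-23,36-47
mlx4_0 SOC SOC SOC SOC PHB PHB PHB PHB X
"""


def parse_topo(text):
    """Return {(src, dst): link_type} for every off-diagonal matrix cell."""
    rows = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    # The header names the device columns; "CPU Affinity" is not a device.
    devices = [tok for tok in rows[0] if tok not in ("CPU", "Affinity")]
    links = {}
    for row in rows[1:]:
        src = row[0]
        cells = row[1:1 + len(devices)]  # drop the trailing affinity field
        for dst, kind in zip(devices, cells):
            if kind != "X":
                links[(src, dst)] = kind
    return links


def pairs_with(links, kind):
    """Return the sorted, de-duplicated device pairs joined by this link type."""
    return sorted({tuple(sorted(pair)) for pair, k in links.items() if k == kind})


if __name__ == "__main__":
    for a, b in pairs_with(parse_topo(SAMPLE), "PIX"):
        print(f"{a} <-> {b} share a PCIe internal switch")
```

On this sample the PIX pairs are the two GPUs on each K80 board, which matches the one-PLX-switch-per-GPU-pair layout in the lspci listing.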
Cheers, Andrew
Andrew J. Younge
School of Informatics & Computing
Indiana University / Bloomington, IN USA
ajyounge@indiana.edu / http://ajyounge.com
On Tue, Sep 27, 2016 at 4:37 AM, Blair Bethwaite blair.bethwaite@gmail.com wrote:
Hi Andrew, hi John -
I've just started trying to get CUDA P2P working in our virtualized HPC environment. I figure this must be something you solved already in order to produce the aforementioned paper, but having read it a couple of times I don't think it provides enough detail about the guest config, hoping you can shed some light...
The issue I'm grappling with is that the simpleP2P CUDA sample fails when checking the GPUs' ability to communicate with each other. This is despite using a qemu-kvm machine type (q35) with an emulated PCIe bus, seeing that the P2P-capable GPUs (NVIDIA K80s) are indeed attached to that bus, and nvidia-smi reporting them as sharing a PHB. Is there some magic config I might be missing? Did you need to make any PCI-ACS changes?
Best regards, Blair
On 16 March 2016 at 07:57, Blair Bethwaite blair.bethwaite@gmail.com wrote:
Hi Andrew,
On 16 March 2016 at 05:28, Andrew J Younge ajyounge@indiana.edu wrote:
point to a recent publication of ours at VEE15 titled "Supporting High Performance Molecular Dynamics in Virtualized Clusters using IOMMU, SR-IOV, and GPUDirect." In the paper we show that using Nvidia GPUs
...
Oooh interesting - GPUDirect too. That's something I've been wanting to try out in our environment. Will take a look at your paper...
-- Cheers, ~Blairo