[openstack-hpc] What's the state of openstack-hpc now?

Andrew J Younge ajyounge at indiana.edu
Tue Sep 27 16:12:15 UTC 2016


Hi Blair,

I'm very interested to hear more about your project using virtualized
GPUs, and hopefully JP and/or I can be of help here.

In the past we've struggled with PCI bridges acting as the connector
between multiple GPUs. We first saw this with Xen and S2070 servers
(which house 4 older GPUs behind Nvidia PCI bridges), where we found
that ACS was preventing successful passthrough of the GPUs. We simply
decided to use discrete, independent adapters moving forward and never
went back to try this with KVM, but I would expect the same issue
there, since ACS cannot guarantee proper isolation of the device.
Looking at the K80 GPUs, I'm seeing that there are 3 PLX bridges for
each GPU pair (see my output below from a native system without KVM),
and I'd expect these likely end up in the same IOMMU group. That could
be the problem.
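
For reference, a quick way to see how things are grouped on your host
is to walk the IOMMU groups in sysfs (a generic check, nothing
specific to our setup):

# List every device per IOMMU group on the passthrough host; if the K80s
# and their PLX bridges all land in the same group, ACS is the likely culprit.
for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    echo -n "  "; lspci -nns "${d##*/}"
  done
done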

I have heard that a patch exists for KVM that lets you override the
IOMMU groups and ACS protections, though I don't have any direct
experience with it [1]. In our experiments we used an updated SeaBIOS,
whereas the link below describes a UEFI (OVMF) firmware, which may
have implications I can't speak to. Furthermore, I assume the patch
simply ignores ACS altogether, which is an obvious and potentially
severe security risk. In a purely academic environment such a risk may
not matter, but it should be noted nonetheless.
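
If you do go down that road, my understanding from the wiki page (not
first-hand experience) is that the patched kernel adds a
pcie_acs_override boot parameter along these lines:

# Example kernel command line with the ACS override patch applied (per [1]):
# "downstream" splits devices behind downstream ports, "multifunction" splits
# multifunction devices -- and it forfeits the isolation ACS is meant to provide.
GRUB_CMDLINE_LINUX="... intel_iommu=on pcie_acs_override=downstream,multifunction"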

So, let's take a few steps back to confirm things. Are you able to
actually pass both K80 GPUs through to a running KVM instance and have
the Nvidia drivers load? Any dmesg errors there would go a long way.
Are you also passing through the PCI bridge device (lspci should show
one)? If you're making it that far, it may be worth running a regular
CUDA application set first, before trying any GPUDirect methods. For
our GPUDirect usage we were specifically leveraging RDMA support with
an InfiniBand adapter rather than CUDA P2P, so your mileage may vary
there as well.
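
Something along these lines inside the guest would tell us a lot (the
CUDA sample paths are just placeholders for wherever you have them
built):

# Inside the KVM guest: confirm both GPUs and the PLX bridge are visible,
# that the Nvidia driver bound to them, and look for any related errors.
lspci | grep -i -e plx -e nvidia
dmesg | grep -i -e nvrm -e iommu -e vfio
nvidia-smi topo --matrix

# Then try a plain CUDA sample before anything GPUDirect/P2P related:
./deviceQuery
./simpleP2P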

Hopefully this helps in tracking down your problem. I'd be interested
to hear whether the ACS override mechanism, or any other option, works
for enabling passthrough with K80 GPUs (we have a few dozen running
non-virtualized for another project). If you have any non-bridged GPU
cards (such as a K20 or C2075) lying around, it may be worth trying
one of those first to rule out other potential issues.

[1] https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#Bypassing_the_IOMMU_groups_.28ACS_override_patch.29

[root@r-001 ~]# lspci | grep -i -e PLX -e nvidia
02:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
03:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
03:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
05:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
06:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
07:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
07:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
08:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
09:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
82:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
84:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
85:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
86:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
88:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
[root@r-001 ~]# nvidia-smi topo --matrix
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   mlx4_0  CPU Affinity
GPU0     X     PIX    PHB    PHB    SOC    SOC    SOC    SOC    SOC     0-11,24-35
GPU1    PIX     X     PHB    PHB    SOC    SOC    SOC    SOC    SOC     0-11,24-35
GPU2    PHB    PHB     X     PIX    SOC    SOC    SOC    SOC    SOC     0-11,24-35
GPU3    PHB    PHB    PIX     X     SOC    SOC    SOC    SOC    SOC     0-11,24-35
GPU4    SOC    SOC    SOC    SOC     X     PIX    PHB    PHB    PHB     12-23,36-47
GPU5    SOC    SOC    SOC    SOC    PIX     X     PHB    PHB    PHB     12-23,36-47
GPU6    SOC    SOC    SOC    SOC    PHB    PHB     X     PIX    PHB     12-23,36-47
GPU7    SOC    SOC    SOC    SOC    PHB    PHB    PIX     X     PHB     12-23,36-47
mlx4_0  SOC    SOC    SOC    SOC    PHB    PHB    PHB    PHB     X

Legend:

  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch


Cheers,
Andrew


Andrew J. Younge
School of Informatics & Computing
Indiana University            /    Bloomington, IN USA
ajyounge at indiana.edu    /    http://ajyounge.com


On Tue, Sep 27, 2016 at 4:37 AM, Blair Bethwaite
<blair.bethwaite at gmail.com> wrote:
> Hi Andrew, hi John -
>
> I've just started trying to get CUDA P2P working in our virtualized
> HPC environment. I figure this must be something you solved already in
> order to produce the aforementioned paper, but having read it a couple
> of times I don't think it provides enough detail about the guest
> config, hoping you can shed some light...
>
> The issue I'm grappling with is that despite using a qemu-kvm machine
> type (q35) with an emulated PCIe bus and seeing that indeed the P2P
> capable GPUs (NVIDIA K80s) are attached to that bus, and nvidia-smi
> sees them as sharing a PHB, the simpleP2P CUDA sample fails when
> checking their ability to communicate with each other. Is there some
> magic config I might be missing, did you need to make any PCI-ACS
> changes?
>
> Best regards,
> Blair
>
>
> On 16 March 2016 at 07:57, Blair Bethwaite <blair.bethwaite at gmail.com> wrote:
>>
>> Hi Andrew,
>>
>> On 16 March 2016 at 05:28, Andrew J Younge <ajyounge at indiana.edu> wrote:
>> > point to a recent publication of ours at VEE15 titled "Supporting High
>> > Performance Molecular Dynamics in Virtualized Clusters using IOMMU,
>> > SR-IOV, and GPUDirect."  In the paper we show that using Nvidia GPUs
>> ...
>> > http://dl.acm.org/citation.cfm?id=2731194
>>
>> Oooh interesting - GPUDirect too. That's something I've been wanting
>> to try out in our environment. Will take a look at your paper...
>>
>> --
>> Cheers,
>> ~Blairo
>
>
> --
> Cheers,
> ~Blairo
