Hi Blair,

I'm very interested to hear more about your project using virtualized GPUs, and hopefully JP and/or I can be of help here.

In the past we've struggled with GPUs that sit behind PCI bridges. We first saw this with Xen on S2070 servers (which have 4 older GPUs behind NVIDIA PCI bridges), where ACS prevented successful passthrough of the GPUs. We simply moved to discrete, independent adapters after that and never went back to try it with KVM, but I'd expect the same issue there, since ACS cannot guarantee proper isolation of the devices. Looking at the K80s, I see 3 PLX bridge functions for each GPU pair (see my output below from a native system without KVM), and I'd expect those bridges and GPUs to land in the same IOMMU group. This could be your problem.

I have heard there is a patch for KVM that lets you override the IOMMU grouping and ACS protections, but I don't have any direct experience with it [1]. In our experiments we used an updated SeaBIOS, whereas the link below describes a UEFI (OVMF) setup, which may have implications I can't speak to. Also, I assume the patch likely just ignores ACS altogether, which is an obvious and potentially severe security risk. In a purely academic environment that may not matter, but it should be noted nonetheless.

So, let's take a few steps back to confirm things. Are you able to pass both K80 GPUs through to a running KVM instance and load the NVIDIA drivers? Any errors in dmesg would go a long way here. Are you also passing through the PCI bridge device (lspci should show one)? If you're making it that far, it may be worth running a regular CUDA application set before trying any GPUDirect methods. Note that for our GPUDirect usage we were specifically leveraging RDMA support with an InfiniBand adapter rather than CUDA P2P, so your mileage may vary there as well.

Hopefully this helps you track down the problem. I'd be interested to hear whether the ACS override mechanism, or any other option, works for enabling passthrough with K80 GPUs (we have a few dozen, non-virtualized, for another project). If you have any non-bridged GPU cards (such as a K20 or C2075) lying around, it may also be worth trying one of those first to rule out other potential issues.

[1] https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#Bypassing_the_...
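If it's useful, here's a rough sketch of how I'd check the grouping on your host (assumes the standard sysfs layout; adjust to suit). If the K80s and their PLX bridge functions do share a group, that would confirm the suspicion above. The boot parameter in the comment belongs to the out-of-tree ACS override patch from [1], which again I haven't used myself, so treat it as a pointer rather than a recipe.

# List the IOMMU group of every PLX bridge and NVIDIA device on the host
for dev in /sys/bus/pci/devices/*; do
    group=$(readlink "$dev/iommu_group" 2>/dev/null | awk -F/ '{print $NF}')
    name=$(lspci -s "${dev##*/}")
    case "$name" in
        *PLX*|*NVIDIA*) echo "IOMMU group ${group:-<none>}: $name" ;;
    esac
done
# If they share a group, the ACS override patch is typically enabled with a
# kernel boot parameter along the lines of:
#   pcie_acs_override=downstream,multifunction
# (again, only a pointer to [1], not something I've tested)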
[root@r-001 ~]# lspci | grep -i -e PLX -e nvidia
02:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
03:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
03:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
04:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
05:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
06:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
07:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
07:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
08:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
09:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
82:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
83:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
84:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
85:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
86:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
88:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
89:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)

[root@r-001 ~]# nvidia-smi topo --matrix
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx4_0  CPU Affinity
GPU0    X       PIX     PHB     PHB     SOC     SOC     SOC     SOC     SOC     0-11,24-35
GPU1    PIX     X       PHB     PHB     SOC     SOC     SOC     SOC     SOC     0-11,24-35
GPU2    PHB     PHB     X       PIX     SOC     SOC     SOC     SOC     SOC     0-11,24-35
GPU3    PHB     PHB     PIX     X       SOC     SOC     SOC     SOC     SOC     0-11,24-35
GPU4    SOC     SOC     SOC     SOC     X       PIX     PHB     PHB     PHB     12-23,36-47
GPU5    SOC     SOC     SOC     SOC     PIX     X       PHB     PHB     PHB     12-23,36-47
GPU6    SOC     SOC     SOC     SOC     PHB     PHB     X       PIX     PHB     12-23,36-47
GPU7    SOC     SOC     SOC     SOC     PHB     PHB     PIX     X       PHB     12-23,36-47
mlx4_0  SOC     SOC     SOC     SOC     PHB     PHB     PHB     PHB     X

Legend:

  X    = Self
  SOC  = Path traverses a socket-level link (e.g. QPI)
  PHB  = Path traverses a PCIe host bridge
  PXB  = Path traverses multiple PCIe internal switches
  PIX  = Path traverses a PCIe internal switch

Cheers,
Andrew

Andrew J. Younge
School of Informatics & Computing
Indiana University / Bloomington, IN USA
ajyounge@indiana.edu / http://ajyounge.com


On Tue, Sep 27, 2016 at 4:37 AM, Blair Bethwaite <blair.bethwaite@gmail.com> wrote:
Hi Andrew, hi John -
I've just started trying to get CUDA P2P working in our virtualized HPC environment. I figure this must be something you solved already in order to produce the aforementioned paper, but having read it a couple of times I don't think it provides enough detail about the guest config, hoping you can shed some light...
The issue I'm grappling with is that, despite using a qemu-kvm machine type (q35) with an emulated PCIe bus, with the P2P-capable GPUs (NVIDIA K80s) attached to that bus and nvidia-smi reporting them as sharing a PHB, the simpleP2P CUDA sample fails when checking their ability to communicate with each other. Is there some magic config I might be missing? Did you need to make any PCI-ACS changes?
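In case it helps, this is roughly what I'm doing inside the guest (the sample path assumes a default CUDA toolkit install; adjust to suit):

lspci | grep -i nvidia          # both K80 devices show up on the emulated bus
nvidia-smi topo --matrix        # reports the pair as sharing a PHB
cd /usr/local/cuda/samples/0_Simple/simpleP2P
make
./simpleP2P                     # fails at the peer-to-peer access capability check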
Best regards, Blair
On 16 March 2016 at 07:57, Blair Bethwaite <blair.bethwaite@gmail.com> wrote:
Hi Andrew,
On 16 March 2016 at 05:28, Andrew J Younge <ajyounge@indiana.edu> wrote:
point to a recent publication of ours at VEE15 titled "Supporting High Performance Molecular Dynamics in Virtualized Clusters using IOMMU, SR-IOV, and GPUDirect." In the paper we show that using Nvidia GPUs ... http://dl.acm.org/citation.cfm?id=2731194
Oooh interesting - GPUDirect too. That's something I've been wanting to try out in our environment. Will take a look at your paper...
-- Cheers, ~Blairo
-- Cheers, ~Blairo