[Openstack-operators] GPU passthrough success and failure records

Blair Bethwaite blair.bethwaite at gmail.com
Thu Aug 10 05:35:35 UTC 2017


Hi folks,

Related to this, I wonder if anyone has ever seen something like a PCI
bus error on a GPU node? We have a fleet of Dell R730s with dual K80s,
and we periodically see a host reset with the hardware log recording a
message like:
"A fatal error was detected on a component at bus 4 device 8 function 0."

Which in this case refers to:
$ lspci -t -d 10b5:8747
-+-[0000:82]---00.0-[83-85]--+-08.0-[84]--
 |                           \-10.0-[85]--
 +-[0000:03]---00.0-[04-06]--+-08.0-[05]--
 |                           \-10.0-[06]--

That is, one of the downstream(?) endpoint-facing PCIe ports, i.e., the
GPU side of the PCIe switch.
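
For anyone wanting to cross-reference a hardware log entry against the
lspci tree, here is a minimal sketch of the mapping. It assumes the
numbers in the log message are decimal (I have not confirmed that) and
that the PCI domain is 0000; the function name is just illustrative.

```python
import re


def log_message_to_bdf(message, domain=0):
    """Turn 'bus 4 device 8 function 0' into an lspci-style address.

    Assumption: the log prints bus/device/function as decimal numbers.
    """
    match = re.search(
        r"bus\s+(\d+)\s+device\s+(\d+)\s+function\s+(\d+)", message
    )
    if match is None:
        raise ValueError("no bus/device/function triple found")
    bus, device, function = (int(field) for field in match.groups())
    # lspci prints domain:bus:device.function in hexadecimal.
    return "{:04x}:{:02x}:{:02x}.{:x}".format(domain, bus, device, function)


msg = "A fatal error was detected on a component at bus 4 device 8 function 0."
print(log_message_to_bdf(msg))  # 0000:04:08.0
```

The resulting address is bus 04, device 08.0, which is the first port
under the [04-06] range in the tree above, and it can be inspected
further with something like "lspci -vv -s 04:08.0".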

This error causes the host to reset unceremoniously. There is no error
to be found anywhere host-side, just the hardware log. These are
currently Ubuntu Trusty hosts with the 4.4 kernel. GPU burn testing does
not seem to trigger it, and an affected host can go back into production
and never (so far) see the issue again. But we've now seen this about 10
times over the last 12-18 months across a fleet of ~30 of these hosts
(sometimes twice on the same host, months apart, but on several distinct
hosts overall).
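
To put a rough number on how rare this is, here is a back-of-envelope
sketch using the approximate figures above. It assumes all ~30 hosts
were in service for the whole window, which may not be exactly true.

```python
def resets_per_host_year(events, hosts, months):
    """Average number of resets per host per year of service."""
    host_years = hosts * months / 12.0
    return events / host_years


# About 10 events, ~30 hosts, over either 12 or 18 months.
for months in (12, 18):
    rate = resets_per_host_year(10, 30, months)
    print("{} months: {:.2f} resets per host-year".format(months, rate))
```

That works out to somewhere between roughly 0.2 and 0.3 resets per
host-year, i.e., on the order of one reset per host every three to
five years.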

Cheers,

On 7 May 2017 at 07:55, Blair Bethwaite <blair.bethwaite at gmail.com> wrote:
> Hi all,
>
> I've been (very slowly) working on some docs detailing how to set up an
> OpenStack Nova Libvirt+QEMU-KVM deployment to provide GPU-accelerated
> instances. In Boston I hope to chat to some of the docs team and
> figure out an appropriate upstream guide to fit that into. One of the
> things I'd like to provide is a community record (better than ML
> archives) of what works and doesn't. I've started a first attempt at
> collating some basics here:
> https://etherpad.openstack.org/p/GPU-passthrough-model-success-failure
>
> I know there are at least a few lurkers out there doing this too, so
> please share your own experience. Once there is a bit more data there,
> it probably makes sense to convert to a tabular format of some kind
> (but it wasn't immediately obvious to me how that should look, given
> there are several long list fields).
>
> --
> Cheers,
> ~Blairo



-- 
Cheers,
~Blairo


