[Openstack-operators] outstanding issues with GPU passthrough
blair.bethwaite at monash.edu
Tue Mar 20 06:47:05 UTC 2018
This has turned into a bit of a screed I'm afraid...
Last week I had pings from both the folks at CERN and Catalyst Cloud about
GPU-accelerated OpenStack instances with PCI passthrough, specifically
asking about the security issues I've mentioned in community forums
previously, and any other gotchas. So I figured I'd attempt to answer what
I can on-list and hopefully others will share too. I'll also take an action
to propose a Forum session something along the lines of "GPUs - state of
the art & practice" where we can come together in Vancouver to discuss
further. Just writing this has prompted me to pick up where I last left off
on this - currently looking for some upstream QEMU expertise...
Firstly, there is this wiki page: https://wiki.openstack.org/wiki/GPUs,
which should be relevant but could be expanded based on this discussion.
**Issues / Gotchas**
These days, for the most part, things just work if you follow the basic
docs and then take care of the usual vfio-pci requirements, like making
sure no other host drivers are bound to the device/s. These basics have
been covered in a number of Summit presentations; most recently, see
https://www.openstack.org/videos/sydney-2017/not-only-for-miners-gpu-integration-in-nova-environment
for a good overview. This blog post, http://www.vscaler.com/gpu-passthrough/,
although a little dated, is
still relevant. Perhaps it would be worth adding these things to docs or
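
On the "no other host drivers bound" point, the check is easy to script. A
minimal sketch, assuming the standard Linux sysfs layout where
/sys/bus/pci/devices/<addr>/driver is a symlink to the bound driver. The
function names and the sysfs_root parameter are mine, purely for
illustration, and the PCI addresses in the usage note are made up:

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys"):
    """Return the name of the host driver bound to a PCI device, or None."""
    link = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "driver")
    if not os.path.islink(link):
        return None  # nothing bound (or the device does not exist)
    # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
    return os.path.basename(os.readlink(link))


def conflicting_drivers(pci_addrs, sysfs_root="/sys"):
    """Map each device held by a driver other than vfio-pci to that driver.

    Unbound devices are not reported: only a competing host driver
    (nouveau, nvidia, snd_hda_intel, ...) gets in the way of passthrough.
    """
    conflicts = {}
    for addr in pci_addrs:
        driver = bound_driver(addr, sysfs_root)
        if driver not in (None, "vfio-pci"):
            conflicts[addr] = driver
    return conflicts
```

Something like conflicting_drivers(["0000:04:00.0", "0000:04:00.1"]) on the
hypervisor should come back empty before you start a guest. Remember to
include the GPU's HDMI audio function, which is a common culprit.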
One issue that we've hit a couple of times now, most recently only last
week, is AppArmor on Ubuntu being too restrictive when the passthrough
device needs to be reattached post-snapshot. This has been discussed
on-list in the past - see "[Openstack-operators] PCI Passthrough issues"
for a good post-mortem from Jon at MIT. So far I'm not sure whether this
most recent incarnation is due to a new bug in the newer cloudarchive
hypervisor stack, or to stale templates in our Puppet overwriting
something we should be getting from the package-shipped AppArmor rules -
if it turns out to be a new bug we'll report it upstream...
Perhaps more concerning for new deployers today (assuming deep learning is
a major motivator for adopting this capability) is that GPU peer-to-peer
(P2P) doesn't work inside a typical guest instance with multiple GPUs
passed through. That's because the emulated flat PCI (not even PCIe)
topology inside the guest leads the device drivers to conclude that P2P
isn't possible. However, GPU
clique support was added to QEMU 2.11 in
There's no libvirt support for this yet, so I'd expect it to be at least
another couple of cycles before we see this hitting Nova. In any
case, we are about to start kicking the tires on it and will report back.
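
For anyone wanting to try the clique support by hand in the meantime, here
is a rough sketch of building the QEMU arguments. It assumes the
experimental vfio-pci property is spelled x-nv-gpudirect-clique and takes a
4-bit ID (0-15), which is my recollection of the QEMU 2.11 change and worth
verifying against your QEMU build. The helper name and the PCI addresses
are hypothetical:

```python
def vfio_clique_args(gpu_cliques):
    """Build QEMU -device arguments that tag passed-through GPUs with a clique.

    gpu_cliques maps a host PCI address to a clique ID. GPUs given the same
    ID are advertised to the guest driver as able to do P2P with each other.
    The property name and the 0-15 range are assumptions to double-check.
    """
    args = []
    for host, clique in sorted(gpu_cliques.items()):
        if not 0 <= clique <= 15:
            raise ValueError(f"clique ID {clique} for {host} is outside 0-15")
        args += ["-device", f"vfio-pci,host={host},x-nv-gpudirect-clique={clique}"]
    return args
```

Only put GPUs in the same clique if P2P between them actually works on the
host, e.g., they sit behind the same PCIe switch or root complex. As far as
I understand it, QEMU takes your word for it.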
The big issue that's inherent with PCI passthrough is that you have to give
up the whole device (and arguably the whole server if you are really
concerned about security). There is also potential complexity with switched
PCIe topologies, which you're likely to encounter on any host with more
than a couple of GPUs - if you have the "wrong" PCIe chipset then you may
not be able to properly isolate the devices from each other. I believe the
hyperscalers may use PCIe switching fabrics with external device housing,
as opposed to GPUs directly inside the host. They have gone to some effort
to ensure things like PCIe Address Translation Services (ATS) get turned
off. ATS is basically an IOMMU bypass cache on the device, used to speed up
DMA; if that were exploitable on any particular device it could allow
reading arbitrary host memory. See e.g.
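
On the isolation point specifically, you can see how well your chipset
separates devices by looking at the IOMMU groups the kernel has formed. A
minimal sketch, assuming the usual /sys/kernel/iommu_groups/<n>/devices/
layout. The function names and the root parameter are hypothetical, added
only so it can be pointed at a test directory:

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_id: sorted list of PCI addresses in that group}."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # IOMMU disabled or not supported on this host
    for group_id in os.listdir(root):
        dev_dir = os.path.join(root, group_id, "devices")
        if os.path.isdir(dev_dir):
            groups[group_id] = sorted(os.listdir(dev_dir))
    return groups


def group_peers(pci_addr, groups):
    """Return the other devices sharing an IOMMU group with pci_addr."""
    for members in groups.values():
        if pci_addr in members:
            return [m for m in members if m != pci_addr]
    raise KeyError(f"{pci_addr} is not in any IOMMU group")
```

A GPU sharing a group with its own audio function is expected. Two GPUs (or
a GPU and some unrelated device) landing in the same group means they can't
be handed to different guests safely, which is the "wrong chipset" case.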
Further on the security front, it's important to note that the PCIe specs
largely predate our highly virtualised cloud world. Even extensions like
SR-IOV are comparatively old, and SR-IOV isn't actually implemented for any
GPU of interest today (have the FirePros ever come out from behind the
marketing curtain?). Device drivers assume root-privileged code has
system-level access to the hardware. There are a bunch of low-level device
control registers exposed through the device's PCI BAR0 config space - my
understanding is that there are a few sets of those registers that a guest
OS has no business accessing, e.g., power control, compatibility
interrupts, bus resets. Many of these could at least allow a malicious
guest to brick the device and/or cause the whole host to reset.
Unfortunately none of that information is in the public domain save for
what the Envy project has managed to reverse engineer:
https://envytools.readthedocs.io/en/latest/hw/bus/pci.html, quote: "Todo
nuke this file and write a better one - it sucks". Attempts to cajole
NVIDIA into releasing info on this have been largely unsuccessful, but I am
at least aware they have analysed these issues and given technical guidance
on risk mitigations to partners.
We haven't solved any of this at Monash, but we're also not running a
public cloud and only have a limited set of internal tenants/projects that
have access to our GPU flavors, so it's not a big risk to us at the moment.
As far as trying to lock some of this down goes, the good news is that QEMU
appears to have an existing mechanism in place to block/intercept accesses
to these control registers/windows (see hw/vfio/pci-quirks.c). So it's a
matter of getting a reference that can be used to add the appropriate
If the GPU clique support works for P2P that will be great. But at least
from NVIDIA's side it seems that the Linux mdev based vGPU is the way
forward (you can still have a 1:1 vGPU:pGPU allocation for heavy
workloads). Last I heard, we could expect a Linux host-side driver for this
within a month or so. There is at least one interesting architectural
complication inherent in the vGPU licensing model though: the guest vGPU
drivers will need to be able to reach the vGPU license server/s, which
necessarily requires some link between tenant and provider networks. I
haven't played with any of this first-hand yet, so I'm not sure how
problematic (or not) it might be.
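
Once a host-side driver does land, the generic kernel mdev sysfs interface
should let you see which vGPU types a physical GPU offers. An untested
sketch, assuming the layout described in the kernel's mediated-device
documentation (/sys/class/mdev_bus/<parent>/mdev_supported_types/<type>/).
The function names are mine, and the type IDs a vendor driver would
actually expose are unknown to me:

```python
import os


def _read(path):
    """Return the stripped contents of a sysfs attribute, or None if absent."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def mdev_types(parent_addr, sysfs_root="/sys"):
    """Return {type_id: {"name": ..., "available_instances": int}}."""
    base = os.path.join(sysfs_root, "class", "mdev_bus", parent_addr,
                        "mdev_supported_types")
    types = {}
    if not os.path.isdir(base):
        return types  # no mdev-capable driver bound to this device
    for type_id in sorted(os.listdir(base)):
        type_dir = os.path.join(base, type_id)
        types[type_id] = {
            # "name" is an optional attribute, so it may come back as None
            "name": _read(os.path.join(type_dir, "name")),
            "available_instances": int(
                _read(os.path.join(type_dir, "available_instances")) or 0
            ),
        }
    return types
```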
Anyway, hopefully all this is useful in some way. Perhaps if we get enough
customers pressuring NVIDIA SAs to disclose the PCIe security info, it
might get us somewhere on the road to securing passthrough.
Senior HPC Consultant
Monash eResearch Centre
Room G26, 15 Innovation Walk, Clayton Campus
Clayton VIC 3800
Office: +61 3-9903-2800