[Openstack-operators] outstanding issues with GPU passthrough
blair.bethwaite at monash.edu
Tue Mar 20 08:04:33 UTC 2018
I forgot to specifically address one of the questions that Belmiro raised,
which is regarding device clean-up. I guess this would be relevant to
Ironic bare-metal clouds too.
If the hypervisor has blocked access to problematic areas of the PCI config
space then this probably isn't necessary, but as I mentioned below, this
isn't happening with QEMU/KVM today.
I asked an NVIDIAn about forcing a firmware flash on the device as a
possible means to ensure the firmware is correct (something that could be
done between device allocations by some management layer like Cyborg). He
told me this would definitely not be recommended because of the risk of
bricking the device; besides, I couldn't find any tools that do this. Apparently the
firmware is signed. However there doesn't seem to be any publicly available
technical detail on the signing process, so I don't know whether it enables
the device to verify the source of a firmware write, or if it's just
something that NVIDIA's own drivers check by reading the firmware ROM.
On Tue., 20 Mar. 2018, 17:47 Blair Bethwaite, <blair.bethwaite at monash.edu> wrote:
> Hi all,
> This has turned into a bit of a screed I'm afraid...
> Last week I had pings from both the folks at CERN and Catalyst Cloud about
> GPU-accelerated OpenStack instances with PCI passthrough, specifically
> asking about the security issues I've mentioned in community forums
> previously, and any other gotchas. So I figured I'd attempt to answer what
> I can on-list and hopefully others will share too. I'll also take an action
> to propose a Forum session something along the lines of "GPUs - state of
> the art & practice" where we can come together in Vancouver to discuss
> further. Just writing this has prompted me to pick up where I last left off
> on this - currently looking for some upstream QEMU expertise...
> Firstly, there is this wiki page: https://wiki.openstack.org/wiki/GPUs,
> which should be relevant but could be expanded based on this discussion.
> **Issues / Gotchas**
> These days for the most part things just seem to work if you follow the
> basic docs and then ensure the usual things required with vfio-pci, like
> making sure no other host drivers are bound to the device/s. These are the
> basics which have been covered in a number of Summit presentations, most
> recently see
> for a good overview. This blog http://www.vscaler.com/gpu-passthrough/,
> although a little dated, is still relevant. Perhaps it would be worth
> adding these things to docs or wiki.
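The "no other host drivers bound" check mentioned above can be scripted against sysfs. What follows is only a minimal sketch: the function names and the `sysfs_root` parameter are made up for illustration (the parameter exists so the code can be exercised against a fake tree), and it assumes the standard Linux layout where `/sys/bus/pci/devices/<addr>/driver` is a symlink to the bound driver.

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys"):
    """Return the name of the host driver bound to a PCI device, or None.

    pci_addr is a full address such as "0000:03:00.0".
    """
    dev_dir = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr)
    if not os.path.isdir(dev_dir):
        raise FileNotFoundError(f"no such PCI device: {pci_addr}")
    link = os.path.join(dev_dir, "driver")
    if not os.path.islink(link):
        return None  # nothing is bound to the device
    return os.path.basename(os.readlink(link))


def ready_for_passthrough(pci_addr, sysfs_root="/sys"):
    """True if the device is unbound or already held by vfio-pci."""
    return bound_driver(pci_addr, sysfs_root) in (None, "vfio-pci")
```

Calling `ready_for_passthrough("0000:03:00.0")` on a hypervisor (the address is an example) should return False if something like nouveau has claimed the GPU.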
> One issue that we've hit a couple of times now, most recently only last
> week, is with apparmor on Ubuntu being too restrictive when the passthrough
> device needs to be reattached post-snapshot. This has been discussed
> on-list in the past - see "[Openstack-operators] PCI Passthrough issues"
> for a good post-mortem from Jon at MIT. So far I'm not sure if this most
> recent incarnation is due to a new bug with newer cloudarchive hypervisor
> stack, or because we have stale templates in our Puppet that are
> overwriting something we should be getting from the package-shipped
> apparmor rules - if it turns out to be a new bug we'll report upstream...
> Perhaps more concerning for new deployers today (assuming deep-learning is
> a major motivator for adopting this capability) is that GPU P2P doesn't
> work inside a typical guest instance with multiple GPUs passed through.
> That's because the emulated flat PCI (not even PCIe) topology inside the
> guest will make the device drivers think this isn't possible. However, GPU
> clique support was added to QEMU 2.11 in
> There's no Libvirt support for this yet, so I'd expect it to be at least
> another couple of cycles before we might see this hitting Nova. In any
> case, we are about to start kicking the tires on it and will report back.
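For anyone wanting to try the clique support by hand before Libvirt or Nova know about it, the sketch below builds the relevant QEMU arguments. It assumes the experimental vfio-pci property is named `x-nv-gpudirect-clique` and takes a 4-bit ID (0-15), which is my understanding of the QEMU 2.11 change; the helper name and the host addresses are purely illustrative, so check the QEMU docs for your version.

```python
def vfio_device_args(host_addrs, clique=None):
    """Build QEMU "-device vfio-pci" arguments for a list of host PCI addresses.

    If clique is given, every device is placed in the same GPUDirect clique,
    which is how the guest driver is told that P2P between them is allowed.
    """
    if clique is not None and not 0 <= clique <= 15:
        raise ValueError("clique ID must fit in 4 bits (0-15)")
    args = []
    for addr in host_addrs:
        opt = f"vfio-pci,host={addr}"
        if clique is not None:
            opt += f",x-nv-gpudirect-clique={clique}"
        args += ["-device", opt]
    return args
```

For example, `vfio_device_args(["02:00.0", "03:00.0"], clique=0)` yields two `-device` options that both carry `x-nv-gpudirect-clique=0`.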
> The big issue that's inherent with PCI passthrough is that you have to
> give up the whole device (and arguably whole server if you are really
> concerned about security). There is also potential complexity with switched
> PCIe topologies, which you're likely to encounter on any host with more
> than a couple of GPUs - if you have the "wrong" PCIe chipset then you may not
> be able to properly isolate the devices from each other. I believe the
> hyperscalers may use PCIe switching fabrics with external device housing,
> as opposed to GPUs directly inside the host. They have gone to some effort
> to ensure things like PCIe Address Translation Services (ATS) get turned
> off - ATS is basically an IOMMU bypass cache on the device used to speed up
> DMA; if that were exploitable on any particular device it could then allow
> reading of arbitrary host memory. See e.g.
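On the isolation point specifically: one rough way to see whether a given chipset or switch topology lets the GPUs be separated is to look at the kernel's IOMMU groups, since devices that share a group cannot be assigned independently. This is a sketch only, assuming the usual `/sys/kernel/iommu_groups/<n>/devices/<addr>` layout; the function names and the `sysfs_root` parameter are illustrative.

```python
import os


def iommu_groups(sysfs_root="/sys"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    base = os.path.join(sysfs_root, "kernel", "iommu_groups")
    groups = {}
    for group in sorted(os.listdir(base), key=int):
        devices = os.listdir(os.path.join(base, group, "devices"))
        groups[int(group)] = sorted(devices)
    return groups


def group_mates(pci_addr, groups):
    """Return the other devices that share an IOMMU group with pci_addr."""
    for members in groups.values():
        if pci_addr in members:
            return [m for m in members if m != pci_addr]
    return []
```

A GPU whose `group_mates` list contains another GPU (or anything other than its own audio function and the like) is a hint that the topology is not isolating them.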
> Further on the security front, it's important to note that the PCIe specs
> largely predate our highly virtualised cloud world. Even extensions like
> SR-IOV are comparatively old, and it isn't actually implemented for any
> GPU of interest today (have the Firepros ever come out from behind the
> marketing curtain?). Device drivers assume root-privileged code has system
> level access to the hardware. There are a bunch of low-level device control
> registers exposed through the device's PCI BAR0 config space - my
> understanding is that there are a few sets of those registers that a guest
> OS has no business accessing, e.g., power control, compatibility
> interrupts, bus resets. Many of these could at least allow a malicious
> guest to brick the device and/or cause the whole host to reset.
> Unfortunately none of that information is in the public domain save for
> what the Envy project has managed to reverse engineer:
> https://envytools.readthedocs.io/en/latest/hw/bus/pci.html, quote: "Todo
> nuke this file and write a better one - it sucks". Attempts to cajole
> NVIDIA into releasing info on this have been largely unsuccessful, but I am
> at least aware they have analysed these issues and given technical guidance
> on risk mitigations to partners.
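To at least locate the BAR0 region being discussed, the standard PCI configuration header can be read from sysfs (`/sys/bus/pci/devices/<addr>/config`). The sketch below parses only the generic type 0 header fields defined by the PCI spec; it says nothing about the vendor-specific registers behind BAR0, which are exactly the undocumented part. The function name and returned keys are illustrative.

```python
import struct


def parse_config_header(cfg):
    """Parse vendor/device IDs and the six raw BAR values from a type 0 header.

    cfg is the raw bytes of PCI config space (at least the first 0x28 bytes).
    """
    if len(cfg) < 0x28:
        raise ValueError("need at least 0x28 bytes of config space")
    vendor, device = struct.unpack_from("<HH", cfg, 0x00)
    bars = list(struct.unpack_from("<6I", cfg, 0x10))
    return {
        "vendor": vendor,
        "device": device,
        "bars": bars,
        # Bit 0 of a BAR is 0 for a memory BAR and 1 for an I/O port BAR.
        "bar0_is_memory": (bars[0] & 1) == 0,
    }
```

On a real host, something like `parse_config_header(open("/sys/bus/pci/devices/0000:03:00.0/config", "rb").read())` (example address) should report vendor 0x10de for an NVIDIA device.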
> We haven't solved any of this at Monash, but we're also not running a
> public cloud and only have a limited set of internal tenants/projects that
> have access to our GPU flavors, so it's not a big risk to us at the moment.
> As far as trying to lock some of this down goes, the good news is that QEMU
> appears to have an existing mechanism in place to block/intercept accesses
> to these control registers/windows (see hw/vfio/pci-quirks.c). So it's a
> matter of getting a reference that can be used to add the appropriate
> If the GPU clique support works for P2P that will be great. But at least
> from NVIDIA's side it seems that the Linux mdev based vGPU is the way
> forward (you can still have a 1:1 vGPU:pGPU allocation for heavy
> workloads). Last I heard, we could expect a Linux host-side driver for this
> within a month or so. There is at least one interesting architectural
> complication inherent in the vGPU licensing model though, which is that the
> guest vGPU drivers will need to be able to reach the vGPU license server/s, which
> necessarily requires some link between tenant and provider networks.
> Haven't played with any of this first-hand yet so not sure how problematic
> (or not) it might be.
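Once a host-side mdev driver is loaded, the vGPU types a physical GPU offers should be discoverable through the kernel's generic mdev sysfs interface. A minimal sketch, assuming the usual `/sys/class/mdev_bus/<addr>/mdev_supported_types/<type>/available_instances` layout; the function name, the `sysfs_root` parameter and any type names you see in examples are illustrative rather than taken from NVIDIA's driver.

```python
import os


def mdev_types(parent_addr, sysfs_root="/sys"):
    """Map each supported mdev type on a parent device to its free instance count."""
    base = os.path.join(
        sysfs_root, "class", "mdev_bus", parent_addr, "mdev_supported_types"
    )
    result = {}
    for type_id in sorted(os.listdir(base)):
        path = os.path.join(base, type_id, "available_instances")
        with open(path) as fh:
            result[type_id] = int(fh.read().strip())
    return result
```

A type reporting a single available instance on an otherwise idle GPU would correspond to the 1:1 vGPU:pGPU allocation mentioned above.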
> Anyway, hopefully all this is useful in some way. Perhaps if we get enough
> customers pressuring NVIDIA SAs to disclose the PCIe security info, it
> might get us somewhere on the road to securing passthrough.
> Blair Bethwaite
> Senior HPC Consultant
> Monash eResearch Centre
> Monash University
> Room G26, 15 Innovation Walk, Clayton Campus
> Clayton VIC 3800
> Mobile: 0439-545-002
> Office: +61 3-9903-2800