New subject: [security-sig][nova] On reporting CPU flags that provide mitigation (to CVE flaws) as Nova 'traits'

15 May 2019

      [When replying, please keep me Cc'ed; I'm not subscribed to the list.]

Grab a cup of tea; this is a slightly long e-mail.  Some of us in the
upstream Nova channel are splitting hairs over a seemingly small issue:
a potential "security concern" in representing some CPU flags as
"traits"[0].

Context
-------

The 'os-traits' project lets you report CPU flags as "traits", e.g. to
configure a flavor to require the "AVX2" CPU flag:

    $> openstack flavor set 1 --property trait:HW_CPU_X86_AVX2=required

And here's the list of current CPU traits:

    https://github.com/openstack/os-traits/blob/master/os_traits/hw/cpu/x86.py

The other day I casually noticed that the above file is missing some
important CPU flags, and proposed a couple of small drive-by changes.
And that set off a massive bike-shedding exercise the size of Jupiter.

Some of the following flags provide mitigation for Meltdown/Spectre,
others reduce performance degradation, yet others are 'benign' features
or are in a weird zone, like the just-off-the-press (for "ZombieLoad")
CPU flag 'md-clear', which doesn't fix the vulnerability, but is a way
to report to the operating system that the kernel is opportunistically
issuing instructions (which, thankfully, already _exist_ in the CPU)
needed for the fix.

(See what all these acronyms mean in the rendered QEMU documentation
here[1]).

  - Intel : PCID, STIBP, SPEC-CTRL, SSBD, PDPE1GB, MD-CLEAR
  -  AMD  : IBPB, STIBP, VIRT-SSBD, AMD-SSBD, AMD-NO-SSB, PDPE1GB
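Going by the two trait names that appear in this mail (HW_CPU_X86_AVX2
and HW_CPU_X86_AMD_SSBD), the naming convention looks like "upper-case
the flag, turn dashes into underscores, prefix with HW_CPU_X86_".  A
minimal sketch of that convention follows.  The helper `flag_to_trait`
is hypothetical (it is not part of os-traits or Nova), and whether each
resulting trait actually exists in os-traits is exactly what is being
discussed here:

```python
# Hypothetical helper, for illustration only; not an os-traits or Nova API.
TRAIT_PREFIX = "HW_CPU_X86_"

# CPU flags as listed above (lower-case, as QEMU/libvirt spell them).
INTEL_FLAGS = ["pcid", "stibp", "spec-ctrl", "ssbd", "pdpe1gb", "md-clear"]
AMD_FLAGS = ["ibpb", "stibp", "virt-ssbd", "amd-ssbd", "amd-no-ssb", "pdpe1gb"]


def flag_to_trait(flag: str) -> str:
    """Derive a candidate trait name from a CPU flag name.

    Assumption: the convention is inferred from HW_CPU_X86_AVX2 and
    HW_CPU_X86_AMD_SSBD; the result is a *candidate* name, not proof
    that the trait is defined in os-traits.
    """
    return TRAIT_PREFIX + flag.strip().upper().replace("-", "_")


if __name__ == "__main__":
    # De-duplicate flags shared by both vendors, then print the candidates.
    for flag in sorted(set(INTEL_FLAGS) | set(AMD_FLAGS)):
        print(f"{flag:12} -> {flag_to_trait(flag)}")
```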

Two things to distinguish
-------------------------

It's worth remembering that there are two things here: (a) allowing CPU
flags via Nova's config attribute: `[libvirt]/cpu_model_extra_flags`;
and (b) allowing scheduling of instances based on CPU traits.

We're talking about case (b) here.

Possible "security issue"
-------------------------

As mentioned above, Nova has this ability to say: "DON'T land this
instance on a host if it has $TRAIT".  So, hypothetically, you can say:
"don't land on a host that has the AMD-SSBD fix", which can be done (not
yet, as the patch[2] is being discussed) in one of two ways:

  - Via placement request: `required=!HW_CPU_X86_AMD_SSBD`
  - From a flavor Extra Spec: `trait:HW_CPU_X86_AMD_SSBD=forbidden`

So, theoretically, there is scope for "exploiting" the above (though it
is non-trivial).  However, that is possible only once Nova exposes these
traits via Compute (which it doesn't yet).
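To make the relation between the two spellings concrete, here is a
minimal sketch of how `trait:<NAME>=required|forbidden` extra specs
could be folded into a single placement-style `required=` value.  Both
function names are hypothetical and this is not Nova's actual code; it
only illustrates the `!TRAIT` syntax shown above:

```python
# Hypothetical helpers, for illustration only; not Nova's implementation.
TRAIT_KEY_PREFIX = "trait:"


def traits_from_extra_specs(extra_specs):
    """Split 'trait:*' extra specs into (required, forbidden) trait lists."""
    required, forbidden = [], []
    for key, value in extra_specs.items():
        if not key.startswith(TRAIT_KEY_PREFIX):
            continue  # unrelated extra spec; ignore it
        trait = key[len(TRAIT_KEY_PREFIX):]
        if value == "required":
            required.append(trait)
        elif value == "forbidden":
            forbidden.append(trait)
        else:
            raise ValueError(f"unsupported value for {key}: {value!r}")
    return sorted(required), sorted(forbidden)


def placement_required_value(required, forbidden):
    """Build the value of 'required=': forbidden traits get a '!' prefix."""
    return ",".join(list(required) + ["!" + trait for trait in forbidden])


if __name__ == "__main__":
    specs = {
        "trait:HW_CPU_X86_AVX2": "required",
        "trait:HW_CPU_X86_AMD_SSBD": "forbidden",
    }
    req, forb = traits_from_extra_specs(specs)
    print("required=" + placement_required_value(req, forb))
```

The forbidden entry is the one at issue: it asks the scheduler to avoid
hosts that report the mitigation.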

Contention / unsolved question
------------------------------

Should we expose CPU flags (e.g. "SSBD", or "STIBP") that provide
mitigation for CPU flaws as traits, or not?  It is a "policy" decision,
and 'traits' are "forever" once they're added (well, you can
soft-deprecate them with a comment), hence all the belaboring.

There's no consensus here.  Some think that we should _not_ allow those
CPU flags as traits which can 'allow' you to target vulnerable hosts.
Some think it is okay to add these as granular CPU traits.  (Have
a gander at the discussion on this[2] change.)

Does the Security Team have any strong opinions?

On "generic" traits
-------------------

We also discussed whether it makes sense to add "generic roll-up traits"
such as 'HW_CPU_HAS_SPECTRE_CURE' and 'HW_CPU_HAS_MELTDOWN_CURE'.
While that seems intuitively appealing, the messy real world doesn't
quite allow it (as there are multiple different Spectre flaws).
So, for now we won't add these "generic" traits; they can be added
later, if we change our minds.

Next steps
----------

If there is consensus on dropping those CPU-flags-as-traits that let you
target vulnerable hosts, drop them.  And add only those CPU flags as
traits that provide either 'features' (what's the definition?) or those
that reduce performance degradation.

Otherwise, add all the required CPU flags consistently to 'os-traits',
and move on.

Another idea
------------

For "Meltdown" (and for other vulnerabilities; it should be
case-by-case, based on fix availability), we can potentially make Nova
check the 'sysfs' directory for vulnerabilities.  And if it reports
"Vulnerable" (instead of "Mitigation", as shown below):

    $> cat /sys/devices/system/cpu/vulnerabilities/meltdown
    Mitigation: PTI

Then we can log a warning in the current release that the host is
vulnerable and that a future Nova will refuse to run VMs on it, and
then, in the next release, make that refusal mandatory.

Likewise for "Spectre":

    $> grep . /sys/devices/system/cpu/vulnerabilities/spectre_*
    /sys/devices/system/cpu/vulnerabilities/spectre_v1:Mitigation: __user pointer sanitization
    /sys/devices/system/cpu/vulnerabilities/spectre_v2:Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
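A rough sketch of what such a check could look like, assuming only that
each file under the sysfs directory holds a single status line that
starts with "Not affected", "Vulnerable" or "Mitigation".  The function
names and the `enforce` switch are hypothetical; this is not existing
Nova code, just the "warn now, refuse later" idea spelled out:

```python
import logging
from pathlib import Path

LOG = logging.getLogger(__name__)

SYSFS_VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_vulnerabilities(vuln_dir=SYSFS_VULN_DIR):
    """Return {name: status line} for every file in the sysfs directory."""
    vuln_dir = Path(vuln_dir)
    if not vuln_dir.is_dir():
        return {}  # older kernel or non-Linux host: nothing to report
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(vuln_dir.iterdir())
        if entry.is_file()
    }


def vulnerable_entries(status):
    """Names whose status line starts with 'Vulnerable'."""
    return sorted(
        name for name, line in status.items() if line.startswith("Vulnerable")
    )


def check_host(vuln_dir=SYSFS_VULN_DIR, enforce=False):
    """Warn about vulnerable entries; refuse (raise) only when enforce=True.

    'enforce' is a hypothetical switch standing in for the "next
    release" behaviour described in this mail.
    """
    bad = vulnerable_entries(read_vulnerabilities(vuln_dir))
    for name in bad:
        LOG.warning("Host reports it is vulnerable to %s", name)
    if bad and enforce:
        raise RuntimeError("Vulnerable host: " + ", ".join(bad))
    return bad


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(check_host() or "no 'Vulnerable' entries found")
```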

Some think this is not "Nova's business", because: "just like how you
don't want to stop based on CPU fan speed or temperature or firmware
patch levels ...".  But that argument doesn't quite apply, as CPU fan
speed and temperature are very different, and are not seen by the guest.
If you take security seriously, it _is_ fair game, IMHO, to make Nova
warn about (and then stop) launching instances on Compute hosts with
vulnerable hypervisors.  But my umbilical cord isn't tied to this idea;
I just wanted to mention it for completeness' sake.

[0] https://github.com/openstack/os-traits/
[1] https://qemu.weilnetz.de/doc/qemu-doc.html#important_005fcpu_005ffeatures_00...
[2] https://review.opendev.org/#/c/655193/
    Add CPU traits for Meltdown/Spectre mitigation

-- 
/kashyap

Kashyap Chamarthy
