Re: On reporting CPU flags that provide mitigation (to CVE flaws) as Nova 'traits'
On Wed, May 15, 2019 at 11:49:03AM +0100, Sean Mooney wrote:
On Wed, 2019-05-15 at 11:24 +0200, Kashyap Chamarthy wrote:
[...]
Contention / unsolved question
------------------------------
Whether we should expose CPU flags (e.g. "SSBD", or "STIBP") that provide mitigation from CPU flaws as traits or not? It is a "policy" decision, and the 'traits' are "forever" (well, you can soft-deprecate them with a comment) once they're added, hence all the belaboring.
There's no consensus here. Some think that we should _not_ allow those CPU flags as traits which can 'allow' you to target vulnerable hosts.
For what it's worth, I'm in this camp and have said so in other places where we have been discussing it.
Yep, noted.
Some think it is okay to add these as granular CPU traits. (Have a gander at the discussion on this[2] change.)
Does the Security Team have any strong opinions?
[...]
Next steps
----------
If there is consensus on dropping those CPU-flags-as-traits that let you target vulnerable hosts, drop them. And add only those CPU flags as traits that provide either 'features' (what's the definition?) or those that reduce performance degradation.
My vote is for only adding traits for CPU features.
Noted; I'd like to hear other opinions. (And note that the word "feature" can get fuzzy in this context, I'll assume we're using it somewhat loosely to include things that help with reducing perf degradation, etc.)
PCID is a CPU feature that was designed as a performance optimisation
... except that "feature" was a 'no-op' and it wasn't even _used_, until Linux 4.14 enabled it (in November 2017) for Meltdown mitigation. So the presence of PCID in the hardware didn't matter one whit all these decades. (Source: http://archive.is/ma8Iw.)
and several generations later was also found to be useful in reducing the performance impact of the Spectre mitigation
Nit: Not Spectre, but Meltdown. [...]
Some think this is not "Nova's business", because: "just like how you don't want to stop based on CPU fan speed or temperature or firmware patch levels ...".
i think it applies perfectly.
It's a matter of scope. To be clear — I'm not "insisting" that it be done in Nova. Just thinking out loud. [...]
From a product perspective, vendors should ensure that they provide tooling and software updates that are secure by default
"Product perspective" is irrelevant here. Of course, it's obvious that vendors "should" provide the relevant tooling and software updates.
But that argument doesn't quite apply, as CPU fan/speed are very different, and are not seen by the guest. If you take security seriously, it _is_ fair game, IMHO, to make Nova warn (then stop) launching instances on Compute hosts with vulnerable hypervisors.
Correcting myself: Okay, "stopping" / "refusing to launch" is too strict and unreasonable; scratch that. (Because, as discussed before, there _are_ valid cases to be made that certain admins/operators will intentionally run on vulnerable hypervisors, e.g. because their CPUs are too old to receive microcode updates. Or they may deliberately tolerate this risk, as they know their risk policy. Or they're running staging envs, or any number of other reasons.)
The same argument could be applied to QEMU or libvirt.
No, that argument does not apply to QEMU or libvirt. Why? QEMU and libvirt are low-level primitives. They explicitly state that they don't, and will not, make such "policy" decisions. But Nova, as a management tool, _does_ make some policy decisions (e.g. how we generate a libvirt guest XML based on certain criteria, and others). And in this case, Nova _can_ take a stance that "orchestration tools" should do that — that's perfectly acceptable. [...] -- /kashyap
(NB: I'm explicitly rendering "no opinion" on several items below so you know I didn't miss/ignore them.)
The other day I casually noticed that the above file is missing some important CPU flags
I think this is noteworthy. These traits are being proposed because you casually noticed they were missing, not because someone asked for them. We can invent use cases, but without demand we may just be spinning our wheels.
So, theoretically there is scope for "exploiting" (but non-trivial)
it is trivial all you would have to do is
I'm not a security guy, but I'm pretty sure it doesn't matter whether it's trivial; if it's possible at all, that's bad. That being the case, you don't even have to be able to target a vulnerable host for it to be a security problem. If my cloud is set up so that Joe Hacker is able to land his instance on a vulnerable host even by randomly trying, I done effed up already.
There's no consensus here. Some think that we should _not_ allow those CPU flags as traits which can 'allow' you to target vulnerable hosts.
For what it's worth, I'm in this camp and have said so in other places where we have been discussing it.
Yep, noted.
My position is that it's not harmful to add them to os-traits; it's whether/how they're used in nova that needs some thought.
Does the Security Team have any strong opinions?
Still hoping someone speaks up in this capacity...
If there is consensus on dropping those CPU-flags-as-traits that let you target vulnerable hosts, drop them. And add only those CPU flags as traits that provide either 'features' (what's the definition?) or those that reduce performance degradation.
My vote is for only adding traits for CPU features.
Noted; I'd like to hear other opinions. (And note that the word "feature" can get fuzzy in this context, I'll assume we're using it somewhat loosely to include things that help with reducing perf degradation, etc.)
I abstain. Once again, presence in os-traits is harmless; use by nova is subject to further discussion. But we also don't have any demand (that I'm aware of). However, I'll state again for the record that vendor-specific "positive" traits (indicating "has mitigation", "not vulnerable", etc.) are nigh worthless for the Nova scheduling use case of "land me on a non-vulnerable host" because, until you can say required=in:HW_CPU_X86_INTEL_FIX,HW_CPU_X86_AMD_FIX, you would have to pick your CPU vendor ahead of time.
PCID is a CPU feature that was designed as a performance optimisation
I'm staying well away from the what-is-a-feature discussion, mainly out of ignorance.
Some think this is not "Nova's business", because: "just like how you don't want to stop based on CPU fan speed or temperature or firmware patch levels ...".
IMO this (cpu flags/features/attributes, and even possibly firmware patch levels, though probably not fan speed or temperature) is a perfectly suitable use of traits. Not all traits have to feed into Nova scheduling decisions; they could also be used by e.g. external orchestrators. os-traits needs to have that more global not-just-Nova perspective. (Disclaimer: I'm a card-carrying "trait libertarian": freedom to do what makes sense with traits, as long as you're not hurting anyone and it's not costing the taxpayers.)
Okay, "stopping" / "refusing to launch" is too strict and unreasonable; scratch that.
I agree with this, for all the reasons stated.
we can potentially make Nova check the 'sysfs' directory for vulnerabilities.
IMO this is still a good idea, but rather than warning / refusing to boot, we could expose a roll-up trait, subject to the strawman design below.

To summarize my position on the os-traits side of things:

- We can merge the feature-ish traits (assuming folks can agree on which ones those are).
- We can merge the vulnerability traits as long as they come with nice comments explaining the potential security pitfalls around using them.
- Or for all I care we can merge nothing, since we don't actually seem to have a demand for it.

==========================

I'm going to dive into Nova-land now.

The below would need a blueprint and a spec. And an owner. And it would be nice if it also had demand.

If we want to make scheduling decisions based on vulnerabilities, it needs to be under the exclusive control of the admin. As mentioned above, exposing the traits and allowing untrusted/untrustworthy users to target vulnerable hosts is only marginally worse than having those vulnerable hosts available to said untrusted users at all. So if we are going to have virt drivers expose a VULNERABLE trait in any form, it should come with:

1) a config option in the spirit of:

   [scheduler]
   allow_scheduling_to_vulnerable_hosts = $bool (default: False)

   which, when False, causes the scheduler to add trait:VULNERABLE=forbidden to *all* GET /a_c requests.

   But we should generalize this to:

   (a) Maintain a hardcoded list of traits that represent vulnerabilities or other undesirables
   (b) Have the conf option be [scheduler]evil_trait_whitelist
   (c) Add [trait:$X=forbidden for $X in {(b) - (a)}]

2) a hard check to disallow trait:$X=required from *anywhere* (flavor, image, etc.) regardless of the conf option. Either reject the boot request or explicitly strip that out.

For completeness, note that these traits need to be "negative" (i.e. "has vulnerability") so that we can forbid them in a list in the GET /a_c request. Because required=!INTEL_VULNERABLE,!AMD_VULNERABLE will correctly avoid vulnerable hosts from either vendor, but required=INTEL_FIXED,AMD_FIXED won't land anywhere, and we don't have required=in:INTEL_FIXED,AMD_FIXED yet.

efried
.
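A minimal sketch of the strawman above, in Python, may make the "negative traits" point easier to follow. It is a toy model, not Nova or placement code: the trait names are the hypothetical ones from the email, `candidates()` only imitates the required/forbidden semantics of a GET /a_c request, and the set difference in (c) is read here as the hardcoded list (a) minus the whitelist (b), which seems to be the intent.

```python
# Hypothetical trait names taken from the email; none of them exist in os-traits.
KNOWN_BAD_TRAITS = frozenset({"INTEL_VULNERABLE", "AMD_VULNERABLE"})  # item (a)


def forbidden_traits(whitelist):
    """Item (c): forbid every known-bad trait the operator did not whitelist."""
    return KNOWN_BAD_TRAITS - frozenset(whitelist)


def strip_required(required):
    """Item 2: drop any known-bad trait that a flavor or image tried to require."""
    return {trait for trait in required if trait not in KNOWN_BAD_TRAITS}


def candidates(hosts, required=(), forbidden=()):
    """Toy stand-in for trait filtering in an allocation candidates request.

    A host matches only if it has every required trait and no forbidden one.
    """
    required, forbidden = set(required), set(forbidden)
    return sorted(
        name
        for name, traits in hosts.items()
        if required <= traits and not (forbidden & traits)
    )


hosts = {
    "intel-patched": {"INTEL_FIXED"},
    "intel-old": {"INTEL_VULNERABLE"},
    "amd-patched": {"AMD_FIXED"},
    "amd-old": {"AMD_VULNERABLE"},
}

# Forbidding the negative traits works across vendors.
safe = candidates(hosts, forbidden=forbidden_traits(whitelist=()))

# Requiring both positive traits matches nothing: no host has both.
both_fixed = candidates(hosts, required={"INTEL_FIXED", "AMD_FIXED"})
```

With these four toy hosts, `safe` contains the two patched hosts and `both_fixed` is empty, which is the asymmetry the last paragraph describes.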
On 2019-05-15 16:50:58 -0500 (-0500), Eric Fried wrote:
(NB: I'm explicitly rendering "no opinion" on several items below so you know I didn't miss/ignore them.) [...]
(NBNB: Kashyap asked to be Cc'd as a non-subscriber, so I have added him back but you may want to forward him a copy of your reply.)
Does the Security Team have any strong opinions?
Still hoping someone speaks up in this capacity... [...]
I've added a link to this thread on the agenda for tomorrow's Security SIG meeting[*] in order to attempt to raise the visibility a bit with other members of the SIG. [*] http://eavesdrop.openstack.org/#Security_SIG_meeting -- Jeremy Stanley
On Thu, May 16, 2019 at 6:14 AM Jeremy Stanley <fungi@yuggoth.org> wrote:
On 2019-05-15 16:50:58 -0500 (-0500), Eric Fried wrote:
(NB: I'm explicitly rendering "no opinion" on several items below so you know I didn't miss/ignore them.) [...]
(NBNB: Kashyap asked to be Cc'd as a non-subscriber, so I have added him back but you may want to forward him a copy of your reply.)
Does the Security Team have any strong opinions?
Still hoping someone speaks up in this capacity... [...]
I've added a link to this thread on the agenda for tomorrow's Security SIG meeting[*] in order to attempt to raise the visibility a bit with other members of the SIG.
[*] http://eavesdrop.openstack.org/#Security_SIG_meeting -- Jeremy Stanley
I'm actually on the side of adding all the traits (cpu flags) and letting the operator make sure that their cloud is patched. We don't want to make assumptions on behalf of the user: if I am $chip_manufacturer and I want to use OpenStack to do CI for regression testing of these, then I don't have the ability to do it. The solution of introducing a flag that says "it's okay if it's vulnerable" opens a whole can of worms on a) keeping up to date with all the different vulnerabilities and b) potentially causing a lot of upgrade surprises when the flag you relied on is suddenly a "banned" one. I think we should empower our operators and let them decide what to do with their clouds. These recent CPU vulnerabilities are very 'massive' in terms of "PR" so usually most people know about them. -- Mohammed Naser, vexxhost
On 2019-05-15 22:11:46 +0000 (+0000), Jeremy Stanley wrote: [...]
Kashyap asked to be Cc'd as a non-subscriber [...]
Oops, as Eric rightly pointed out to me just now, Kashyap is a subscriber to openstack-discuss, just not to openstack-security where he originally sent it. My mistake! -- Jeremy Stanley
IMO this (cpu flags/features/attributes, and even possibly firmware patch levels, though probably not fan speed or temperature) is a perfectly suitable use of traits. Not all traits have to feed into Nova scheduling decisions; they could also be used by e.g. external orchestrators. os-traits needs to have that more global not-just-Nova perspective.
Clearly not everything has to feed into a Nova scheduling decision, by virtue of placement hoping to cater to things other than nova. That said, I do think that placement should try to avoid being "tags as a service" which this use-case is dangerously close to becoming, IMHO.
Okay, "stopping" / "refusing to launch" is too strict and unreasonable; scratch that.
I agree with this, for all the reasons stated.
Me too, and that'd be a Nova decision to do anything with the security flag or not.
we can potentially make Nova check the 'sysfs' directory for vulnerabilities.
IMO this is still a good idea, but rather than warning / refusing to boot, we could expose a roll-up trait, subject to the strawman design below.
And I think it's a bad idea. Honestly, if we're going to do this, why not query yum/apt and set a trait for has-updates-pending? Or has-major-update-available? Or dell-tells-us-there-is-a-bios-update-for-this-machine? Where does it end? Obviously I think it's up to the placement team to decide if they're going to put has-updates-pending in the set of standard traits. I'd vote for no, and Jay will be turning over in his grave shortly. However, I strenuously object to Nova becoming the agent for everything on the compute node, software, hardware, etc. If we're going to peek into kernel updatey things, I don't see how we explain to the next person that it's not okay to check to see if firefox is up to date. Further, if we do get into this business, who is to say that in the future, Nova doesn't get a CVE for failing to notice and report something? Like, do we need to put nova in the embargo box since it claims to be able to tell you if your stuff is vulnerable or not?
To summarize my position on the os-traits side of things:
- We can merge the feature-ish traits (assuming folks can agree on which ones those are). - We can merge the vulnerability traits as long as they come with nice comments explaining the potential security pitfalls around using them. - Or for all I care we can merge nothing, since we don't actually seem to have a demand for it.
Every vendor has a tool dedicated to monitoring for updates, applicable vulnerabilities, and for orchestrating that work. A deployment of any appreciable size monitors hardware inventory and can answer the questions of which hosts need a patch without having to ask Nova about it. There are plenty of reasons why you might not apply one update at all or on a specific schedule. This is well outside of Nova's scope.
The below would need a blueprint and a spec. And an owner. And it would be nice if it also had demand.
If we want to make scheduling decisions based on vulnerabilities, it needs to be under the exclusive control of the admin. As mentioned above, exposing the traits and allowing untrusted/untrustworthy users to target vulnerable hosts is only marginally worse than having those vulnerable hosts available to said untrusted users at all. So if we are going to have virt drivers expose a VULNERABLE trait in any form, it should come with:
Further, if placement is ever exposed to middle admins (i.e. domain admins, site admins in a larger deployment, etc) even read-only, presumably you'll need to be able to expose (or hide) the presence of a trait based on their security clearance.
1) a config option in the spirit of:
[scheduler] allow_scheduling_to_vulnerable_hosts = $bool (default: False)
which, when False, causes the scheduler to add trait:VULNERABLE=forbidden to *all* GET /a_c requests.
But we should generalize this to:
(a) Maintain a hardcoded list of traits that represent vulnerabilities or other undesirables (b) Have the conf option be [scheduler]evil_trait_whitelist (c) Add [trait:$X=forbidden for $X in {(b) - (a)}]
2) a hard check to disallow trait:$X=required from *anywhere* (flavor, image, etc.) regardless of the conf option. Either reject the boot request or explicitly strip that out.
For completeness, note that these traits need to be "negative" (i.e. "has vulnerability") so that we can forbid them in a list in the GET /a_c request. Because required=!INTEL_VULNERABLE,!AMD_VULNERABLE will correctly avoid vulnerable hosts from either vendor, but required=INTEL_FIXED,AMD_FIXED won't land anywhere, and we don't have required=in:INTEL_FIXED,AMD_FIXED yet.
I'm strong -3 on exposing VULNERABLE or NOT_VULNERABLE and +2 on SUPPORTS_SOMEACTUALCPUFLAG. It's trivial today for an operator to nova-disable all computes, and start enabling them as they are patched (automatically, with their patching tool). --Dan
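As a rough illustration of the "patching tool decides, Nova stays out of it" approach, an operator-side script could read the kernel's own report and use it to decide whether a compute service is ready to be re-enabled. The sysfs directory below is the one mentioned earlier in the thread; the function names and the "Vulnerable" substring heuristic are assumptions made for this sketch, and the actual enable/disable call to Nova is deliberately left out.

```python
from pathlib import Path

# Kernel-provided directory on recent Linux kernels. Each file holds one status
# line such as "Not affected", "Mitigation: ..." or "Vulnerable".
SYSFS_VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def unmitigated(vuln_dir=SYSFS_VULN_DIR):
    """Return {name: status} for entries the kernel reports as vulnerable.

    Heuristic (an assumption of this sketch): any status containing the
    capitalised word "Vulnerable" counts as unmitigated.
    """
    result = {}
    if not vuln_dir.is_dir():
        # No report available (old kernel, non-Linux host): nothing to list.
        return result
    for entry in sorted(vuln_dir.iterdir()):
        if not entry.is_file():
            continue
        status = entry.read_text().strip()
        if "Vulnerable" in status:
            result[entry.name] = status
    return result


def ready_to_enable(vuln_dir=SYSFS_VULN_DIR):
    """True when the kernel lists no unmitigated vulnerability."""
    return not unmitigated(vuln_dir)
```

A real tool would also want to treat a missing directory as "unknown" rather than "fine", and would apply whatever exceptions the operator's risk policy allows.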
On May 15, 2019, at 5:31 PM, Dan Smith <dms@danplanet.com> wrote:
That said, I do think that placement should try to avoid being "tags as a service" which this use-case is dangerously close to becoming, IMHO.
This. -- Ed Leafe
On Wed, 15 May 2019, Eric Fried wrote:
(NB: I'm explicitly rendering "no opinion" on several items below so you know I didn't miss/ignore them.)
I'm responding in this thread so that it's clear I'm not ignoring it. I don't have a strong opinion. I agree that availability of a trait in os-traits is not the same as nova reporting that trait when creating resource providers representing compute nodes. However, having something in os-traits that nobody is going to use is not without cost: Once something is in os-traits it must stay there forever. So if there's no pressing use case for these additions, maybe we just wait. Bit more within...
However, I'll state again for the record that vendor-specific "positive" traits (indicating "has mitigation", "not vulnerable", etc.) are nigh worthless for the Nova scheduling use case of "land me on a non-vulnerable host" because, until you can say required=in:HW_CPU_X86_INTEL_FIX,HW_CPU_X86_AMD_FIX, you would have to pick your CPU vendor ahead of time.
There's a spec for this, but it is currently on hold as there is neither immediate use cases demanding to be satisfied, nor anyone to do the work. https://review.opendev.org/649992
(Disclaimer: I'm a card-carrying "trait libertarian": freedom to do what makes sense with traits, as long as you're not hurting anyone and it's not costing the taxpayers.)
I guess that makes me a "trait anarcho communitarian". People should have the freedom to do what they like with traits and they aren't hurting anybody, but blessing a trait as official (by putting it in os-traits) is a strong signifier and has system-wide impacts that should be debated in ad-hoc committees endlessly until a consensus emerges which avoids anyone facepalming or rage quitting.

From a placement-the-service standpoint, it cares naught. It doesn't know what traits mean and cannot distinguish between official and custom traits when filtering candidates. It's important that placement be able to work easily with thousands or hundreds of thousands of traits. We very definitely do not want to be making authorization decisions based on the value of a trait and the status of the requestor.

As said elsewhere by several folk: It's how the other services use them that matters. I'm agnostic on nova reporting all the cpu flags/features/capabilities as traits. If it is going to do that, then having _those_ traits as members of os-traits is the right thing to do.

I'm less agnostic on users ever needing or wanting to be aware of specific cpu features in order to get a satisfactory workload placement. I want to be able to request high performance without knowing the required underlying features. Flavors + traits (which I don't have to understand) gets us that, so ... cool.
If we want to make scheduling decisions based on vulnerabilities, it needs to be under the exclusive control of the admin.
Others have said this (at least Dan): This seems like something where something other than nova ought to handle it. A host which shouldn't be scheduled to should be disabled (as a service).

-=-=-

This thread and several other conversations about traits and resource classes have made it pretty clear that the knowledge and experience required to make good decisions about what names should be in os-traits and os-resource-classes (and the form the names should take) is not exactly overlapping with what's required to be a core on the placement service.

How do people feel about the idea of forming a core group for those two repos that includes placement cores but has additions from nova (Dan, Kashyap and Sean would make good candidates) and other projects that consume them?

Having that group wouldn't remove the need for these extended conversations but would help make sure the right people were aware of changes and participating.

-- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent
I've added a link to this thread on the agenda for tomorrow's Security SIG meeting
This happened [1]. TL;DR: it does more potential good than harm to expose these traits ("scheduler roulette is not a security measure" --fungi).
Others have said this (at least Dan): This seems like something where something other than nova ought to handle it. A host which shouldn't be scheduled to should be disabled (as a service).
WFM. Scrap strawman. Given that it's not considered a security issue, we could expose the (low-level, CPU flag) traits so that "other than nova" can use them. If we think there's demand.
How do people feel about the idea of forming a core group for those two repos that includes placement cores but has additions from nova (Dan, Kashyap and Sean would make good candidates) and other projects that consume them?
++ efried [1] http://eavesdrop.openstack.org/irclogs/%23openstack-meeting/%23openstack-mee...
On 2019-05-16 10:42:47 -0500 (-0500), Eric Fried wrote: [...]
I've added a link to this thread on the agenda for tomorrow's Security SIG meeting
This happened [1]. TL;DR: it does more potential good than harm to expose these traits ("scheduler roulette is not a security measure" --fungi). [...]
To reiterate my position from the SIG meeting, I only really care whether or not processes which need to know about CPU details for enabling modes to cope with these vulnerabilities have access to them (generally so they can drop their own inefficient mitigations when presented with CPU flags which indicate they're unnecessary because the relevant microcode has been installed on the host or the particular chip lacks that design flaw entirely). That is security-relevant. Whether users want to be able to make scheduling choices based on those same flags, and whether the operators of those environments want to grant them the ability to do so, isn't really a security-relevant discussion point. I support providing a means for users to get good performance on secure systems by default. Anyone who wants to knowingly choose less secure systems to gain a performance boost, or to intentionally shuffle specific customer workloads onto less secure parts of their infrastructure is welcome to those features, but I don't consider that to really be a security topic. At that point it's more of a discussion about people making (hopefully well-informed) trade-offs for the sake of performance and efficiency. -- Jeremy Stanley
On Thu, May 16, 2019 at 10:42:47AM -0500, Eric Fried wrote:
I've added a link to this thread on the agenda for tomorrow's Security SIG meeting
This happened [1]. TL;DR: it does more potential good than harm to expose these traits ("scheduler roulette is not a security measure" --fungi).
Thanks for the summary, Eric. I've just read the relevant IRC log discussion. Thanks to everyone who's chimed in (Jeremy, et al).
Others have said this (at least Dan): This seems like something where something other than nova ought to handle it. A host which shouldn't be scheduled to should be disabled (as a service).
WFM. Scrap strawman.
ACK.
Given that it's not considered a security issue, we could expose the (low-level, CPU flag) traits so that "other than nova" can use them. If we think there's demand.
Okay, so I take it that all the relevant low-level CPU flags (including things like SSBD, et al) as proposed here[2][3] can be added to 'os-traits'. And tools _other_ than Nova can consume, if need be. Correct me if I misparsed.
How do people feel about the idea of forming a core group for those two repos that includes placement cores but has additions from nova (Dan, Kashyap and Sean would make good candidates) and other projects that consume them?
I'm fine participating, if I can provide useful input.
++
efried
[1] http://eavesdrop.openstack.org/irclogs/%23openstack-meeting/%23openstack-mee...
[2] https://review.opendev.org/#/c/655193/4/os_traits/hw/cpu/x86.py [3] https://review.opendev.org/#/c/655193/4/os_traits/hw/cpu/amd.py -- /kashyap
Okay, so I take it that all the relevant low-level CPU flags (including things like SSBD, et al) as proposed here[2][3] can be added to 'os-traits'.
Yes, subject to already-noted namespacing and spelling issues.
And tools _other_ than Nova can consume, if need be.
Nova should consume by having the driver expose the flags as appropriate. And switching on flaggage in domain xml if that's a thing. But that's all. No efforts to special-case scheduling decisions etc. efried .
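To make "the driver expose the flags as appropriate" concrete, here is a small sketch of turning low-level CPU flags into trait names. It parses /proc/cpuinfo-style text purely for illustration and is not how the Nova libvirt driver gathers CPU features. The `HW_CPU_X86_` prefix follows existing os-traits naming, but the specific trait names produced here are hypothetical, since namespacing and spelling were still under review.

```python
import re

# Illustrative only: the final trait names were still subject to review.
TRAIT_PREFIX = "HW_CPU_X86_"
FLAGS_OF_INTEREST = ("pcid", "ssbd", "stibp")


def cpu_flags(cpuinfo_text):
    """Return the flags from the first 'flags' line of /proc/cpuinfo-style text."""
    match = re.search(r"^flags[ \t]*:[ \t]*(.*)$", cpuinfo_text, flags=re.MULTILINE)
    return set(match.group(1).split()) if match else set()


def flags_to_traits(flags):
    """Map the flags we care about to (hypothetical) trait names, sorted."""
    return sorted(TRAIT_PREFIX + flag.upper() for flag in FLAGS_OF_INTEREST if flag in flags)


sample = "processor\t: 0\nflags\t\t: fpu pcid ssbd\nbugs\t\t: spectre_v1\n"
traits = flags_to_traits(cpu_flags(sample))
```

Nothing in this mapping makes a scheduling decision; it only reports what the hardware advertises, which matches the scope agreed on above.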
On Fri, May 17, 2019 at 11:25:24AM -0500, Eric Fried wrote:
Okay, so I take it that all the relevant low-level CPU flags (including things like SSBD, et al) as proposed here[2][3] can be added to 'os-traits'.
Yes, subject to already-noted namespacing and spelling issues.
Noted.
And tools _other_ than Nova can consume, if need be.
Nova should consume by having the driver expose the flags as appropriate. And switching on flaggage in domain xml if that's a thing. But that's all. No efforts to special-case scheduling decisions etc.
Nod; thanks for clarifying, Eric. -- /kashyap
On May 15, 2019, at 4:50 PM, Eric Fried <openstack@fried.cc> wrote:
There's no consensus here. Some think that we should _not_ allow those CPU flags as traits which can 'allow' you to target vulnerable hosts.
For what it's worth, I'm in this camp and have said so in other places where we have been discussing it.
Yep, noted.
My position is that it's not harmful to add them to os-traits; it's whether/how they're used in nova that needs some thought.
They may not be "harmful", but they set a very bad precedent. I don't want to see os-traits become "Oh, just dump the trait in there, and maybe someday someone will use it". -- Ed Leafe
participants (7)
- Chris Dent
- Dan Smith
- Ed Leafe
- Eric Fried
- Jeremy Stanley
- Kashyap Chamarthy
- Mohammed Naser