linux-kernel - Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <6c16fc37-bdf2-4925-8114-14f5a08c07e3@default>
Date:   Tue, 23 Jan 2018 03:13:05 -0800 (PST)
From:   Liran Alon <liran.alon@...cle.com>
To:     <dwmw2@...radead.org>
Cc:     <labbott@...hat.com>, <luto@...nel.org>,
        <Janakarajan.Natarajan@....com>, <torvalds@...ux-foundation.org>,
        <bp@...e.de>, <asit.k.mallick@...el.com>, <rkrcmar@...hat.com>,
        <dave.hansen@...el.com>, <karahmed@...zon.de>, <hpa@...or.com>,
        <mingo@...hat.com>, <jun.nakajima@...el.com>, <x86@...nel.org>,
        <ashok.raj@...el.com>, <arjan.van.de.ven@...el.com>,
        <tim.c.chen@...ux.intel.com>, <pbonzini@...hat.com>,
        <ak@...ux.intel.com>, <linux-kernel@...r.kernel.org>,
        <peterz@...radead.org>, <tglx@...utronix.de>,
        <gregkh@...uxfoundation.org>, <mhiramat@...nel.org>,
        <arjan@...ux.intel.com>, <thomas.lendacky@....com>,
        <dan.j.williams@...el.com>, <joro@...tes.org>,
        <kvm@...r.kernel.org>, <aarcange@...hat.com>
Subject: Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict
 Indirect Branch Speculation


----- dwmw2@...radead.org wrote:

> On Sun, 2018-01-21 at 14:27 -0800, Linus Torvalds wrote:
> > On Sun, Jan 21, 2018 at 2:00 PM, David Woodhouse
> <dwmw2@...radead.org> wrote:
> > >>
> > >> The patches do things like add the garbage MSR writes to the
> kernel
> > >> entry/exit points. That's insane. That says "we're trying to
> protect
> > >> the kernel".  We already have retpoline there, with less
> overhead.
> > >
> > > You're looking at IBRS usage, not IBPB. They are different
> things.
> > 
> > Ehh. Odd intel naming detail.
> > 
> > If you look at this series, it very much does that kernel
> entry/exit
> > stuff. It was patch 10/10, iirc. In fact, the patch I was replying
> to
> > was explicitly setting that garbage up.
> > 
> > And I really don't want to see these garbage patches just
> mindlessly
> > sent around.
> 
> I think we've covered the technical part of this now, not that you
> like
> it — not that any of us *like* it. But since the peanut gallery is
> paying lots of attention it's probably worth explaining it a little
> more for their benefit.
> 
> This is all about Spectre variant 2, where the CPU can be tricked
> into
> mispredicting the target of an indirect branch. And I'm specifically
> looking at what we can do on *current* hardware, where we're limited
> to
> the hacks they can manage to add in the microcode.
> 
> The new microcode from Intel and AMD adds three new features.
> 
> One new feature (IBPB) is a complete barrier for branch prediction.
> After frobbing this, no branch targets learned earlier are going to
> be
> used. It's kind of expensive (order of magnitude ~4000 cycles).
> 
> The second (STIBP) protects a hyperthread sibling from following
> branch
> predictions which were learned on another sibling. You *might* want
> this when running unrelated processes in userspace, for example. Or
> different VM guests running on HT siblings.
> 
> The third feature (IBRS) is more complicated. It's designed to be
> set when you enter a more privileged execution mode (i.e. the
> kernel).
> It prevents branch targets learned in a less-privileged execution
> mode,
> BEFORE IT WAS MOST RECENTLY SET, from taking effect. But it's not
> just
> a 'set-and-forget' feature, it also has barrier-like semantics and
> needs to be set on *each* entry into the kernel (from userspace or a
> VM
> guest). It's *also* expensive. And a vile hack, but for a while it
> was
> the only option we had.
> 
> Even with IBRS, the CPU cannot tell the difference between different
> userspace processes, and between different VM guests. So in addition
> to
> IBRS to protect the kernel, we need the full IBPB barrier on context
> switch and vmexit. And maybe STIBP while they're running.
> 
> Then along came Paul with the cunning plan of "oh, indirect branches
> can be exploited? Screw it, let's not have any of *those* then",
> which
> is retpoline. And it's a *lot* faster than frobbing IBRS on every
> entry
> into the kernel. It's a massive performance win.
> 
> So now we *mostly* don't need IBRS. We build with retpoline, use IBPB
> on context switches/vmexit (which is in the first part of this patch
> series before IBRS is added), and we're safe. We even refactored the
> patch series to put retpoline first.
> 
> But wait, why did I say "mostly"? Well, not everyone has a retpoline
> compiler yet... but OK, screw them; they need to update.
> 
> Then there's Skylake, and that generation of CPU cores. For
> complicated
> reasons they actually end up being vulnerable not just on indirect
> branches, but also on a 'ret' in some circumstances (such as 16+
> CALLs
> in a deep chain).
> 
> The IBRS solution, ugly though it is, did address that. Retpoline
> doesn't. There are patches being floated to detect and prevent deep
> stacks, and deal with some of the other special cases that bite on
> SKL,
> but those are icky too. And in fact IBRS performance isn't anywhere
> near as bad on this generation of CPUs as it is on earlier CPUs
> *anyway*, which makes it not quite so insane to *contemplate* using
> it
> as Intel proposed.
> 
> That's why my initial idea, as implemented in this RFC patchset, was
> to
> stick with IBRS on Skylake, and use retpoline everywhere else. I'll
> give you "garbage patches", but they weren't being "just mindlessly
> sent around". If we're going to drop IBRS support and accept the
> caveats, then let's do it as a conscious decision having seen what it
> would look like, not just drop it quietly because poor Davey is too
> scared that Linus might shout at him again. :)
> 
> I have seen *hand-wavy* analyses of the Skylake thing that mean I'm
> not
> actually lying awake at night fretting about it, but nothing concrete
> that really says it's OK.
> 
> If you view retpoline as a performance optimisation, which is how it
> first arrived, then it's rather unconventional to say "well, it only
> opens a *little* bit of a security hole but it does go nice and fast
> so
> let's do it".
> 
> But fine, I'm content with ditching the use of IBRS to protect the
> kernel, and I'm not even surprised. There's a *reason* we put it last
> in the series, as both the most contentious and most dispensable
> part.
> I'd be *happier* with a coherent analysis showing Skylake is still
> OK,
> but hey-ho, screw Skylake.
> 
> The early part of the series adds the new feature bits and detects
> when
> it can turn KPTI off on non-Meltdown-vulnerable Intel CPUs, and also
> supports the IBPB barrier that we need to make retpoline complete.
> That
> much I think we definitely *do* want. There have been a bunch of us
> working on this behind the scenes; one of us will probably post that
> bit in the next day or so.
> 
> I think we also want to expose IBRS to VM guests, even if we don't
> use
> it ourselves. Because Windows guests (and RHEL guests; yay!) do use
> it.
> 
> If we can be done with the shouty part, I'd actually quite like to
> have
> a sensible discussion about when, if ever, we do IBPB on context
> switch
> (ptraceability and dumpable have both been suggested) and when, if
> ever, we set STIPB in userspace.

It is also important to note that current solutions, as I understand it, still have info-leak issues.

If retpoline is being used, user-mode code can leak RSB entries created while CPU was in kernel-mode.
Therefore, breaking KASLR. In order to handle this, every exit from kernel-mode to user-mode should stuff RSB. In addition, this stuffing of RSB may need to be done from a fixed address to avoid leaking the address of the RSB stuffing itself. Same concept applies for VMEntry into guests. Hypervisor should stuff RSB just before VMEntry, otherwise guest will be able to leak RSB entries which reveals hypervisor addresses.

If IBRS is being used, things seems to be even worse.
IBRS prevents BTB entries created at lower prediction-mode from being used by higher prediction-mode code.
However, nothing seems to prevent lower prediction-mode code from using BTB entries of higher prediction-mode code. This means that user-mode code could leak BTB entries in order to break KASLR and guests could leaks host's BTB entries to reveal hypervisor addresses. This seems to be an issue even with future CPUs that will have "IBRS all-the-time" feature.
Note that this issue is not theoretical. This is exactly what Google's Project-Zero KVM PoC did. They leaked host's BTB entries to reveal kvm-intel.ko, kvm.ko & vmlinux addresses.
It seems that the correct way to really handle this scenario should be to tag every BTB entry with prediction-mode and make CPU only use BTB entries tagged with current prediction-mode. Therefore, entirely separating the BTB entries between prediction-modes. That, in my opinion, should replace the IBRS-feature.

-Liran