Date: Tue, 25 Aug 2020 10:28:53 -0700
From: Andy Lutomirski <luto@...nel.org>
To: Sean Christopherson <sean.j.christopherson@...el.com>
Cc: Andy Lutomirski <luto@...nel.org>, Andrew Cooper <andrew.cooper3@...rix.com>, Thomas Gleixner <tglx@...utronix.de>, LKML <linux-kernel@...r.kernel.org>, X86 ML <x86@...nel.org>, Linus Torvalds <torvalds@...ux-foundation.org>, Tom Lendacky <thomas.lendacky@....com>, Pu Wen <puwen@...on.cn>, Stephen Hemminger <sthemmin@...rosoft.com>, Sasha Levin <alexander.levin@...rosoft.com>, Dirk Hohndel <dirkhh@...are.com>, Jan Kiszka <jan.kiszka@...mens.com>, Tony W Wang-oc <TonyWWang-oc@...oxin.com>, "H. Peter Anvin" <hpa@...ux.intel.com>, Asit Mallick <asit.k.mallick@...el.com>, Gordon Tetlow <gordon@...lows.org>, David Kaplan <David.Kaplan@....com>, Tony Luck <tony.luck@...el.com>
Subject: Re: TDX #VE in SYSCALL gap (was: [RFD] x86: Curing the exception and syscall trainwreck in hardware)

On Tue, Aug 25, 2020 at 10:19 AM Sean Christopherson
<sean.j.christopherson@...el.com> wrote:
>
> On Tue, Aug 25, 2020 at 09:49:05AM -0700, Andy Lutomirski wrote:
> > On Mon, Aug 24, 2020 at 9:40 PM Sean Christopherson
> > <sean.j.christopherson@...el.com> wrote:
> > >
> > > +Andy
> > >
> > > On Mon, Aug 24, 2020 at 02:52:01PM +0100, Andrew Cooper wrote:
> > > > And to help with coordination, here is something prepared (slightly)
> > > > earlier.
> > > >
> > > > https://docs.google.com/document/d/1hWejnyDkjRRAW-JEsRjA5c9CKLOPc6VKJQsuvODlQEI/edit?usp=sharing
> > > >
> > > > This identifies the problems from software's perspective, along with
> > > > proposing behaviour which ought to resolve the issues.
> > > >
> > > > It is still a work-in-progress. The #VE section still needs updating in
> > > > light of the publication of the recent TDX spec.
> > >
> > > For #VE on memory accesses in the SYSCALL gap (or NMI entry), is this
> > > something we (Linux) as the guest kernel actually want to handle gracefully
> > > (where gracefully means not panicking)? For TDX, a #VE in the SYSCALL gap
> > > would require one of two things:
> > >
> > > a) The guest kernel to not accept/validate the GPA->HPA mapping for the
> > > relevant pages, e.g. code or scratch data.
> > >
> > > b) The host VMM to remap the GPA (making the GPA->HPA pending again).
> > >
> > > (a) is only possible if there's a fatally buggy guest kernel (or perhaps vBIOS).
> > > (b) requires either a buggy or malicious host VMM.
> > >
> > > I ask because, if the answer is "no, panic at will", then we shouldn't need
> > > to burn an IST for TDX #VE. Exceptions won't morph to #VE and hitting an
> > > instruction-based #VE in the SYSCALL gap would be a CPU bug or a kernel bug.
> >
> > Or malicious hypervisor action, and that's a problem.
> >
> > Suppose the hypervisor remaps a GPA used in the SYSCALL gap (e.g. the
> > actual SYSCALL text or the first memory it accesses -- I don't have a
> > TDX spec so I don't know the details).
>
> You can thank our legal department :-)
>
> > The user does SYSCALL, the kernel hits the funny GPA, and #VE is delivered.
> > The microcode will write the IRET frame, with mostly user-controlled contents,
> > wherever RSP points, and RSP is also user controlled. Calling this a "panic"
> > is charitable -- it's really game over against an attacker who is moderately
> > clever.
> >
> > The kernel can't do anything about this because it's game over before
> > the kernel has had the chance to execute any instructions.
>
> Hrm, I was thinking that SMAP=1 would give the necessary protections, but
> in typing that out I realized userspace can throw in an RSP value that
> points at kernel memory. Duh.
>
> One thought would be to have the TDX module (thing that runs in SEAM and
> sits between the VMM and the guest) provide a TDCALL (hypercall from guest
> to TDX module) to the guest that would allow the guest to specify a very
> limited number of GPAs that must never generate a #VE, e.g. go straight to
> guest shutdown if a disallowed GPA would go pending. That seems doable
> from a TDX perspective without incurring noticeable overhead (assuming the
> list of GPAs is very small) and should be easy to support in the guest,
> e.g. make a TDCALL/hypercall or two during boot to protect the SYSCALL
> page and its scratch data.

I guess you could do that, but this is getting gross. The x86 architecture has really gone off the rails here.
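Andy's "game over" point comes down to where the CPU writes the exception frame. SYSCALL enters CPL 0 without loading a kernel stack. A #VE raised before the entry code switches stacks is therefore a CPL 0 to CPL 0 exception: unless its IDT entry uses an IST stack, the frame is pushed at whatever RSP holds, which is still the user's value. As Sean notes, SMAP does not help once that value points at kernel memory. The Python sketch below is a simplified model under those assumptions, not a CPU emulator. It assumes the 16-byte RSP alignment x86-64 performs before pushing the frame and a five-qword #VE frame with no error code. The addresses are made up for illustration.

```python
# Simplified model (not a CPU emulator) of where an x86-64 CPU writes the
# exception frame when a #VE is delivered while already at CPL 0, e.g. in
# the SYSCALL gap before the kernel has switched to its own stack.

VE_FRAME_QWORDS = 5  # SS, RSP, RFLAGS, CS, RIP; #VE pushes no error code


def ve_frame_range(rsp_at_fault, ist_stack_top=None):
    """Return the (low, high) byte range [low, high) written for the #VE frame.

    rsp_at_fault:  RSP when the #VE hits. SYSCALL does not load a kernel
                   stack, so in the SYSCALL gap this is still the user's value.
    ist_stack_top: top of the IST stack if the #VE IDT entry uses IST, else
                   None (a CPL 0 -> CPL 0 exception without IST keeps RSP).
    """
    base = rsp_at_fault if ist_stack_top is None else ist_stack_top
    base &= ~0xF  # in 64-bit mode RSP is 16-byte aligned before the pushes
    return base - 8 * VE_FRAME_QWORDS, base


# Hypothetical attack: user space sets RSP to a kernel address, then SYSCALL.
user_rsp = 0xFFFF888000001008
low, high = ve_frame_range(user_rsp)
print(hex(low), hex(high))  # 0xffff888000000fd8 0xffff888000001000

# With a (hypothetical) IST stack, the frame lands on a kernel-chosen stack.
low, high = ve_frame_range(user_rsp, ist_stack_top=0xFFFFFE0000010000)
print(hex(low), hex(high))  # 0xfffffe000000ffd8 0xfffffe0000010000
```

The IST variant shows why "burning an IST" for #VE comes up at all: it is the only way, in this model, for the frame to land on a kernel-chosen stack rather than an attacker-chosen address.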
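Sean's proposal can be read as a small allow-list kept by the TDX module. At boot, the guest registers the few GPAs the SYSCALL gap touches. If the host later tries to make one of them pending, the TD is shut down instead of ever taking a #VE there. The toy Python model below is only a sketch of that idea. The class and method names, the four-entry limit, and the 4 KiB page granularity are all invented for illustration. None of them is a real TDX module interface or TDCALL leaf.

```python
from enum import Enum

PAGE_MASK = ~0xFFF  # assumption: 4 KiB pages only


class Outcome(Enum):
    PENDING = "GPA pending again; next guest access would raise #VE"
    TD_SHUTDOWN = "protected GPA; TD shut down instead of delivering #VE"


class HypotheticalTdxModule:
    """Toy model of the proposed 'never #VE' GPA list. All names are invented."""

    MAX_PROTECTED = 4  # assumption: "a very limited number of GPAs"

    def __init__(self):
        self.protected = set()  # pages the guest registered at boot
        self.pending = set()    # pages the host has made pending again
        self.shut_down = False

    def tdcall_protect_gpa(self, gpa):
        """Hypothetical TDCALL the guest makes during boot."""
        page = gpa & PAGE_MASK
        if page not in self.protected and len(self.protected) >= self.MAX_PROTECTED:
            raise ValueError("protected-GPA list is full")
        self.protected.add(page)

    def host_remap(self, gpa):
        """Host VMM remaps a GPA, which would make the GPA->HPA mapping pending."""
        if self.shut_down:
            raise RuntimeError("TD has already been shut down")
        page = gpa & PAGE_MASK
        if page in self.protected:
            self.shut_down = True  # refuse the remap and stop the TD
            return Outcome.TD_SHUTDOWN
        self.pending.add(page)
        return Outcome.PENDING


tdx = HypotheticalTdxModule()
tdx.tdcall_protect_gpa(0x1000000)  # hypothetical GPA of the SYSCALL entry page
tdx.tdcall_protect_gpa(0x1002000)  # hypothetical GPA of its scratch data
print(tdx.host_remap(0x5000000))   # Outcome.PENDING
print(tdx.host_remap(0x1000040))   # Outcome.TD_SHUTDOWN
```

Keeping the list tiny and fixed-size matches the "without incurring noticeable overhead" condition in the proposal. The check on each remap is a single set lookup, and the guest only pays for one or two calls at boot.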