linux-kernel - Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CALCETrUZm3KsonYVd5sn=LhoGZ2ciO7xT_Fz=jD_HZ04tB9o=Q@mail.gmail.com>
Date:   Wed, 5 Sep 2018 14:31:28 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Andy Lutomirski <luto@...nel.org>, X86 ML <x86@...nel.org>,
        Borislav Petkov <bp@...en8.de>,
        LKML <linux-kernel@...r.kernel.org>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        Adrian Hunter <adrian.hunter@...el.com>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        Joerg Roedel <joro@...tes.org>, Jiri Olsa <jolsa@...hat.com>,
        Andi Kleen <ak@...ux.intel.com>
Subject: Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline

On Tue, Sep 4, 2018 at 12:04 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
>> The SYSCALL64 trampoline has a couple of nice properties:
>>
>>  - The usual sequence of SWAPGS followed by two GS-relative accesses to
>>    set up RSP is somewhat slow because the GS-relative accesses need
>>    to wait for SWAPGS to finish.  The trampoline approach allows
>>    RIP-relative accesses to set up RSP, which avoids the stall.
>>
>>  - The trampoline avoids any percpu access before CR3 is set up,
>>    which means that no percpu memory needs to be mapped in the user
>>    page tables.  This prevents using Meltdown to read any percpu memory
>>    outside the cpu_entry_area and prevents using timing leaks
>>    to directly locate the percpu areas.
>>
>> The downsides of using a trampoline may outweigh the upsides, however.
>> It adds an extra non-contiguous I$ cache line to system calls, and it
>> forces an indirect jump to transfer control back to the normal kernel
>> text after CR3 is set up.  The latter is because x86 lacks a 64-bit
>> direct jump instruction that could jump from the trampoline to the entry
>> text.  With retpolines enabled, the indirect jump is extremely slow.
>>
>> This patch changes the code to map the percpu TSS into the user page
>> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
>> This does not add a new direct information leak, since the TSS is
>> readable by Meltdown from the cpu_entry_area alias regardless.  It
>> does allow a timing attack to locate the percpu area, but KASLR is
>> more or less a lost cause against local attack on CPUs vulnerable to
>> Meltdown regardless.  As far as I'm concerned, on current hardware,
>> KASLR is only useful to mitigate remote attacks that try to attack
>> the kernel without first gaining RCE against a vulnerable user
>> process.
>>
>> On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces
>> syscall overhead from ~237ns to ~228ns.
>>
>> There is a possible alternative approach: we could instead move the
>> trampoline within 2G of the entry text and make a separate copy for
>> each CPU.  Then we could use a direct jump to rejoin the normal
>> entry path.
>
> Can we have a few words on why this solution and not this alternative? I
> mean, you raise the possibility, but then surely you chose not to
> implement that. Might as well share that with us.

I can give some pros and cons.  With the other approach:

 - We avoid a pipeline stall.
 - We execute from an extra page and read from another extra page
during the syscall.  (The latter is because we need to use a relative
addressing mode to find sp1 -- it's the same *cacheline* we'd use
anyway, but we're accessing it using an alias, so it's an extra TLB
entry.)
 - We use more memory.  This would be one page per CPU for a simple
implementation and 64-ish bytes per CPU or one page per node for a
more complex implementation.
 - More code complexity.

I'm not convinced this is a good tradeoff.