Message-ID: <CALCETrURzP9N8xHd7+prOoJWZ690R2261-tHwgwnRu4pwn62VA@mail.gmail.com>
Date: Sat, 21 Jul 2018 17:02:17 -0700
From: Andy Lutomirski <luto@...nel.org>
To: Dave Hansen <dave.hansen@...el.com>
Cc: Andy Lutomirski <luto@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
LKML <linux-kernel@...r.kernel.org>,
"the arch/x86 maintainers" <x86@...nel.org>
Subject: Re: kernel %rsp code at sysenter PTI vs no-PTI
On Thu, Jul 5, 2018 at 10:14 AM, Dave Hansen <dave.hansen@...el.com> wrote:
> The PTI path does this:
>
> ...
>         SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
>         /* Load the top of the task stack into RSP */
>         movq    CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
>
> And the non-PTI entry path does this:
>
> ...
>         movq    %rsp, PER_CPU_VAR(rsp_scratch)
>         movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>
> Both "mov ___, %rsp" instructions have the kernel %GS value in place and
> both are running on a good kernel CR3. Does anybody remember why we
> don't use cpu_current_top_of_stack in the PTI-on case?
>
> I'm wondering if it was because we, at some point, did the mov ...,
> %rsp before CR3 was good. But it doesn't look like we do that now, so
> should we maybe make both copies do:
>
>         movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
Speed, sort of. Without the CR3 switch there (i.e. PTI off, but
trampoline still in use, which is the path that actually gets used),
there's no forced serialization between swapgs and that movq. And it
turns out that the RIP-relative load avoids a pipeline stall that the
%gs-relative access right after swapgs would cause. So, with all the
mitigations off, the trampoline ends up being *faster*, at least in a
tight loop, than the non-trampolined path.
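
For concreteness, a minimal annotated sketch of the two sequences
(simplified from the entry code of that era; the CR3-switch macro is
patched out when PTI is off, and RSP_SCRATCH is a fixed entry-area
address -- treat the details as illustrative, not the exact source):

        /* Trampoline path: both memory operands are fixed virtual
         * addresses in the CPU entry area, so neither load depends
         * on the GS base that swapgs just changed. */
        swapgs
        movq    %rsp, RSP_SCRATCH
        movq    CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp

        /* Non-trampoline path: both memory operands are %gs-relative,
         * so they can't complete until the new GS base from swapgs
         * is available -- the stall described above. */
        swapgs
        movq    %rsp, PER_CPU_VAR(rsp_scratch)
        movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp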
Of course, on a retpolined kernel, the retpoline at the end kills performance.
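
(Background for that last point: the trampoline runs from its fixed
CPU-entry-area alias, so it can't reach the randomized kernel text
with a relative jump and has to end in an indirect jump.  With
retpolines enabled, that tail turns into roughly the following -- a
sketch simplified from the code of that era, local labels made up:)

        pushq   %rdi
        movq    $entry_SYSCALL_64_stage2, %rdi
        jmp     __x86_indirect_thunk_rdi        /* JMP_NOSPEC %rdi */

__x86_indirect_thunk_rdi:
        call    .Ldo_jmp
.Lspec_trap:
        pause                           /* speculation is trapped here */
        lfence
        jmp     .Lspec_trap
.Ldo_jmp:
        movq    %rdi, (%rsp)            /* overwrite the return address */
        ret                             /* "return" to the real target */

The ret always mispredicts (the RSB entry points at the speculation
trap), so every pass through the trampoline eats roughly a branch-miss
penalty on top of the entry path itself.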
>
> for consistency?