Message-ID: <CALCETrURzP9N8xHd7+prOoJWZ690R2261-tHwgwnRu4pwn62VA@mail.gmail.com>
Date:   Sat, 21 Jul 2018 17:02:17 -0700
From:   Andy Lutomirski <luto@...nel.org>
To:     Dave Hansen <dave.hansen@...el.com>
Cc:     Andy Lutomirski <luto@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>,
        "the arch/x86 maintainers" <x86@...nel.org>
Subject: Re: kernel %rsp code at sysenter PTI vs no-PTI

On Thu, Jul 5, 2018 at 10:14 AM, Dave Hansen <dave.hansen@...el.com> wrote:
> The PTI path does this:
>
>         ...
>         SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
>         /* Load the top of the task stack into RSP */
>         movq    CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
>
> And the non-PTI entry path does this:
>
>         ...
>         movq    %rsp, PER_CPU_VAR(rsp_scratch)
>         movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>
> Both "mov ___, %rsp" instructions have the kernel %GS value in place and
> both are running on a good kernel CR3.  Does anybody remember why we
> don't use cpu_current_top_of_stack in the PTI-on case?
>
> I'm wondering if it was because we, at some point, did the mov ...,
> %rsp before CR3 was good.  But it doesn't look like we do that now, so
> should we maybe make both copies do:
>
>         movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

Speed, sort of.  Without the CR3 switch there (i.e. PTI off, but
trampoline still in use, which is the path that actually gets used),
there's no forced serialization between swapgs and that movq.  And it
turns out that the RIP-relative load avoids a pipeline stall that the
%gs-relative access right after swapgs would cause.  So, with all the
mitigations off, the trampoline ends up being *faster*, at least in a
tight loop, than the non-trampolined path.
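
To make the comparison concrete, here's a rough side-by-side sketch (not
the literal entry_64.S text; RSP_SCRATCH stands for the trampoline's
scratch slot in cpu_entry_area, and the CPU_ENTRY_AREA macro resolves
RIP-relative from the trampoline mapping):

        /* Trampoline: both loads are RIP-relative into cpu_entry_area,
         * so neither depends on the GSBASE that swapgs just changed.
         * SWITCH_TO_KERNEL_CR3 is skipped when PTI is off. */
        swapgs
        movq    %rsp, RSP_SCRATCH
        SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
        movq    CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp

        /* Non-trampoline path: %gs-relative per-cpu accesses right after
         * swapgs, which is where the stall shows up. */
        swapgs
        movq    %rsp, PER_CPU_VAR(rsp_scratch)
        movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp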

Of course, on a retpolined kernel, the retpoline at the end kills performance.
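
(The trampoline copy can't reach the real entry point with a direct jump,
so it ends in an indirect jump; on a retpoline build that jump turns into
the usual thunk.  A sketch of the generic pattern, not the exact
JMP_NOSPEC expansion:)

        /* Retpoline form of "jmp *%rdi", roughly: speculation gets trapped
         * in the pause/lfence loop while the ret does the real transfer. */
        call    1f
2:      pause
        lfence
        jmp     2b
1:      movq    %rdi, (%rsp)        /* overwrite return address with target */
        ret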

>
> for consistency?
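
(For reference, the consistency change being asked about would just have
the trampoline do the same %gs-relative load as the non-PTI path --
an untested sketch:

        SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
        movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp

-- which should work, since GS and CR3 are already the kernel's at that
point per Dave's note above, just without the RIP-relative trick.)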
