linux-kernel - Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code

Message-ID: <CALCETrU2oEiHiqb9gu+ZnDU+zOMk+JqDG2dYFVHsAh5xm2tGtw@mail.gmail.com>
Date:	Wed, 17 Jun 2015 07:23:49 -0700
From:	Andy Lutomirski <luto@...capital.net>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	Andy Lutomirski <luto@...nel.org>, X86 ML <x86@...nel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Frédéric Weisbecker <fweisbec@...il.com>,
	Rik van Riel <riel@...hat.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Denys Vlasenko <vda.linux@...glemail.com>,
	Borislav Petkov <bp@...en8.de>,
	Kees Cook <keescook@...omium.org>,
	Brian Gerst <brgerst@...il.com>
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code

On Wed, Jun 17, 2015 at 3:32 AM, Ingo Molnar <mingo@...nel.org> wrote:
>
> * Andy Lutomirski <luto@...nel.org> wrote:
>
>> The main things that are missing are that I haven't done the 32-bit parts
>> (anyone want to help?) and therefore I haven't deleted the old C code.  I also
>> think this may break UML for trivial reasons.
>
> So I'd suggest moving most of the SYSRET fast path to C too.
>
> This is how it looks now, after your patches:
>
>         testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz     tracesys
> entry_SYSCALL_64_fastpath:
> #if __SYSCALL_MASK == ~0
>         cmpq    $__NR_syscall_max, %rax
> #else
>         andl    $__SYSCALL_MASK, %eax
>         cmpl    $__NR_syscall_max, %eax
> #endif
>         ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
>         movq    %r10, %rcx
>         call    *sys_call_table(, %rax, 8)
>         movq    %rax, RAX(%rsp)
> 1:
> /*
>  * Syscall return path ending with SYSRET (fast path).
>  * Has incompletely filled pt_regs.
>  */
>         LOCKDEP_SYS_EXIT
>         /*
>          * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
>          * it is too small to ever cause noticeable irq latency.
>          */
>         DISABLE_INTERRUPTS(CLBR_NONE)
>
>         /*
>          * We must check ti flags with interrupts (or at least preemption)
>          * off because we must *never* return to userspace without
>          * processing exit work that is enqueued if we're preempted here.
>          * In particular, returning to userspace with any of the one-shot
>          * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
>          * very bad.
>          */
>         testl   $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz     int_ret_from_sys_call_irqs_off  /* Go to the slow path */
>
> Most of that can be done in C.
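
(Not part of the patch series: a minimal user-space sketch of how the quoted
fast path might read in C. Every demo_* name, the two-entry table and the mask
value are hypothetical stand-ins, not kernel definitions. The sketch assumes
the assembly stub has already saved the registers and preloaded the return
slot with -ENOSYS, as the quoted comment says.)

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for __SYSCALL_MASK and __NR_syscall_max. */
#define DEMO_SYSCALL_MASK   0xffffffffffffffffULL  /* ~0: nothing masked off */
#define DEMO_NR_SYSCALL_MAX 1u

/* Simplified register frame; the real pt_regs has more fields. */
struct demo_regs {
	uint64_t orig_ax;                  /* syscall number from userspace */
	long ax;                           /* return value, preloaded with -ENOSYS */
	uint64_t di, si, dx, r10, r8, r9;  /* the six syscall arguments */
};

typedef long (*demo_syscall_fn)(uint64_t, uint64_t, uint64_t,
				uint64_t, uint64_t, uint64_t);

/* Two toy "syscalls" so the table has something to dispatch to. */
static long demo_add(uint64_t a, uint64_t b, uint64_t c,
		     uint64_t d, uint64_t e, uint64_t f)
{
	(void)c; (void)d; (void)e; (void)f;
	return (long)(a + b);
}

static long demo_fourth_arg(uint64_t a, uint64_t b, uint64_t c,
			    uint64_t d, uint64_t e, uint64_t f)
{
	(void)a; (void)b; (void)c; (void)e; (void)f;
	return (long)d;
}

static const demo_syscall_fn demo_call_table[DEMO_NR_SYSCALL_MAX + 1] = {
	demo_add,
	demo_fourth_arg,
};

/* Bounds-check the syscall number and dispatch through the table. */
static long demo_do_syscall(struct demo_regs *regs)
{
	uint64_t nr = regs->orig_ax;

	/* Mirrors the andl/cmpl variant: mask, then compare 32 bits. */
	if (DEMO_SYSCALL_MASK != ~0ULL)
		nr = (uint32_t)nr & DEMO_SYSCALL_MASK;

	/* Out of range: leave the preloaded -ENOSYS in place ("ja 1f"). */
	if (nr > DEMO_NR_SYSCALL_MAX)
		return regs->ax;

	/* "movq %r10, %rcx" becomes passing r10 as the fourth C argument. */
	regs->ax = demo_call_table[nr](regs->di, regs->si, regs->dx,
				       regs->r10, regs->r8, regs->r9);
	return regs->ax;
}
```

With a frame whose orig_ax is 0, di is 2 and si is 3, demo_do_syscall()
stores and returns 5; an out-of-range number leaves -ENOSYS untouched. The
interrupt disabling and the TIF work-flag test are left out because they have
no user-space equivalent.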
>
> And I think we could convert the IRET syscall return slow path to C too:
>
> GLOBAL(int_ret_from_sys_call)
>         SAVE_EXTRA_REGS
>         movq    %rsp, %rdi
>         call    syscall_return_slowpath /* returns with IRQs disabled */
>         RESTORE_EXTRA_REGS
>
>         /*
>          * Try to use SYSRET instead of IRET if we're returning to
>          * a completely clean 64-bit userspace context.
>          */
>         movq    RCX(%rsp), %rcx
>         movq    RIP(%rsp), %r11
>         cmpq    %rcx, %r11                      /* RCX == RIP */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
>          * in kernel space.  This essentially lets the user take over
>          * the kernel, since userspace controls RSP.
>          *
>          * If width of "canonical tail" ever becomes variable, this will need
>          * to be updated to remain correct on both old and new CPUs.
>          */
>         .ifne __VIRTUAL_MASK_SHIFT - 47
>         .error "virtual address width changed -- SYSRET checks need update"
>         .endif
>
>         /* Change top 16 bits to be the sign-extension of 47th bit */
>         shl     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>         sar     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>
>         /* If this changed %rcx, it was not canonical */
>         cmpq    %rcx, %r11
>         jne     opportunistic_sysret_failed
>
>         cmpq    $__USER_CS, CS(%rsp)            /* CS must match SYSRET */
>         jne     opportunistic_sysret_failed
>
>         movq    R11(%rsp), %r11
>         cmpq    %r11, EFLAGS(%rsp)              /* R11 == RFLAGS */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
>          * restoring TF results in a trap from userspace immediately after
>          * SYSRET.  This would cause an infinite loop whenever #DB happens
>          * with register state that satisfies the opportunistic SYSRET
>          * conditions.  For example, single-stepping this user code:
>          *
>          *           movq       $stuck_here, %rcx
>          *           pushfq
>          *           popq %r11
>          *   stuck_here:
>          *
>          * would never get past 'stuck_here'.
>          */
>         testq   $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
>         jnz     opportunistic_sysret_failed
>
>         /* nothing to check for RSP */
>
>         cmpq    $__USER_DS, SS(%rsp)            /* SS must match SYSRET */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * We win! This label is here just for ease of understanding
>          * perf profiles. Nothing jumps here.
>          */
> syscall_return_via_sysret:
>         /* rcx and r11 are already restored (see code above) */
>         RESTORE_C_REGS_EXCEPT_RCX_R11
>         movq    RSP(%rsp), %rsp
>         USERGS_SYSRET64
>
> opportunistic_sysret_failed:
>         SWAPGS
>         jmp     restore_c_regs_and_iret
> END(entry_SYSCALL_64)
>
>
> Basically there would be a single C function we'd call, which returns a condition
> (or fixes up its return address on the stack directly) to determine between the
> SYSRET and IRET return paths.
>
> Moving this to C too has immediate benefits: that way we could easily add
> instrumentation to see how efficient these various return methods are, etc.
>
> I.e. I don't think there are two ways about this: once the entry code moves into
> C, we get the most benefit by moving as much of it as possible.

This is almost certainly true.  There are a lot more cleanups possible here.
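
(Again not from the series: a hedged sketch of the "single C function that
returns a condition" idea, restating the quoted opportunistic-SYSRET checks in
C and adding the kind of counter Ingo mentions. All demo_* names are
hypothetical. The selector and flag values are the usual x86-64 ones but are
stated here as assumptions, and the 47-bit shift is taken from the quoted
.ifne check.)

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, standing in for the kernel's own macros. */
#define DEMO_VIRTUAL_MASK_SHIFT 47
#define DEMO_USER_CS            0x33ULL
#define DEMO_USER_DS            0x2bULL
#define DEMO_EFLAGS_TF          0x00000100ULL
#define DEMO_EFLAGS_RF          0x00010000ULL

/* Only the saved-register slots that the checks look at. */
struct demo_frame {
	uint64_t r11, cx, ip, cs, flags, sp, ss;
};

enum demo_exit { DEMO_EXIT_SYSRET = 0, DEMO_EXIT_IRET = 1 };

/* Instrumentation: how often each return method was chosen. */
static unsigned long demo_exit_count[2];

/*
 * Canonical means bits 63..47 are all equal.  This is the portable
 * equivalent of the shl/sar pair followed by a compare.
 */
static bool demo_is_canonical(uint64_t addr)
{
	uint64_t top = addr >> DEMO_VIRTUAL_MASK_SHIFT;   /* 17 bits */

	return top == 0 || top == 0x1ffff;
}

/* Decide between SYSRET and IRET for the saved frame, and count it. */
static enum demo_exit demo_choose_exit(const struct demo_frame *f)
{
	enum demo_exit how = DEMO_EXIT_IRET;

	if (f->cx == f->ip &&                      /* RCX == RIP */
	    demo_is_canonical(f->ip) &&            /* avoid #GP in kernel mode */
	    f->cs == DEMO_USER_CS &&               /* CS must match SYSRET */
	    f->r11 == f->flags &&                  /* R11 == RFLAGS */
	    !(f->flags & (DEMO_EFLAGS_RF | DEMO_EFLAGS_TF)) &&
	    f->ss == DEMO_USER_DS)                 /* SS must match SYSRET */
		how = DEMO_EXIT_SYSRET;

	demo_exit_count[how]++;
	return how;
}
```

The assembly side would then only branch on the returned value: restore
registers and SYSRET in one case, SWAPGS and jump to the IRET path in the
other. As in the quoted code, nothing is checked for RSP.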

I want to nail down the 32-bit case first so we can delete the old code.

>
> The only low-level bits remaining in assembly will be hardware ABI details:
> saving and restoring registers in the expected format - no 'active' code
> whatsoever.

I think this is true for syscalls.  Getting the weird special cases
(IRET and GS fault) for error_entry to work correctly in C could be
tricky.

--Andy