linux-kernel - Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150617103226.GA30325@gmail.com>
Date:	Wed, 17 Jun 2015 12:32:26 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Andy Lutomirski <luto@...nel.org>
Cc:	x86@...nel.org, linux-kernel@...r.kernel.org,
	Frédéric Weisbecker <fweisbec@...il.com>,
	Rik van Riel <riel@...hat.com>,
	Oleg Nesterov <oleg@...hat.com>,
	Denys Vlasenko <vda.linux@...glemail.com>,
	Borislav Petkov <bp@...en8.de>,
	Kees Cook <keescook@...omium.org>,
	Brian Gerst <brgerst@...il.com>
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code


* Andy Lutomirski <luto@...nel.org> wrote:

> The main things that are missing are that I haven't done the 32-bit parts 
> (anyone want to help?) and therefore I haven't deleted the old C code.  I also 
> think this may break UML for trivial reasons.

So I'd suggest moving most of the SYSRET fast path to C too.

This is how it looks like now after your patches:

	testl	$_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
	jnz	tracesys
entry_SYSCALL_64_fastpath:
#if __SYSCALL_MASK == ~0
	cmpq	$__NR_syscall_max, %rax
#else
	andl	$__SYSCALL_MASK, %eax
	cmpl	$__NR_syscall_max, %eax
#endif
	ja	1f				/* return -ENOSYS (already in pt_regs->ax) */
	movq	%r10, %rcx
	call	*sys_call_table(, %rax, 8)
	movq	%rax, RAX(%rsp)
1:
/*
 * Syscall return path ending with SYSRET (fast path).
 * Has incompletely filled pt_regs.
 */
	LOCKDEP_SYS_EXIT
	/*
	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
	 * it is too small to ever cause noticeable irq latency.
	 */
	DISABLE_INTERRUPTS(CLBR_NONE)

	/*
	 * We must check ti flags with interrupts (or at least preemption)
	 * off because we must *never* return to userspace without
	 * processing exit work that is enqueued if we're preempted here.
	 * In particular, returning to userspace with any of the one-shot
	 * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
	 * very bad.
	 */
	testl	$_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
	jnz	int_ret_from_sys_call_irqs_off	/* Go to the slow path */

Most of that can be done in C.

And I think we could also convert the IRET syscall return slow path to C too:

GLOBAL(int_ret_from_sys_call)
	SAVE_EXTRA_REGS
	movq	%rsp, %rdi
	call	syscall_return_slowpath	/* returns with IRQs disabled */
	RESTORE_EXTRA_REGS

	/*
	 * Try to use SYSRET instead of IRET if we're returning to
	 * a completely clean 64-bit userspace context.
	 */
	movq	RCX(%rsp), %rcx
	movq	RIP(%rsp), %r11
	cmpq	%rcx, %r11			/* RCX == RIP */
	jne	opportunistic_sysret_failed

	/*
	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
	 * in kernel space.  This essentially lets the user take over
	 * the kernel, since userspace controls RSP.
	 *
	 * If width of "canonical tail" ever becomes variable, this will need
	 * to be updated to remain correct on both old and new CPUs.
	 */
	.ifne __VIRTUAL_MASK_SHIFT - 47
	.error "virtual address width changed -- SYSRET checks need update"
	.endif

	/* Change top 16 bits to be the sign-extension of 47th bit */
	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx

	/* If this changed %rcx, it was not canonical */
	cmpq	%rcx, %r11
	jne	opportunistic_sysret_failed

	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
	jne	opportunistic_sysret_failed

	movq	R11(%rsp), %r11
	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
	jne	opportunistic_sysret_failed

	/*
	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
	 * restoring TF results in a trap from userspace immediately after
	 * SYSRET.  This would cause an infinite loop whenever #DB happens
	 * with register state that satisfies the opportunistic SYSRET
	 * conditions.  For example, single-stepping this user code:
	 *
	 *           movq	$stuck_here, %rcx
	 *           pushfq
	 *           popq %r11
	 *   stuck_here:
	 *
	 * would never get past 'stuck_here'.
	 */
	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
	jnz	opportunistic_sysret_failed

	/* nothing to check for RSP */

	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
	jne	opportunistic_sysret_failed

	/*
	 * We win! This label is here just for ease of understanding
	 * perf profiles. Nothing jumps here.
	 */
syscall_return_via_sysret:
	/* rcx and r11 are already restored (see code above) */
	RESTORE_C_REGS_EXCEPT_RCX_R11
	movq	RSP(%rsp), %rsp
	USERGS_SYSRET64

opportunistic_sysret_failed:
	SWAPGS
	jmp	restore_c_regs_and_iret
END(entry_SYSCALL_64)


Basically there would be a single C function we'd call, which returns a condition 
(or fixes up its return address on the stack directly) to determine between the 
SYSRET and IRET return paths.

Moving this to C too has immediate benefits: that way we could easily add 
instrumentation to see how efficient these various return methods are, etc.

I.e. I don't think there's two ways about this: once the entry code moves to the 
domain of C code, we get the best benefits by moving as much of it as possible. 

The only low level bits remaining in assembly will be low level hardware ABI 
details: saving registers and restoring registers to the expected format - no 
'active' code whatsoever.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/