linux-kernel - Re: [PATCH 4/5] x86/asm/entry/32: Replace RESTORE_RSI_RDI[

Message-ID: <5582A47B.4020802@redhat.com>
Date:	Thu, 18 Jun 2015 12:59:07 +0200
From:	Denys Vlasenko <dvlasenk@...hat.com>
To:	Ingo Molnar <mingo@...nel.org>
CC:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Steven Rostedt <rostedt@...dmis.org>,
	Borislav Petkov <bp@...en8.de>,
	"H. Peter Anvin" <hpa@...or.com>,
	Andy Lutomirski <luto@...capital.net>,
	Oleg Nesterov <oleg@...hat.com>,
	Frederic Weisbecker <fweisbec@...il.com>,
	Alexei Starovoitov <ast@...mgrid.com>,
	Will Drewry <wad@...omium.org>,
	Kees Cook <keescook@...omium.org>, x86@...nel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 4/5] x86/asm/entry/32: Replace RESTORE_RSI_RDI[_RDX] with
 open-coded 32-bit reads

On 06/18/2015 11:31 AM, Ingo Molnar wrote:
>> If it takes, say, 20 cycles to pull data from e.g. L3 cache to ECX,
>> then SYSRET can't possibly complete sooner than in 20 cycles.
>
> Yeah, that's true, but my point is: SYSRET has to do a lot of other things
> (permission checks, loading the user mode state - most of which are unrelated to
> R11/RCX), which take dozens of cycles,

SYSRET was designed to avoid doing that. It does not check permissions:
it slam-dunks CPL3 and resets CS and SS to preset values.
It does not touch the stack register or restore any other GP register.
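
To make the point above concrete, here is a rough C model of the state a
compat-mode SYSRET changes. It is only a sketch, not kernel code: the names
struct cpu_model, sysret32_model() and star_sysret_base (standing in for
bits 63:48 of the STAR MSR) are illustrative assumptions, and the masking of
reserved flag bits and the hidden segment descriptor state are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified view of the architectural state relevant to SYSRET. */
struct cpu_model {
	uint64_t rip, rflags, rcx, r11, rsp;
	uint16_t cs, ss;
	uint8_t  cpl;
};

/*
 * Model of a 32-bit (compat mode) SYSRET.
 * star_sysret_base is a hypothetical stand-in for STAR[63:48].
 */
static void sysret32_model(struct cpu_model *c, uint16_t star_sysret_base)
{
	assert(c != NULL);
	c->rip    = (uint32_t)c->rcx;	/* return address comes from ECX */
	c->rflags = c->r11;		/* flags come from R11 (masking omitted) */
	c->cs     = (uint16_t)(star_sysret_base | 3);	/* preset selector, RPL 3 */
	c->ss     = (uint16_t)((star_sysret_base + 8) | 3);
	c->cpl    = 3;			/* forced, not checked */
	/* rsp and every other GP register are left untouched */
}
```

The only inputs are RCX and R11, which is why their load latency is the part
that can delay the instruction.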

Having said that, I'd try to get cold hard facts, i.e. experimentally
measure SYSRET latency.


> and which are probably overlapped with any
> cache misses on arguments such as R11/RCX.
>
> It's not impossible that reordering helps, for example if SYSRET has some internal
> dependencies that make its parallelism worse than ideal - but I'd complicate this
> code only if it gives a measurable improvement for cache-cold syscall performance.

I attempted to test it, using a patch which moves the ECX and R11 loads all the way down:

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index f2064bd..0ea09a3 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -139,9 +139,6 @@ sysexit_from_sys_call:
 	 * with 'sysenter' and it uses the SYSENTER calling convention.
 	 */
 	andl	$~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
-	/* Prepare registers for SYSRET insn */
-	movl	RIP(%rsp), %ecx		/* User %eip */
-	movl	EFLAGS(%rsp), %r11d	/* User eflags *
 	/* Restore registers per SYSEXIT ABI requirements: */
 	/* arg1 (ebx): preserved by virtue of being a callee-saved register */
 	/* arg2 (ecx): used by SYSEXIT to restore esp (and by SYSRET to restore eip) */
@@ -155,6 +152,9 @@ sysexit_from_sys_call:
 	xorl	%r8d, %r8d
 	xorl	%r9d, %r9d
 	xorl	%r10d, %r10d
+	/* Prepare registers for SYSRET insn */
+	movl	RIP(%rsp), %ecx		/* User %eip */
+	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	TRACE_IRQS_ON

 	/*
@@ -374,9 +374,6 @@ cstar_dispatch:

 sysretl_from_sys_call:
 	andl	$~TS_COMPAT, ASM_THREAD_INFO(TI_status, %rsp, SIZEOF_PTREGS)
-	/* Prepare registers for SYSRET insn */
-	movl	RIP(%rsp), %ecx		/* User %eip */
-	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	/* Restore registers per SYSRET ABI requirements: */
 	/* arg1 (ebx): preserved by virtue of being a callee-saved register */
 	/* arg2 (ebp): preserved (already restored, see above) */
@@ -388,6 +385,9 @@ sysretl_from_sys_call:
 	xorl	%r8d, %r8d
 	xorl	%r9d, %r9d
 	xorl	%r10d, %r10d
+	/* Prepare registers for SYSRET insn */
+	movl	RIP(%rsp), %ecx		/* User %eip */
+	movl	EFLAGS(%rsp), %r11d	/* User eflags */
 	TRACE_IRQS_ON
 	movl	RSP(%rsp), %esp
 	/*

This does not change instruction sizes, and therefore code cacheline
alignment stays the same across the entire bzImage.


Testing getpid() in a loop (IOW: a cache-hot test) did show that with
this patch it is slower, but by a statistically insignificant amount:

before patch, it's 61.92 ns per syscall.
after patch, it's  61.99 ns per syscall.

That's less than one cycle, more like 0.15 cycles.
However, it is reproducible.
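
The test program behind these numbers is not shown in this message. A minimal
sketch of such a cache-hot measurement could look like the following; the
helper names (now_ns, getpid_ns_per_call, ns_to_cycles) and the use of
clock_gettime() are assumptions for illustration only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average cost of one getpid() syscall over iters back-to-back calls. */
static double getpid_ns_per_call(unsigned long iters)
{
	uint64_t start;

	assert(iters > 0);
	start = now_ns();
	for (unsigned long i = 0; i < iters; i++)
		syscall(SYS_getpid);	/* raw syscall, so libc cannot cache the pid */
	return (double)(now_ns() - start) / (double)iters;
}

/* Convert a time difference to cycles for an assumed clock frequency. */
static double ns_to_cycles(double ns, double ghz)
{
	return ns * ghz;
}
```

To exercise the compat entry code it would have to be built as a 32-bit
binary (e.g. gcc -m32 -O2), and which entry instruction the C library ends up
using depends on the system. The CPU clock is not stated here; with an assumed
clock of roughly 2.2 GHz, ns_to_cycles(61.99 - 61.92, 2.2) gives about 0.15.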

I did not figure out how to do a cache-cold test.
I tried a 65kbyte-ish read from "/dev/zero". That takes ~3885 ns,
and its variability of +-10 ns drowns out any possible difference.
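
For reference, a sketch of how such a read could be timed while also
recording its spread. The actual test code is not shown in this message;
struct timing, clock_ns() and time_dev_zero_reads() are hypothetical names,
and the buffer length is a parameter since the exact size is not given.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

struct timing { double mean_ns, min_ns, max_ns; };

/* Monotonic clock reading in nanoseconds. */
static uint64_t clock_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time runs reads of len bytes from /dev/zero; returns 0 on success. */
static int time_dev_zero_reads(size_t len, unsigned runs, struct timing *out)
{
	double sum = 0.0, min = 0.0, max = 0.0;
	char *buf;
	int fd, ret = 0;

	assert(len > 0 && runs > 0 && out != NULL);
	buf = malloc(len);
	fd = open("/dev/zero", O_RDONLY);
	if (!buf || fd < 0)
		ret = -1;

	for (unsigned i = 0; ret == 0 && i < runs; i++) {
		uint64_t t0 = clock_ns();
		ssize_t n = read(fd, buf, len);
		double dt = (double)(clock_ns() - t0);

		if (n != (ssize_t)len) {
			ret = -1;	/* short read or error: give up */
			break;
		}
		sum += dt;
		if (i == 0 || dt < min)
			min = dt;
		if (i == 0 || dt > max)
			max = dt;
	}
	if (ret == 0) {
		out->mean_ns = sum / runs;
		out->min_ns = min;
		out->max_ns = max;
	}
	if (fd >= 0)
		close(fd);
	free(buf);
	return ret;
}
```

If max_ns - min_ns is on the order of 10 ns or more, an effect of a fraction
of a cycle cannot be resolved this way.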
