Date:	Sun, 21 Jun 2015 08:52:07 +0200
From:	Ingo Molnar <mingo@...nel.org>
To:	Alexey Dobriyan <adobriyan@...il.com>
Cc:	hpa@...or.com, x86@...nel.org, linux-kernel@...r.kernel.org,
	Andy Lutomirski <luto@...nel.org>,
	Borislav Petkov <bp@...en8.de>,
	Brian Gerst <brgerst@...il.com>,
	Denys Vlasenko <dvlasenk@...hat.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: [PATCH 1/2] x86: use alternatives for clear_user()


* Alexey Dobriyan <adobriyan@...il.com> wrote:

> Alternatives allow picking faster code: REP STOSQ, REP STOSB or else. Default to 
> REP STOSQ (as memset() does).
> 
> Sticking printk() into clear_user() and booting Debian showed that it is usually 
> executed with 4-8 bytes of alignment and ~2000-4000 bytes of length. On this 
> scale the difference should be noticeable.
> 
> Also make __clear_user() more, shall we say, "modern": stop storing zeroes via a 
> zeroed register, and replace INC/DEC with ADD/SUB for dependency breakage.

Yeah, so I like the direction of these changes (modulo the suggestions I make for 
the patch, see further below), but note that our (new) policy with x86 assembly 
patches is pretty brutal:

1)

This patch should be split into at least a series of 3 patches:

  - patch #1 moves the old code into the new file

  - patch #2 modernizes the assembly ops of the old MOVQ method

  - patch #3 adds the alternatives patching to introduce the STOSB and STOSQ
    methods

2)

Please cite before/after /usr/bin/size comparisons of the .o (and of vmlinux where 
sensible) in the changelog (you can also do objdump -d comparisons). Especially 
for the first and second patch this will show that the move is an invariant, and 
what the effects of the modernization of the MOVQ method are.
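
For example, something like this, run before and after each patch (the object 
file name is just a guess at whatever the moved code ends up being called):

  size arch/x86/lib/clear_user_64.o          # path is a guess
  objdump -d arch/x86/lib/clear_user_64.o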

3)

We require benchmark numbers for changes to such commonly used APIs: please add 
clear_user() support to 'perf bench', as an extension to the existing memcpy and 
memset benchmarks:

 triton:~/tip> ls -l tools/perf/bench/mem-*x86*.S
 -rw-rw-r-- 1 mingo mingo 413 Jun 18 22:53 tools/perf/bench/mem-memcpy-x86-64-asm.S
 -rw-rw-r-- 1 mingo mingo 414 Jun 18 22:53 tools/perf/bench/mem-memset-x86-64-asm.S

You can run them with something like:

  galatea:~> taskset 1 perf bench mem memset -o -i 10000 -l 1600 -r all

  # Running 'mem/memset' benchmark:
  Routine default (Default memset() provided by glibc)
  # Copying 1600 Bytes ...
  
        44.348694 GB/Sec (with prefault)
  Routine x86-64-unrolled (unrolled memset() in arch/x86/lib/memset_64.S)
  # Copying 1600 Bytes ...
  
        24.548865 GB/Sec (with prefault)
  Routine x86-64-stosq (movsq-based memset() in arch/x86/lib/memset_64.S)
  # Copying 1600 Bytes ...
  
        27.645939 GB/Sec (with prefault)
  Routine x86-64-stosb (movsb-based memset() in arch/x86/lib/memset_64.S)
  # Copying 1600 Bytes ...
  
        47.760132 GB/Sec (with prefault)

You might want to test the specific length values that your printk profiling has 
shown to be characteristic, via the -l <length> parameter, because some of these 
routines are sensitive to alignment and length details.
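
For example, to match the ~2000-4000 byte lengths your printk profiling found:

  taskset 1 perf bench mem memset -o -i 10000 -l 4000 -r all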

Note that this kind of benchmarking matters: for example, the box where I ran the 
above memset test clearly shows that the STOSB method is superior.

I can help out with extending 'perf bench' with clear_user measurements if you'd 
like.

> +# unsigned long __clear_user(void __user *, unsigned long);
> +ENTRY(__clear_user)
> +	CFI_STARTPROC

There's no CFI_* in the x86 tree anymore, please send a patch against tip:master.

> +
> +	ALTERNATIVE_2 "jmp __clear_user_movq",			\
> +		"", X86_FEATURE_REP_GOOD,			\
> +		"jmp __clear_user_rep_stosb", X86_FEATURE_ERMS

Can we move this into clear_user(), and patch in CALL instructions instead of 
jumps? There's no reason to do these extra jumps.
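
I.e. something like the (totally untested) sketch below, modeled on how 
copy_user_generic() does CALL patching in arch/x86/include/asm/uaccess_64.h. The 
routine names are the ones from your patch, plus the __clear_user_rep_stosq name 
suggested further below for the fall-through code:

/* untested sketch - the asm routines from the patch, by name: */
unsigned long __clear_user_movq(void __user *to, unsigned long n);
unsigned long __clear_user_rep_stosq(void __user *to, unsigned long n);
unsigned long __clear_user_rep_stosb(void __user *to, unsigned long n);

static __always_inline unsigned long
__clear_user(void __user *to, unsigned long n)
{
        unsigned long ret;

        /*
         * Patch in a CALL to the best routine for this CPU: the MOVQ
         * loop by default, REP STOSQ on REP_GOOD CPUs, REP STOSB on
         * ERMS CPUs.
         */
        alternative_call_2(__clear_user_movq,
                           __clear_user_rep_stosq, X86_FEATURE_REP_GOOD,
                           __clear_user_rep_stosb, X86_FEATURE_ERMS,
                           ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (n)),
                           "1" (to), "2" (n)
                           : "memory", "rcx");
        return ret;
}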

> +

So for consistency's sake I'd put a label here that names the default function 
__clear_user_rep_stosq. So that we know what it is when it shows up in 'perf top'.

(With the 'CALL' patching approach this would become __clear_user_rep_stosq in a 
natural fashion - so in that case the extra label is not needed.)

> +	ASM_STAC
> +	xor	%eax, %eax
> +	mov	%rsi, %rcx
> +	and	$7, %esi
> +	shr	$3, %rcx
> +1:	rep stosq
> +	mov	%esi, %ecx
> +2:	rep stosb
> +3:
> +	mov	%rcx, %rax
> +	ASM_CLAC
> +	ret

So I'd switch the ASM_CLAC with the MOV, because flags manipulation probably has 
higher latency than a simple register move. (the same reason we do the STAC as the 
first step)

> +ENTRY(__clear_user_movq)
> +	CFI_STARTPROC
> +
> +	ASM_STAC

With JMP patching: every __clear_user_* method starts with a STAC, so I'd move 
the STAC into the __clear_user() function prologue, before the __clear_user_* 
alternatives patching site - this might also reduce flag value dependencies by 
the time the flags are used.

With CALL patching: this needs to be in the specific functions, like you have it 
now.

> +	mov	%rsi, %rcx
> +	and	$7, %esi
> +	shr	$3, %rcx
> +	jz	2f
> +	.p2align 4

There's no need to align this small loop - the NOP only slows things down.

> +1:
> +	movq	$0, (%rdi)
> +	add	$8, %rdi
> +	sub	$1, %rcx
> +	jnz	1b
> +2:
> +	mov	%esi, %ecx
> +	test	%ecx, %ecx
> +	jz	4f
> +	.p2align 4

Ditto.

> +3:
> +	movb	$0, (%rdi)
> +	add	$1, %rdi
> +	sub	$1, %ecx
> +	jnz	3b
> +4:
> +	mov	%rcx, %rax
> +	ASM_CLAC

I'd flip this one too.

> +	ret
> +
> +	.section .fixup,"ax"
> +5:	lea	(%rsi,%rcx,8),%rcx
> +	jmp	4b
> +	.previous
> +
> +	_ASM_EXTABLE(1b,5b)
> +	_ASM_EXTABLE(3b,4b)
> +
> +	CFI_ENDPROC
> +ENDPROC(__clear_user_movq)
> +
> +ENTRY(__clear_user_rep_stosb)
> +	CFI_STARTPROC
> +
> +	ASM_STAC
> +	xor	%eax, %eax
> +	mov	%rsi, %rcx
> +1:	rep stosb
> +2:
> +	mov	%rcx, %rax
> +	ASM_CLAC

And I'd flip this one too.

> +	ret
> +
> +	_ASM_EXTABLE(1b,2b)
> +
> +	CFI_ENDPROC
> +ENDPROC(__clear_user_rep_stosb)

> @@ -53,6 +19,7 @@ unsigned long clear_user(void __user *to, unsigned long n)
>  	return n;
>  }
>  EXPORT_SYMBOL(clear_user);
> +EXPORT_SYMBOL(__clear_user);

Also, please consider inlining clear_user()'s access_ok() check, so that the only 
out-of-line function left is __clear_user(). (If so, that should be a separate 
patch.)
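
I.e. something like this (sketch) in arch/x86/include/asm/uaccess_64.h, so that 
only __clear_user() stays out of line:

static __always_inline unsigned long clear_user(void __user *to, unsigned long n)
{
        /* only the access_ok() check is inlined, the real work stays in __clear_user() */
        if (access_ok(VERIFY_WRITE, to, n))
                return __clear_user(to, n);
        return n;
}

(The out-of-line clear_user() definition and its EXPORT_SYMBOL should then be 
able to go away.)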

Thanks,

	Ingo