linux-kernel - Re: x86 memcpy performance


Message-ID: <20110909134233.GA1147@gere.osrc.amd.com>
Date:	Fri, 9 Sep 2011 15:42:33 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	Maarten Lankhorst <m.b.lankhorst@...il.com>
CC:	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Valdis.Kletnieks@...edu" <Valdis.Kletnieks@...edu>,
	Ingo Molnar <mingo@...e.hu>,
	melwyn lobo <linux.melwyn@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: x86 memcpy performance

On Fri, Sep 09, 2011 at 01:23:09PM +0200, Maarten Lankhorst wrote:
> This specific one happened far more than any of the other memcpy usages, and
> ignoring the check when the destination is page aligned, most of them are gone.
> 
> In short: I don't think I can get a speedup by using avx memcpy in-kernel.
> 
> YMMV; if it does speed up for you, I'd love to see concrete numbers, and not only for the worst
> case but for the common aligned cases too. Or some concrete numbers showing that misalignment
> happens a lot for you.

Actually, assuming alignment matters, I'd need to redo the trace_printk
run I initially did on buffer sizes:

http://marc.info/?l=linux-kernel&m=131331602309340 (kernel_build.sizes attached)

to get a better grasp of the alignment of kernel buffers along with
their sizes, and to see whether we're doing a lot of unaligned
large-buffer copies in the kernel. I seriously doubt it, though; we
should be doing everything pagewise anyway, so...

Concerning numbers, I ran your version again and sorted the output by
speedup. The highest scores are:

30037(12/44)	5566.4		12797.2		2.299011642
28672(12/44)	5512.97		12588.7		2.283467991
30037(28/60)	5610.34		12732.7		2.269502799
27852(12/44)	5398.36		12242.4		2.267803859
30037(4/36)	5585.02		12598.6		2.25578257
28672(28/60)	5499.11		12317.5		2.239914033
27852(28/60)	5349.78		11918.9		2.227919527
27852(20/52)	5335.92		11750.7		2.202186795
24576(12/44)	4991.37		10987.2		2.201247446

and this is pretty cool. Here are the (0/0) cases:

8192(0/0)       2627.82         3038.43         1.156255766
12288(0/0)      3116.62         3675.98         1.179475031
13926(0/0)      3330.04         4077.08         1.224334839
14336(0/0)      3377.95         4067.24         1.204055286
15018(0/0)      3465.3          4215.3          1.216430725
16384(0/0)      3623.33         4442.38         1.226050715
24576(0/0)      4629.53         6021.81         1.300737559
27852(0/0)      5026.69         6619.26         1.316823133
28672(0/0)      5157.73         6831.39         1.324495749
30037(0/0)      5322.01         6978.36         1.3112261

It is not 2x anymore, but still.

Anyway, these buffer sizes are rather ridiculous: even if some workload
does hit them, they won't recur often enough per second to be relevant.
So we'll see...

Thanks.

-- 
Regards/Gruss,
Boris.

Attachment: "kernel_build.sizes" (text/plain, 926 bytes)