Date:	Fri, 9 Sep 2011 10:14:07 +0200
From:	Borislav Petkov <bp@...en8.de>
To:	Maarten Lankhorst <m.b.lankhorst@...il.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Borislav Petkov <bp@...64.org>,
	"Valdis.Kletnieks@...edu" <Valdis.Kletnieks@...edu>,
	Ingo Molnar <mingo@...e.hu>,
	melwyn lobo <linux.melwyn@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: x86 memcpy performance

On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote:
> I have changed your sse memcpy to test various alignments with
> source/destination offsets instead of random, from that you can
> see that you don't really get a speedup at all. It seems to be more
> a case of 'kernel memcpy is significantly slower with some alignments',
> than 'avx memcpy is just that much faster'.
> 
> For example 3754 with src misalignment 4 and target misalignment 20
> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy

Right, so the idea is to check whether, with the bigger buffer sizes
(and misaligned ones, although that shouldn't be the case that often in
the kernel), the SSE version outperforms a "rep movs" whose ucode
optimizations don't kick in.

With your version modified back to SSE memcpy (don't have an AVX box
right now) I get on an AMD F10h:

...
16384(12/40)    4756.24         7867.74         1.654192552
16384(40/12)    5067.81         6068.71         1.197500008
16384(12/44)    4341.3          8474.96         1.952172387
16384(44/12)    4277.13         7107.64         1.661777347
16384(12/48)    4989.16         7964.54         1.596369011
16384(48/12)    4644.94         6499.5          1.399264281
...

which look like pretty nice numbers to me. I can't say whether we ever
copy a 16K buffer in the kernel, but if we did... Buffers <16K also show
up to a 1.5x speedup, though, so I'd say it's a uarch thing. As I said,
it would be best to put it in the kernel and run a bunch of
benchmarks...

> The modified testcase is attached, I did some optimizations in avx
> memcpy, but I fear I may be missing something, when I tried to put it
> in the kernel, it complained about sata errors I never had before,
> so I immediately went for the power button to prevent more errors,
> fortunately it only corrupted some kernel object files, and btrfs
> threw checksum errors. :)

Well, your version should do something similar to what _mmx_memcpy does:
save FPU state and not execute in IRQ context.
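Roughly, the pattern _mmx_memcpy follows (arch/x86/lib/mmx_32.c) looks
like this. This is a non-runnable kernel-side sketch, not the actual
code; sse_memcpy_body() is a placeholder name for the copy loop itself:

```c
/* Sketch: guard an SSE copy the way _mmx_memcpy guards MMX. */
static void *sse_memcpy(void *to, const void *from, size_t len)
{
	if (!irq_fpu_usable())		/* IRQ context: FPU state not saved */
		return memcpy(to, from, len);

	kernel_fpu_begin();		/* save the user's FPU/SSE state */
	sse_memcpy_body(to, from, len);	/* the actual SSE copy loop */
	kernel_fpu_end();		/* restore it */
	return to;
}
```

Touching xmm registers without this is exactly the kind of thing that
corrupts unrelated state and shows up as the mysterious errors described
above.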

> All in all I think testing in userspace is safer, you might want to
> run it on an idle cpu with schedtool, with a high fifo priority, and
> set cpufreq governor to performance.

No, you need a generic system with default settings - otherwise it is
blatant benchmark lying :-)

-- 
Regards/Gruss,
    Boris.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
