linux-kernel - Re: x86 memcpy performance


Message-ID: <4E689FC5.8010005@gmail.com>
Date:	Thu, 08 Sep 2011 12:58:13 +0200
From:	Maarten Lankhorst <m.b.lankhorst@...il.com>
To:	Borislav Petkov <bp@...en8.de>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Borislav Petkov <bp@...64.org>,
	"Valdis.Kletnieks@...edu" <Valdis.Kletnieks@...edu>,
	Ingo Molnar <mingo@...e.hu>,
	melwyn lobo <linux.melwyn@...il.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"H. Peter Anvin" <hpa@...or.com>,
	Thomas Gleixner <tglx@...utronix.de>,
	Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: x86 memcpy performance

On 09/08/2011 10:35 AM, Borislav Petkov wrote:
> On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote:
>> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
>> <m.b.lankhorst@...il.com> wrote:
>>> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
>>> and I finally figured out why. I also extended the test to an optimized avx memcpy,
>>> but I think the kernel memcpy will always win in the aligned case.
>> "rep movs" is generally optimized in microcode on most modern Intel
>> CPU's for some easyish cases, and it will outperform just about
>> anything.
>>
>> Atom is a notable exception, but if you expect performance on any
>> general loads from Atom, you need to get your head examined. Atom is a
>> disaster for anything but tuned loops.
>>
>> The "easyish cases" depend on microarchitecture. They are improving,
>> so long-term "rep movs" is the best way regardless, but for most
>> current ones it's something like "source aligned to 8 bytes *and*
>> source and destination are equal "mod 64"".
>>
>> And that's true in a lot of common situations. It's true for the page
>> copy, for example, and it's often true for big user "read()/write()"
>> calls (but "often" may not be "often enough" - high-performance
>> userland should strive to align read/write buffers to 64 bytes, for
>> example).
>>
>> Many other cases of "memcpy()" are the fairly small, constant-sized
>> ones, where the optimal strategy tends to be "move words by hand".
> Yeah,
>
> this probably makes enabling SSE memcpy in the kernel a task
> with diminishing returns. There are also the additional costs of
> saving/restoring FPU context in the kernel which eat off from any SSE
> speedup.
>
> And then there's the additional I$ pressure because "rep movs" is
> much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the
> smallest (two-byte) instructions I could use - in the AVX case they can
> get up to 4 Bytes of length with the VEX prefix and the additional SIB,
> size override, etc. fields.
>
> Oh, and then there's copy_*_user which also does fault handling and
> replacing that with a SSE version of memcpy could get quite hairy quite
> fast.
>
> Anyway, I'll try to benchmark an asm version of SSE memcpy in the kernel
> when I get the time to see whether it still makes sense, at all.
>
I have changed your sse memcpy test to step through specific
source/destination offsets instead of random alignments. From that you
can see that you don't really get a speedup at all. It seems to be more
a case of 'kernel memcpy is significantly slower with some alignments'
than 'avx memcpy is just that much faster'.

For example, a copy of size 3754 with source misalignment 4 and target
misalignment 20 takes 1185 units with avx memcpy, but 1480 units with
kernel memcpy.
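
To make the alignment sweep concrete, here is a minimal userspace sketch in C. It is not the attached testcase: the names `rep_movs_friendly`, `copy_fn` and `time_copy` are made up for illustration, the predicate only encodes the rough rule of thumb quoted above (source aligned to 8 bytes, source and destination equal mod 64), and the timing is plain `clock_gettime()` nanoseconds rather than the units quoted in this mail. The empty asm barrier assumes GCC or Clang.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Copy routine under test; libc memcpy has a compatible signature. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Rule of thumb from the quoted text (an approximation, not a spec):
 * source aligned to 8 bytes and source/destination equal mod 64. */
static int rep_movs_friendly(const void *dst, const void *src)
{
    uintptr_t s = (uintptr_t)src, d = (uintptr_t)dst;
    return (s % 8 == 0) && (s % 64 == d % 64);
}

/* Time `iters` copies of `len` bytes with the given misalignments
 * (each offset must be below 64). Returns elapsed nanoseconds. */
static long long time_copy(copy_fn fn, size_t len,
                           size_t src_off, size_t dst_off, int iters)
{
    void *sp = NULL, *dp = NULL;
    struct timespec t0, t1;

    assert(src_off < 64 && dst_off < 64);
    /* 64-byte aligned bases, so the offsets are the true misalignment. */
    if (posix_memalign(&sp, 64, len + 64) != 0 ||
        posix_memalign(&dp, 64, len + 64) != 0)
        abort();

    unsigned char *src = (unsigned char *)sp + src_off;
    unsigned char *dst = (unsigned char *)dp + dst_off;

    for (size_t i = 0; i < len; i++)
        src[i] = (unsigned char)(i * 31u + 7u);
    memset(dst, 0, len);               /* also pre-faults the pages */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {
        fn(dst, src, len);
        /* Compiler barrier so the copies are not optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    /* A fast but wrong memcpy is worthless, so verify the result. */
    assert(memcmp(dst, src, len) == 0);

    free(sp);
    free(dp);
    return (t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (t1.tv_nsec - t0.tv_nsec);
}
```

Calling `time_copy(memcpy, 3754, 4, 20, iters)` corresponds to the case above, with libc memcpy standing in for whichever routine is being measured. Looping both offsets over 0..63 gives the full table, and `rep_movs_friendly()` can mark which combinations the rule of thumb would expect to be fast.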

The modified testcase is attached. I did some optimizations in avx memcpy,
but I fear I may be missing something: when I tried to put it in the kernel, it
complained about SATA errors I never had before, so I immediately went for
the power button to prevent more errors. Fortunately it only corrupted some
kernel object files, and btrfs threw checksum errors. :)

All in all I think testing in userspace is safer. You might want to run it on an
idle CPU with schedtool, with a high FIFO priority, and set the cpufreq governor
to performance.
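
If you would rather not depend on schedtool, roughly the same setup can be done from inside the benchmark. The sketch below is only an assumption about how one might do it, and `pin_and_prioritize` is a made-up helper name: it pins the calling thread to one CPU with `sched_setaffinity()` and requests SCHED_FIFO with `sched_setscheduler()`. The latter normally needs root or CAP_SYS_NICE, so the helper reports failure instead of aborting.

```c
#define _GNU_SOURCE             /* for CPU_SET() and sched_setaffinity() */
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to `cpu` and request SCHED_FIFO at `prio`.
 * Returns 0 on success, -1 if either step fails (errno is set). */
static int pin_and_prioritize(int cpu, int prio)
{
    cpu_set_t set;
    struct sched_param sp = { .sched_priority = prio };

    assert(cpu >= 0 && cpu < CPU_SETSIZE);

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }

    /* Usually requires root or CAP_SYS_NICE. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return -1;
    }
    return 0;
}
```

The cpufreq governor is not touched here and still has to be set to performance separately before the run.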

~Maarten

Download attachment "memcpy.tar.gz" of type "application/x-gzip" (4352 bytes)