Message-ID: <20110908083551.GA5646@liondog.tnic>
Date: Thu, 8 Sep 2011 10:35:51 +0200
From: Borislav Petkov <bp@...en8.de>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Maarten Lankhorst <m.b.lankhorst@...il.com>,
Borislav Petkov <bp@...64.org>,
"Valdis.Kletnieks@...edu" <Valdis.Kletnieks@...edu>,
Ingo Molnar <mingo@...e.hu>,
melwyn lobo <linux.melwyn@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"H. Peter Anvin" <hpa@...or.com>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: x86 memcpy performance
On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote:
> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
> <m.b.lankhorst@...il.com> wrote:
> >
> > This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> > and I finally figured out why. I also extended the test to an optimized avx memcpy,
> > but I think the kernel memcpy will always win in the aligned case.
>
> "rep movs" is generally optimized in microcode on most modern Intel
> CPU's for some easyish cases, and it will outperform just about
> anything.
>
> Atom is a notable exception, but if you expect performance on any
> general loads from Atom, you need to get your head examined. Atom is a
> disaster for anything but tuned loops.
>
> The "easyish cases" depend on microarchitecture. They are improving,
> so long-term "rep movs" is the best way regardless, but for most
> current ones it's something like "source aligned to 8 bytes *and*
> source and destination are equal "mod 64"".
>
> And that's true in a lot of common situations. It's true for the page
> copy, for example, and it's often true for big user "read()/write()"
> calls (but "often" may not be "often enough" - high-performance
> userland should strive to align read/write buffers to 64 bytes, for
> example).
>
> Many other cases of "memcpy()" are the fairly small, constant-sized
> ones, where the optimal strategy tends to be "move words by hand".
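Yeah. For reference, the "rep movs" flavor under discussion boils down
to something like this (a minimal userspace C sketch with inline asm;
repmov_memcpy is a made-up name - the real thing is the hand-written
asm in arch/x86/lib/memcpy_64.S):

#include <stddef.h>

static void *repmov_memcpy(void *dst, const void *src, size_t len)
{
	void *ret = dst;
	size_t qwords = len >> 3;
	size_t rest   = len & 7;

	/* bulk: len/8 qword moves; RSI/RDI advance as we go, DF is
	 * assumed clear as the ABI guarantees */
	asm volatile("rep movsq"
		     : "+D" (dst), "+S" (src), "+c" (qwords)
		     : : "memory");

	/* tail: the remaining 0..7 bytes */
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (rest)
		     : : "memory");

	return ret;
}

The microcode looks at the count and the pointer alignment and picks
its fastest path on its own, which is where the alignment conditions
above come from.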
All of which probably makes enabling SSE memcpy in the kernel a task
with diminishing returns. There are also the additional costs of
saving/restoring the FPU context in the kernel, which eat into any SSE
speedup.
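Even in the best case, every kernel user of it would have to bracket
the copy like this (a minimal sketch - sse_memcpy is hypothetical,
kernel_fpu_begin()/kernel_fpu_end() are the real interfaces from
asm/i387.h):

#include <linux/types.h>
#include <asm/i387.h>

/* hypothetical unrolled SSE copy loop; only the bracketing matters here */
extern void sse_memcpy(void *dst, const void *src, size_t len);

static void sse_memcpy_checked(void *dst, const void *src, size_t len)
{
	kernel_fpu_begin();	/* saves FPU/SSE state, disables preemption */
	sse_memcpy(dst, src, len);
	kernel_fpu_end();	/* restores it */
}

and that save/restore is pure overhead which "rep movs" simply doesn't
have.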
And then there's the additional I$ pressure, because "rep movs" is
much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the
shortest vector moves I could use - two opcode bytes plus a ModRM byte
- and in the AVX case they grow to four bytes and more once the VEX
prefix and the additional SIB, size-override, etc. fields come into
play.
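To put rough numbers on the size difference (hand-checked encodings,
AT&T syntax):

	rep movsq		f3 48 a5	3 bytes
	movaps (%rsi),%xmm0	0f 28 06	3 bytes
	movaps %xmm0,(%rdi)	0f 29 07	3 bytes
	vmovaps (%rsi),%ymm0	c5 fc 28 06	4 bytes

IOW, a single 3-byte rep movsq replaces a whole unrolled block of
those load/store pairs.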
Oh, and then there's copy_*_user, which also does fault handling;
replacing that with an SSE version of memcpy could get quite hairy
quite fast.
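The hairy part being that every single access to user memory needs its
own exception table entry plus fixup code which reports how much was
left uncopied. A minimal sketch (assuming x86-64 and the current
two-pointer __ex_table format; get_user_16 is a made-up name) of what
just *one* faulting vector load would need:

static int get_user_16(void *dst, const void *src /* user pointer */)
{
	int ret = 0;

	asm volatile("1:	movups (%[src]), %%xmm0\n"
		     "	movups %%xmm0, (%[dst])\n"	/* kernel side, doesn't fault */
		     "2:\n"
		     ".section .fixup,\"ax\"\n"
		     "3:	movl $-14, %[ret]\n"	/* -EFAULT */
		     "	jmp 2b\n"
		     ".previous\n"
		     ".section __ex_table,\"a\"\n"
		     "	.align 8\n"
		     "	.quad 1b, 3b\n"		/* fault at 1: -> fixup at 3: */
		     ".previous\n"
		     : [ret] "+r" (ret)
		     : [src] "r" (src), [dst] "r" (dst)
		     : "memory", "xmm0");

	return ret;
}

Multiply that by an unrolled loop's worth of loads and stores, plus
the bookkeeping for the bytes which did make it before the fault, and
it adds up quickly.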
Anyway, I'll try to benchmark an asm version of SSE memcpy in the
kernel when I get the time, to see whether it still makes sense at
all.
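In the meantime, the userspace half of the comparison is the usual
rdtsc bracket (a quick sketch - no serializing cpuid/rdtscp, no CPU
pinning, no turbo control, so ballpark numbers only):

#include <stdio.h>
#include <string.h>

static inline unsigned long long rdtsc(void)
{
	unsigned int lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
	static char src[4096], dst[4096];
	unsigned long long t1, t2;
	int i;

	memcpy(dst, src, sizeof(src));		/* warm up caches */

	t1 = rdtsc();
	for (i = 0; i < 1000; i++)
		memcpy(dst, src, sizeof(src));	/* variant under test */
	t2 = rdtsc();

	printf("%llu cycles/copy\n", (t2 - t1) / 1000);
	return 0;
}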
Thanks.
--
Regards/Gruss,
Boris.