Message-ID: <20110908083551.GA5646@liondog.tnic>
Date: Thu, 8 Sep 2011 10:35:51 +0200
From: Borislav Petkov <bp@...en8.de>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Maarten Lankhorst <m.b.lankhorst@...il.com>,
Borislav Petkov <bp@...64.org>,
"Valdis.Kletnieks@...edu" <Valdis.Kletnieks@...edu>,
Ingo Molnar <mingo@...e.hu>,
melwyn lobo <linux.melwyn@...il.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"H. Peter Anvin" <hpa@...or.com>,
Thomas Gleixner <tglx@...utronix.de>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>
Subject: Re: x86 memcpy performance
On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote:
> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
> <m.b.lankhorst@...il.com> wrote:
> >
> > This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> > and I finally figured out why. I also extended the test to an optimized avx memcpy,
> > but I think the kernel memcpy will always win in the aligned case.
>
> "rep movs" is generally optimized in microcode on most modern Intel
> CPU's for some easyish cases, and it will outperform just about
> anything.
>
> Atom is a notable exception, but if you expect performance on any
> general loads from Atom, you need to get your head examined. Atom is a
> disaster for anything but tuned loops.
>
> The "easyish cases" depend on microarchitecture. They are improving,
> so long-term "rep movs" is the best way regardless, but for most
> current ones it's something like "source aligned to 8 bytes *and*
> source and destination are equal "mod 64"".
>
> And that's true in a lot of common situations. It's true for the page
> copy, for example, and it's often true for big user "read()/write()"
> calls (but "often" may not be "often enough" - high-performance
> userland should strive to align read/write buffers to 64 bytes, for
> example).
>
> Many other cases of "memcpy()" are the fairly small, constant-sized
> ones, where the optimal strategy tends to be "move words by hand".
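Yeah. For reference, the "rep movs" flavor under discussion boils down
to something like this (a minimal userspace C sketch with inline asm;
repmov_memcpy is a made-up name - the real thing is the hand-written
asm in arch/x86/lib/memcpy_64.S):

#include <stddef.h>

static void *repmov_memcpy(void *dst, const void *src, size_t len)
{
	void *ret = dst;
	size_t qwords = len >> 3;
	size_t rest   = len & 7;

	/* bulk: len/8 qword moves; RSI/RDI advance as we go, DF is
	 * assumed clear as the ABI guarantees */
	asm volatile("rep movsq"
		     : "+D" (dst), "+S" (src), "+c" (qwords)
		     : : "memory");

	/* tail: the remaining 0..7 bytes */
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (rest)
		     : : "memory");

	return ret;
}

The microcode looks at the count and the pointer alignment and picks
its fastest path on its own, which is where the alignment conditions
above come from.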
All of which probably makes enabling SSE memcpy in the kernel a task
with diminishing returns. There are also the additional costs of
saving/restoring the FPU context in the kernel, which eat into any SSE
speedup.
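Even in the best case, every kernel user of it would have to bracket
the copy like this (a minimal sketch - sse_memcpy is hypothetical,
kernel_fpu_begin()/kernel_fpu_end() are the real interfaces from
asm/i387.h):

#include <linux/types.h>
#include <asm/i387.h>

/* hypothetical unrolled SSE copy loop; only the bracketing matters here */
extern void sse_memcpy(void *dst, const void *src, size_t len);

static void sse_memcpy_checked(void *dst, const void *src, size_t len)
{
	kernel_fpu_begin();	/* saves FPU/SSE state, disables preemption */
	sse_memcpy(dst, src, len);
	kernel_fpu_end();	/* restores it */
}

and that save/restore is pure overhead which "rep movs" simply doesn't
have.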
And then there's the additional I$ pressure, because "rep movs" is
much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the
shortest vector moves I could use - two opcode bytes plus a ModRM byte
- and in the AVX case they grow to four bytes and more once the VEX
prefix and the additional SIB, size-override, etc. fields come into
play.
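To put rough numbers on the size difference (hand-checked encodings,
AT&T syntax):

	rep movsq		f3 48 a5	3 bytes
	movaps (%rsi),%xmm0	0f 28 06	3 bytes
	movaps %xmm0,(%rdi)	0f 29 07	3 bytes
	vmovaps (%rsi),%ymm0	c5 fc 28 06	4 bytes

IOW, a single 3-byte rep movsq replaces a whole unrolled block of
those load/store pairs.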
Oh, and then there's copy_*_user, which also does fault handling;
replacing that with an SSE version of memcpy could get quite hairy
quite fast.
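The hairy part being that every single access to user memory needs its
own exception table entry plus fixup code which reports how much was
left uncopied. A minimal sketch (assuming x86-64 and the current
two-pointer __ex_table format; get_user_16 is a made-up name) of what
just *one* faulting vector load would need:

static int get_user_16(void *dst, const void *src /* user pointer */)
{
	int ret = 0;

	asm volatile("1:	movups (%[src]), %%xmm0\n"
		     "	movups %%xmm0, (%[dst])\n"	/* kernel side, doesn't fault */
		     "2:\n"
		     ".section .fixup,\"ax\"\n"
		     "3:	movl $-14, %[ret]\n"	/* -EFAULT */
		     "	jmp 2b\n"
		     ".previous\n"
		     ".section __ex_table,\"a\"\n"
		     "	.align 8\n"
		     "	.quad 1b, 3b\n"		/* fault at 1: -> fixup at 3: */
		     ".previous\n"
		     : [ret] "+r" (ret)
		     : [src] "r" (src), [dst] "r" (dst)
		     : "memory", "xmm0");

	return ret;
}

Multiply that by an unrolled loop's worth of loads and stores, plus
the bookkeeping for the bytes which did make it before the fault, and
it adds up quickly.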
Anyway, I'll try to benchmark an asm version of SSE memcpy in the
kernel when I get the time, to see whether it still makes sense at
all.
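In the meantime, the userspace half of the comparison is the usual
rdtsc bracket (a quick sketch - no serializing cpuid/rdtscp, no CPU
pinning, no turbo control, so ballpark numbers only):

#include <stdio.h>
#include <string.h>

static inline unsigned long long rdtsc(void)
{
	unsigned int lo, hi;

	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
	static char src[4096], dst[4096];
	unsigned long long t1, t2;
	int i;

	memcpy(dst, src, sizeof(src));		/* warm up caches */

	t1 = rdtsc();
	for (i = 0; i < 1000; i++)
		memcpy(dst, src, sizeof(src));	/* variant under test */
	t2 = rdtsc();

	printf("%llu cycles/copy\n", (t2 - t1) / 1000);
	return 0;
}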
Thanks.
--
Regards/Gruss,
Boris.