Message-ID: <658cdb28-e3e5-c0af-368f-c26daf9986ac@kernel.dk>
Date: Wed, 21 Nov 2018 11:04:54 -0700
From: Jens Axboe <axboe@...nel.dk>
To: Linus Torvalds <torvalds@...ux-foundation.org>, pabeni@...hat.com
Cc: Ingo Molnar <mingo@...nel.org>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, bp@...en8.de,
Peter Anvin <hpa@...or.com>,
the arch/x86 maintainers <x86@...nel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Andrew Lutomirski <luto@...nel.org>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>, dvlasenk@...hat.com,
brgerst@...il.com,
Linux List Kernel Mailing <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes
On 11/21/18 10:27 AM, Linus Torvalds wrote:
> On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <pabeni@...hat.com> wrote:
>>
>> In my experiments 64 bytes was the break even point for all the CPUs I
>> had handy, but I guess that may change with other models.
>
> Note that experiments with memcpy speed are almost invariably broken.
> microbenchmarks don't show the impact of I$, but they also don't show
> the impact of _behavior_.
>
> For example, there might be things like "repeat strings do cacheline
> optimizations" that end up meaning that cachelines stay in L2, for
> example, and are never brought into L1. That can be a really good
> thing, but it can also mean that now the result isn't as close to the
> CPU, and the subsequent use of the cacheline can be costlier.

Totally agree, which is why all my testing was NOT microbenchmarking.

> I say "go for upping the limit to 128 bytes".

See below...

> That said, if the aio user copy is _so_ critical that it's this
> noticeable, there may be other issues. Sometimes _real_ cost of small
> user copies is often the STAC/CLAC, more so than the "rep movs".
>
> It would be interesting to know exactly which copy it is that matters
> so much... *inlining* the erms case might show that nicely in
> profiles.

Oh, I totally agree, which is why I have since gone a different route.
The copy that matters is the copy_from_user() of the iocb, which is 64
bytes. Even for 4k IOs, copying 64 bytes per IO is somewhat
counterproductive for O_DIRECT.

I'm playing around with this:

http://git.kernel.dk/cgit/linux-block/commit/?h=aio-poll&id=ed0a0a445c0af4cfd18b0682511981eaf352d483

since we're doing a new sys_io_setup2() for polled aio anyway. This
completely avoids the iocb copy, but that only addresses my particular
initial gripe.

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index db4e5aa0858b..21c4d68c5fac 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -175,8 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
  */
 ENTRY(copy_user_enhanced_fast_string)
 	ASM_STAC
-	cmpl $64,%edx
-	jb .L_copy_short_string	/* less then 64 bytes, avoid the costly 'rep' */
+	cmpl $128,%edx
+	jb .L_copy_short_string	/* less than 128 bytes, avoid costly 'rep' */
 	movl %edx,%ecx
 1:	rep
 	movsb
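
For anyone who doesn't read the assembly fluently, the branch above can be
sketched roughly in userspace C as below. This is only an illustration
under stated assumptions, not the kernel code: BULK_COPY_THRESHOLD,
copy_short() and copy_with_threshold() are made-up names, memcpy() stands
in for "rep movsb", and there is no STAC/CLAC or fault handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff mirroring the 128-byte limit in the patch. */
#define BULK_COPY_THRESHOLD 128

/* Plain byte loop; stands in for the .L_copy_short_string path. */
static void copy_short(unsigned char *dst, const unsigned char *src, size_t n)
{
	while (n--)
		*dst++ = *src++;
}

/*
 * Copy n bytes, picking the path by size the way the cmpl/jb pair does.
 * Returns 1 if the bulk path was taken, 0 if the short path was taken.
 */
static int copy_with_threshold(void *dst, const void *src, size_t n)
{
	assert(dst != NULL && src != NULL);

	if (n < BULK_COPY_THRESHOLD) {	/* "jb": unsigned below */
		copy_short(dst, src, n);
		return 0;
	}
	memcpy(dst, src, n);		/* stand-in for "rep movsb" */
	return 1;
}
```

With the cutoff at 128, a 64-byte copy such as the iocb takes the short
path, while a copy of exactly 128 bytes still takes the bulk path.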
--
Jens Axboe