Date:   Wed, 21 Nov 2018 11:26:09 -0700
From:   Andy Lutomirski <luto@...capital.net>
To:     Jens Axboe <axboe@...nel.dk>, dave.hansen@...el.com
Cc:     Linus Torvalds <torvalds@...ux-foundation.org>, pabeni@...hat.com,
        Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, bp@...en8.de,
        Peter Anvin <hpa@...or.com>,
        the arch/x86 maintainers <x86@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andrew Lutomirski <luto@...nel.org>,
        Peter Zijlstra <a.p.zijlstra@...llo.nl>, dvlasenk@...hat.com,
        brgerst@...il.com,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes



> On Nov 21, 2018, at 11:04 AM, Jens Axboe <axboe@...nel.dk> wrote:
> 
>> On 11/21/18 10:27 AM, Linus Torvalds wrote:
>>> On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <pabeni@...hat.com> wrote:
>>> 
>>> In my experiments 64 bytes was the break-even point for all the CPUs I
>>> had handy, but I guess that may change with other models.
>> 
>> Note that experiments with memcpy speed are almost invariably broken.
>> Microbenchmarks don't show the impact of I$, but they also don't show
>> the impact of _behavior_.
>> 
>> For example, there might be things like "repeat strings do cacheline
>> optimizations" that end up meaning that cachelines stay in L2, for
>> example, and are never brought into L1. That can be a really good
>> thing, but it can also mean that now the result isn't as close to the
>> CPU, and the subsequent use of the cacheline can be costlier.
> 
> Totally agree, which is why all my testing was NOT microbenchmarking.
> 
>> I say "go for upping the limit to 128 bytes".
> 
> See below...
> 
>> That said, if the aio user copy is _so_ critical that it's this
>> noticeable, there may be other issues. Often the _real_ cost of small
>> user copies is the STAC/CLAC, more so than the "rep movs".
>> 
>> It would be interesting to know exactly which copy it is that matters
>> so much...  *inlining* the erms case might show that nicely in
>> profiles.
> 
> Oh I totally agree, which is why I have since gone a different route. The
> copy that matters is the copy_from_user() of the iocb, which is 64
> bytes. Even for 4k IOs, copying 64 bytes per IO is somewhat
> counterproductive for O_DIRECT.

Can we maybe use this as an excuse to ask for some reasonable instructions to access user memory?  Intel already did all the dirty work of giving something resembling sane semantics for the kernel doing a user-privileged access with WRUSS.  How about WRUSER, RDUSER, and maybe even the REP variants?  And I suppose LOCK CMPXCHGUSER.

Or Intel could try to make STAC and CLAC genuinely fast (0 or 1 cycles and no stalls *ought* to be possible if they were handled in the front end, as long as there aren’t any PUSHF or POPF instructions in the pipeline).  As it stands, I assume that both instructions prevent any following memory accesses from starting until they retire, and they might even be nastily microcoded to handle the overloading of AC.
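
(To make the STAC/CLAC point concrete, here is roughly what an SMAP-guarded
user access has to do today -- a simplified sketch, not the kernel's actual
uaccess code; real code also sets up exception-table entries for faults:)

/*
 * Simplified sketch of an SMAP-bracketed user read: the stac/clac pair
 * brackets even a single 8-byte load, and the CPU may serialize memory
 * accesses around it.  No __user annotation or fault handling here.
 */
static inline unsigned long read_user_word_sketch(const unsigned long *uaddr)
{
	unsigned long val;

	asm volatile("stac" ::: "memory");	/* open the user-access window */
	val = *uaddr;				/* the access itself */
	asm volatile("clac" ::: "memory");	/* close it again */

	return val;
}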
