linux-kernel - Re: [PATCH] x86: only use ERMS for user copies for larger sizes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=wgnjB+=o0771=J_YkQzabU5aadh6pN3x9Vk4HPs3MHL3g@mail.gmail.com>
Date:   Wed, 21 Nov 2018 09:27:32 -0800
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     pabeni@...hat.com
Cc:     Jens Axboe <axboe@...nel.dk>, Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, bp@...en8.de,
        Peter Anvin <hpa@...or.com>,
        "the arch/x86 maintainers" <x86@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Andrew Lutomirski <luto@...nel.org>,
        Peter Zijlstra <a.p.zijlstra@...llo.nl>, dvlasenk@...hat.com,
        brgerst@...il.com,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] x86: only use ERMS for user copies for larger sizes

On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> In my experiments 64 bytes was the break even point for all the CPUs I
> had handy, but I guess that may change with other models.

Note that experiments with memcpy speed are almost invariably broken.
microbenchmarks don't show the impact of I$, but they also don't show
the impact of _behavior_.

For example, there might be things like "repeat strings do cacheline
optimizations" that end up meaning that cachelines stay in L2, for
example, and are never brought into L1. That can be a really good
thing, but it can also mean that now the result isn't as close to the
CPU, and the subsequent use of the cacheline can be costlier.

I say "go for upping the limit to 128 bytes".

That said, if the aio user copy is _so_ critical that it's this
noticeable, there may be other issues. Sometimes _real_ cost of small
user copies is often the STAC/CLAC, more so than the "rep movs".

It would be interesting to know exactly which copy it is that matters
so much...  *inlining* the erms case might show that nicely in
profiles.

               Linus