linux-kernel - Re: [RFC] Improve memset

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=wjdpJ+VapXfoZE8JRUfvMb8JrVTZe0=TDFYZ-ke+uqBOA@mail.gmail.com>
Date:   Mon, 16 Sep 2019 10:25:25 -0700
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     Rasmus Villemoes <linux@...musvillemoes.dk>
Cc:     Borislav Petkov <bp@...en8.de>,
        Rasmus Villemoes <mail@...musvillemoes.dk>,
        x86-ml <x86@...nel.org>, Andy Lutomirski <luto@...nel.org>,
        Josh Poimboeuf <jpoimboe@...hat.com>,
        lkml <linux-kernel@...r.kernel.org>
Subject: Re: [RFC] Improve memset

On Mon, Sep 16, 2019 at 2:18 AM Rasmus Villemoes
<linux@...musvillemoes.dk> wrote:
>
> Eh, this benchmark doesn't seem to provide any hints on where to set the
> cut-off for a compile-time constant n, i.e. the 32 in

Yes, you'd need to use proper fixed-size memset's with
__builtin_memset() to test that case. Probably easy enough with some
preprocessor macros to expand to a lot of cases.

But even then it will not show some of the advantages of inlining the
memset (quite often you have a "memset structure to zero, then
initialize a couple of fields" pattern, and gcc does much better for
that when it just inlines the memset to stores - to the point of just
removing all the memset entirely and just storing a couple of zeroes
between the fields you initialized).

So the "inline constant sizes" case has advantages over and beyond the
obvious ones. I suspect that a reasonable cut-off point is somethinig
like "8*sizeof(long)". But look at things like "struct kstat" uses
etc, the limit might actually be even higher than that.

Also note that while "rep stosb" is _reasonably_ good with current
CPU's (ie roughly gen 8+), it's not so great a few generations ago
(gen 6ish), and it can be absolutely horrid on older cores and/or
atom. The limit for when it is a win ends up depending on whether I$
footprint is an issue too, of course, but some of the bigger wins tend
to happen when you have sizes >= 128.

You can basically always beat "rep movs/stos" with hand-tuned AVX2/512
code for specific cases if you don't look at I$ footprint and the cost
of the AVX setup (and the cost of frequency changes, which often go
hand-in-hand with the AVX use). So "rep movs/stos" is seldom
_optimal_, but it tends to be "quite good" for modern CPU's with
variable sizes that are in the 100+ byte range.

             Linus