Message-ID: <CAHk-=wjdpJ+VapXfoZE8JRUfvMb8JrVTZe0=TDFYZ-ke+uqBOA@mail.gmail.com>
Date: Mon, 16 Sep 2019 10:25:25 -0700
From: Linus Torvalds <torvalds@...ux-foundation.org>
To: Rasmus Villemoes <linux@...musvillemoes.dk>
Cc: Borislav Petkov <bp@...en8.de>,
Rasmus Villemoes <mail@...musvillemoes.dk>,
x86-ml <x86@...nel.org>, Andy Lutomirski <luto@...nel.org>,
Josh Poimboeuf <jpoimboe@...hat.com>,
lkml <linux-kernel@...r.kernel.org>
Subject: Re: [RFC] Improve memset
On Mon, Sep 16, 2019 at 2:18 AM Rasmus Villemoes
<linux@...musvillemoes.dk> wrote:
>
> Eh, this benchmark doesn't seem to provide any hints on where to set the
> cut-off for a compile-time constant n, i.e. the 32 in
Yes, you'd need to use proper fixed-size memset's with
__builtin_memset() to test that case. Probably easy enough with some
preprocessor macros to expand to a lot of cases.
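A minimal sketch of that macro expansion, assuming GCC or Clang. The size list, the memset_fixed_N names, and the fixed_cases table are hypothetical and only there for illustration; the idea is that each generated function hands __builtin_memset() a compile-time constant size, so a benchmark can time each size separately.

```c
#include <stddef.h>

/* Hypothetical list of sizes to test; extend as needed. */
#define FIXED_SIZES(X) X(8) X(16) X(24) X(32) X(48) X(64) X(96) X(128)

/*
 * One function per size, so the compiler sees a constant n in each.
 * noinline keeps every case as a separate, measurable function.
 */
#define DEFINE_FIXED_MEMSET(n)                              \
    __attribute__((noinline))                               \
    static void memset_fixed_##n(void *p, int c)            \
    {                                                       \
        __builtin_memset(p, c, n);                          \
    }
FIXED_SIZES(DEFINE_FIXED_MEMSET)

/* Table of (size, function) pairs a benchmark loop could walk. */
struct fixed_case {
    size_t size;
    void (*fn)(void *p, int c);
};

#define TABLE_ENTRY(n) { n, memset_fixed_##n },
static const struct fixed_case fixed_cases[] = { FIXED_SIZES(TABLE_ENTRY) };

#define NUM_FIXED_CASES (sizeof(fixed_cases) / sizeof(fixed_cases[0]))
```

A benchmark would loop over fixed_cases and time fixed_cases[i].fn(buf, 0) for each entry. As the next paragraph points out, this still measures the memset in isolation, not the benefit of merging it with neighbouring stores.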
But even then it will not show some of the advantages of inlining the
memset (quite often you have a "memset structure to zero, then
initialize a couple of fields" pattern, and gcc does much better for
that when it just inlines the memset to stores - to the point of just
removing all the memset entirely and just storing a couple of zeroes
between the fields you initialized).
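The pattern described above looks roughly like the following. This is only an illustration: struct sample_stat and sample_stat_init() are made-up names, not kernel code, and whether the compiler actually drops the redundant zero stores depends on the compiler version and optimization level.

```c
#include <string.h>

/* Hypothetical structure, used only to illustrate the pattern. */
struct sample_stat {
    unsigned long ino;
    unsigned int mode;
    unsigned int nlink;
    unsigned long size;
    unsigned long blocks;
    unsigned long reserved[4];
};

static void sample_stat_init(struct sample_stat *st,
                             unsigned long ino, unsigned long size)
{
    /* Constant size: the compiler may expand this into plain stores. */
    memset(st, 0, sizeof(*st));

    /*
     * These stores overwrite part of what was just zeroed. When the
     * memset is inlined, the compiler can skip zeroing these fields
     * and only store zeroes to the remaining ones.
     */
    st->ino = ino;
    st->size = size;
}
```

Comparing the generated assembly of this function built with and without -fno-builtin-memset is one way to see the difference.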
So the "inline constant sizes" case has advantages over and beyond the
obvious ones. I suspect that a reasonable cut-off point is something
like "8*sizeof(long)". But look at things like "struct kstat" uses
etc, the limit might actually be even higher than that.
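One way such a cut-off could be expressed, as a sketch assuming GCC or Clang. The names my_memset, INLINE_MEMSET_CUTOFF, memset_out_of_line and out_of_line_calls are hypothetical, and the 8*sizeof(long) value is just the guess from the text, not a measured number.

```c
#include <stddef.h>
#include <string.h>

/* Cut-off guessed in the text; the right value may well be higher. */
#define INLINE_MEMSET_CUTOFF (8 * sizeof(long))

/* Counts calls to the out-of-line path, for illustration only. */
static unsigned long out_of_line_calls;

/* Hypothetical stand-in for the real out-of-line memset. */
__attribute__((noinline))
static void *memset_out_of_line(void *p, int c, size_t n)
{
    out_of_line_calls++;
    return memset(p, c, n);
}

/*
 * Small compile-time constant sizes go to __builtin_memset(), which
 * the compiler can expand into stores. Everything else calls the
 * out-of-line function. Note that n is evaluated more than once.
 */
#define my_memset(p, c, n)                                        \
    ((__builtin_constant_p(n) && (n) <= INLINE_MEMSET_CUTOFF)     \
         ? __builtin_memset((p), (c), (n))                        \
         : memset_out_of_line((p), (c), (n)))
```

With this, my_memset(buf, 0, 16) takes the inline path, while my_memset(buf, 0, 256) or a call with a run-time length goes out of line.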
Also note that while "rep stosb" is _reasonably_ good with current
CPU's (ie roughly gen 8+), it's not so great a few generations ago
(gen 6ish), and it can be absolutely horrid on older cores and/or
atom. The limit for when it is a win ends up depending on whether I$
footprint is an issue too, of course, but some of the bigger wins tend
to happen when you have sizes >= 128.
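For reference, a bare "rep stosb" memset can be sketched with GCC/Clang inline assembly as below. The function name memset_rep_stosb is hypothetical, and this is not the kernel's implementation. It assumes the direction flag is clear, which the x86 calling conventions guarantee on function entry, and it falls back to plain memset() on other architectures.

```c
#include <stddef.h>
#include <string.h>

static inline void *memset_rep_stosb(void *dst, int c, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    void *d = dst;

    /*
     * rep stosb stores AL to [(R|E)DI], (R|E)CX times, advancing DI
     * and decrementing CX. Both registers are modified, hence "+".
     */
    __asm__ volatile("rep stosb"
                     : "+D"(d), "+c"(n)
                     : "a"(c)
                     : "memory");
    return dst;
#else
    /* Not x86: nothing to illustrate, use the library routine. */
    return memset(dst, c, n);
#endif
}
```

The whole routine is one instruction plus register setup, which is why its I$ footprint is so small compared with an unrolled or vectorized loop.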
You can basically always beat "rep movs/stos" with hand-tuned AVX2/512
code for specific cases if you don't look at I$ footprint and the cost
of the AVX setup (and the cost of frequency changes, which often go
hand-in-hand with the AVX use). So "rep movs/stos" is seldom
_optimal_, but it tends to be "quite good" for modern CPU's with
variable sizes that are in the 100+ byte range.
Linus