linux-kernel - Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=whtDxahdzn4yLP_3BNb496AQ0y5QrE36JVLUkqRM+un5A@mail.gmail.com>
Date:   Wed, 15 Nov 2023 14:09:40 -0500
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     David Howells <dhowells@...hat.com>
Cc:     kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
        lkp@...el.com, linux-kernel@...r.kernel.org,
        Christian Brauner <brauner@...nel.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>,
        Christian Brauner <christian@...uner.io>,
        Matthew Wilcox <willy@...radead.org>,
        David Laight <David.Laight@...lab.com>, ying.huang@...el.com,
        feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput
 -16.9% regression

On Wed, 15 Nov 2023 at 13:45, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> Do you perhaps have CONFIG_CC_OPTIMIZE_FOR_SIZE set? That makes gcc
> use "rep movsb" - even for small copies that most definitely should
> *not* use "rep movsb".

Just to give some background an an example:

        __builtin_memcpy(dst, src, 24);

with -O2 is done as three 64-bit move instructions (well, three in
both direction, so six instructions total), and with -Os you get

        movl $6, %ecx
        rep movsl

instead.  And no, this isn't all that uncommon, because things like
the above is what happens when you copy a small structure around.

And that "rep movsl" is indeed nice and small, but it's truly
horrendously bad from a performance angle on most cores, compared to
the six instructions that can schedule nicely and take a cycle or two.

There are some other cases of similar "-Os generates unacceptable
code". For example, dividing by a constant - when you use -Os, gcc
thinks that it's perfectly fine to actually generate a divide
instruction, because it is indeed small.

But in most cases you really *really* want to use a "multiply by
reciprocal" even though it generates bigger code. Again, it ends up
depending on microarchitecture, and modern cores tend to do better on
divides, but it's another of those things where saving a copuple of
bytes of code space is not the right choice if it means that you use a
slow divider.

And again, those "divide by constant" often happen in implicit
contexts (ie the constant may be the size of a structure, and the
divide is due to taking a pointer difference). Let's say you have a
structure that isn't a power of two, but is (to pick a random but not
unlikely value) is 56 bytes in size.

The code generation for -O2 is (value in %rdi)

        movabsq $2635249153387078803, %rax
        shrq $3, %rdi
        mulq %rdi

and for -Os you get (value in %rax):

        movl $56, %ecx
        xorl %edx, %edx
        divq %rcx

and that 'divq' is certainly again smaller and more obvious, but again
we're talking "single cycles" vs "potentially 50+ cycles" depending on
uarch.

                  Linus