linux-kernel - Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput -16.9% regression

Message-ID: <CAHk-=wjCUckvZUQf7gqp2ziJUWxVpikM_6srFdbcNdBJTxExRg@mail.gmail.com>
Date:   Wed, 15 Nov 2023 12:38:38 -0500
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     David Howells <dhowells@...hat.com>
Cc:     kernel test robot <oliver.sang@...el.com>, oe-lkp@...ts.linux.dev,
        lkp@...el.com, linux-kernel@...r.kernel.org,
        Christian Brauner <brauner@...nel.org>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Jens Axboe <axboe@...nel.dk>, Christoph Hellwig <hch@....de>,
        Christian Brauner <christian@...uner.io>,
        Matthew Wilcox <willy@...radead.org>,
        David Laight <David.Laight@...lab.com>, ying.huang@...el.com,
        feng.tang@...el.com, fengwei.yin@...el.com
Subject: Re: [linus:master] [iov_iter] c9eec08bac: vm-scalability.throughput
 -16.9% regression

On Wed, 15 Nov 2023 at 11:53, Linus Torvalds
<torvalds@...ux-foundation.org> wrote:
>
> I wonder if gcc somehow decided to inline "memcpy()" in
> memcpy_from_iter() as a "rep movsb" because of other inlining changes?
>
> [ Goes out to look ]
>
> Yup, I think that's exactly what happened. Gcc seems to decide that it
> might be a small memcpy(), and seems to do at least part of it
> directly.
>
> So I *think* this all is mainly an artifact of gcc having changed code
> generation due to the code re-organization.

The gcc code generation here is *really* odd. I've never seen this
before, so it may be specific to newer versions of gcc. I see code like
this:

# lib/iov_iter.c:73:    memcpy(to + progress, iter_from, len);
        cmpl    $8, %edx        #, _88
        jb      .L400   #,
        movq    (%rsi), %rax    #, tmp288
        movq    %rax, (%rcx)    # tmp288,
        movl    %edx, %eax      # _88, _88
        movq    -8(%rsi,%rax), %rdi     #, tmp295
        movq    %rdi, -8(%rcx,%rax)     # tmp295,
        leaq    8(%rcx), %rdi   #, tmp296
        andq    $-8, %rdi       #, tmp296
        subq    %rdi, %rcx      # tmp296, tmp268
        subq    %rcx, %rsi      # tmp268, tmp269
        addl    %edx, %ecx      # _88, _88
        shrl    $3, %ecx        #,
        rep movsq
        jmp     .L392   #

.L398:
# lib/iov_iter.c:73:    memcpy(to + progress, iter_from, len);
        movl    (%rsi), %eax    #, tmp271
        movl    %eax, (%rcx)    # tmp271,
        movl    %edx, %eax      # _88, _88
        movl    -4(%rsi,%rax), %esi     #, tmp278
        movl    %esi, -4(%rcx,%rax)     # tmp278,
        movl    8(%r9), %edi    # p_72->bv_len, p_72->bv_len
        jmp     .L330   #
...

.L400:
# lib/iov_iter.c:73:    memcpy(to + progress, iter_from, len);
        testb   $4, %dl #, _88
        jne     .L398   #,
        testl   %edx, %edx      # _88
        je      .L330   #,
        movzbl  (%rsi), %eax    #, tmp279
        movb    %al, (%rcx)     # tmp279,
        testb   $2, %dl #, _88
        jne     .L390   #,
...

which makes *zero* sense. It first checks that the length is at
least 8 bytes, then it moves *one* word by hand, then it aligns the
destination to 8 bytes, and does the remaining (possibly
overlapping at the beginning) words as one "rep movsq".
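
For reference, here is a rough C rendering of that "len >= 8" path. It is a sketch of what the listing appears to do, not kernel code and not what gcc works from internally. The name copy_ge8() is hypothetical, and a plain loop stands in for "rep movsq". Note that the listing also stores the last 8 bytes by hand (the -8(%rsi,%rax) pair), which is what covers a tail that is not a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: mirrors the "len >= 8"
 * path of the assembly listing. src and dst must not overlap.
 */
void copy_ge8(unsigned char *dst, const unsigned char *src, size_t len)
{
	uint64_t word;
	unsigned char *aligned;
	size_t skip, nwords, i;

	assert(len >= 8);	/* cmpl $8,%edx; jb .L400 */

	/* movq (%rsi),%rax; movq %rax,(%rcx): first 8 bytes, unaligned */
	memcpy(&word, src, 8);
	memcpy(dst, &word, 8);

	/* movq -8(%rsi,%rax),%rdi; movq %rdi,-8(%rcx,%rax): last 8 bytes */
	memcpy(&word, src + len - 8, 8);
	memcpy(dst + len - 8, &word, 8);

	/* leaq 8(%rcx),%rdi; andq $-8,%rdi: next 8-byte boundary past dst */
	aligned = (unsigned char *)(((uintptr_t)dst + 8) & ~(uintptr_t)7);
	skip = (size_t)(aligned - dst);	/* 1..8 bytes, already copied above */
	nwords = (len - skip) >> 3;	/* addl %edx,%ecx; shrl $3,%ecx */

	/* Stand-in for "rep movsq": destination aligned, source maybe not */
	for (i = 0; i < nwords; i++) {
		memcpy(&word, src + skip + 8 * i, 8);
		memcpy(aligned + 8 * i, &word, 8);
	}
}
```

The aligned run can rewrite up to 7 bytes that the first store already wrote, which is the "possibly overlapping at the beginning" part.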

And L398 is the "I have 4..7 bytes to copy" target.

And L400 seems to be "I have 0..7 bytes to copy".
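
The two short-copy targets can be sketched in C the same way. Again this is only an illustration with a hypothetical name, copy_lt8(). The .L390 target is elided in the listing, so the 2-byte tail store below is an assumption about what it does, not something the listing shows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: mirrors the .L400/.L398
 * dispatch for copies shorter than 8 bytes. src and dst must not overlap.
 */
void copy_lt8(unsigned char *dst, const unsigned char *src, size_t len)
{
	assert(len < 8);

	if (len & 4) {		/* testb $4,%dl; jne .L398 */
		/* .L398: 4..7 bytes as two possibly overlapping 32-bit moves */
		uint32_t w;

		memcpy(&w, src, 4);
		memcpy(dst, &w, 4);
		memcpy(&w, src + len - 4, 4);
		memcpy(dst + len - 4, &w, 4);
		return;
	}

	if (len == 0)		/* testl %edx,%edx; je .L330 */
		return;

	dst[0] = src[0];	/* movzbl (%rsi),%eax; movb %al,(%rcx) */

	if (len & 2) {		/* testb $2,%dl; jne .L390 */
		/* ASSUMPTION: .L390 is not shown; copy the last 2 bytes */
		uint16_t h;

		memcpy(&h, src + len - 2, 2);
		memcpy(dst + len - 2, &h, 2);
	}
}
```

So a copy of at most 7 bytes costs up to three data-dependent branches before any byte moves.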

This is literally insane. And it seems to be all just gcc having for
some reason decided to do this instead of "rep movsb" or calling an
out-of-line function.

I get the feeling that this is related to how your patches made that
function be an inline function that is inlined through a function
pointer. I suspect that what happens is that gcc expands the memcpy()
first into that inlined function (without caller context), and then
inserts the crazily expanded inline later into the context of that
function pointer.
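
To make the "inlined through a function pointer" shape concrete, here is a much simplified sketch. Every name in it (struct seg, step_fn, memcpy_step, iterate_segs, copy_from_segs) is hypothetical and this is not the lib/iov_iter.c code; it only reproduces the structure, where the memcpy() sits in a small always-inline step function whose address is handed to an always-inline iterator. The attribute syntax assumes gcc or clang.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical segment descriptor, standing in for the real iterator */
struct seg {
	const void *base;
	size_t len;
};

/* Step callback: returns the number of bytes it did NOT copy */
typedef size_t (*step_fn)(const void *from, size_t progress,
			  size_t len, void *to);

static inline __attribute__((always_inline))
size_t memcpy_step(const void *from, size_t progress, size_t len, void *to)
{
	/* The compiler sees this memcpy() with no idea of typical len */
	memcpy((char *)to + progress, from, len);
	return 0;
}

static inline __attribute__((always_inline))
size_t iterate_segs(const struct seg *segs, size_t nsegs, size_t max,
		    void *to, step_fn step)
{
	size_t progress = 0, i;

	for (i = 0; i < nsegs && progress < max; i++) {
		size_t part = segs[i].len;
		size_t remain;

		if (part > max - progress)
			part = max - progress;
		/* Indirect call; the caller below makes it a known target */
		remain = step(segs[i].base, progress, part, to);
		progress += part - remain;
		if (remain)
			break;
	}
	assert(progress <= max);
	return progress;
}

size_t copy_from_segs(void *to, const struct seg *segs, size_t nsegs,
		      size_t max)
{
	/* step is a compile-time constant here, so it can be inlined */
	return iterate_segs(segs, nsegs, max, to, memcpy_step);
}
```

Whether the memcpy() gets expanded before or after it lands in the caller is up to the compiler, so this sketch shows the pattern under suspicion, not a guaranteed reproducer of the odd code generation.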

I dunno. I really only say that because I haven't seen gcc make this
kind of mess before, and that "inlined through a function pointer" is
the main unusual thing here.

How very annoying.

                 Linus