linux-kernel - Re: [x86] d55564cfc2: will-it-scale.per_thread

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAHk-=wgQ5EEH3-GTK9KDB5mBmWjP25YHXC6_-V_KfWd0UTDTDQ@mail.gmail.com>
Date:   Thu, 7 Jan 2021 09:43:54 -0800
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     kernel test robot <oliver.sang@...el.com>
Cc:     Al Viro <viro@...iv.linux.org.uk>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>, Borislav Petkov <bp@...en8.de>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>, lkp@...ts.01.org,
        kernel test robot <lkp@...el.com>,
        "Huang, Ying" <ying.huang@...el.com>,
        Feng Tang <feng.tang@...el.com>, zhengjun.xing@...el.com
Subject: Re: [x86] d55564cfc2: will-it-scale.per_thread_ops -5.8% regression

On Thu, Jan 7, 2021 at 5:32 AM kernel test robot <oliver.sang@...el.com> wrote:
>
> FYI, we noticed a -5.8% regression of will-it-scale.per_thread_ops due to commit:

Ok, that's noticeable.

And:

> commit: d55564cfc222326e944893eff0c4118353e349ec ("x86: Make __put_user() generate an out-of-line call")

Yeah, that wasn't supposed to cause any performance regressions. No
core code should use __put_user() so much.

But:

> | testcase: change | will-it-scale: will-it-scale.per_process_ops -7.3% regression             |
> | test machine     | 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
> | test parameters  | cpufreq_governor=performance                                              |
> |                  | mode=process                                                              |
> |                  | nr_task=100%                                                              |
> |                  | test=poll2                                                                |
> |                  | ucode=0x16                                                                |

Ok, it's poll(), and it's definitely the __put_user() there:

>       0.00            +1.8        1.77 ą  3%  perf-profile.children.cycles-pp.__put_user_nocheck_2
>       0.00            +1.6        1.64 ą  3%  perf-profile.self.cycles-pp.__put_user_nocheck_2

And in fact, it's that final "write back the 16-bit revents field" at the end.

Which must have sucked before too, because it used to do a "stac/clac"
for every word - but now it does it out of line.

The fix is to convert that loop to use "unsafe_put_user()" with the
necessary accoutrements around it, and that should speed things up
quite nicely. The (double) loop itself is actually just 14
instructions, it's ridiculous how bad the code used to be, and how
much better it is with the nice unsafe_put_user(). The whole double
loop ends up being just

        lea    0x68(%rsp),%rsi
        mov    %rcx,%rax
  1:    mov    0x8(%rsi),%ecx
        lea    0xc(%rsi),%rdx
        test   %ecx,%ecx
        je     3f
        lea    (%rax,%rcx,8),%rdi
  2:    movzwl 0x6(%rdx),%ecx
        mov    %cx,0x6(%rax)
        add    $0x8,%rax
        add    $0x8,%rdx
        cmp    %rdi,%rax
        jne    2b
  3:    mov    (%rsi),%rsi
        test   %rsi,%rsi
        jne    1b

with the attached patch.

Before, it would do the whole CLAC/STAC dance inside that loop for
every entry (and with that commit d55564cfc22 it would be a function
call, of course).

Can you verify that this fixes the regression (and in fact I'd expect
it to improve that test-case)?

NOTE! The patch is entirely untested. I verified that the code
generation now looks sane, and it all looks ObviouslyCorrect(tm) to
me, but mistakes happen and maybe I missed some detail..

               Linus

Download attachment "patch" of type "application/octet-stream" (1078 bytes)