lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAHk-=wgQ5EEH3-GTK9KDB5mBmWjP25YHXC6_-V_KfWd0UTDTDQ@mail.gmail.com>
Date:   Thu, 7 Jan 2021 09:43:54 -0800
From:   Linus Torvalds <torvalds@...ux-foundation.org>
To:     kernel test robot <oliver.sang@...el.com>
Cc:     Al Viro <viro@...iv.linux.org.uk>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>, Borislav Petkov <bp@...en8.de>,
        Peter Zijlstra <peterz@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>, lkp@...ts.01.org,
        kernel test robot <lkp@...el.com>,
        "Huang, Ying" <ying.huang@...el.com>,
        Feng Tang <feng.tang@...el.com>, zhengjun.xing@...el.com
Subject: Re: [x86] d55564cfc2: will-it-scale.per_thread_ops -5.8% regression

On Thu, Jan 7, 2021 at 5:32 AM kernel test robot <oliver.sang@...el.com> wrote:
>
> FYI, we noticed a -5.8% regression of will-it-scale.per_thread_ops due to commit:

Ok, that's noticeable.

And:

> commit: d55564cfc222326e944893eff0c4118353e349ec ("x86: Make __put_user() generate an out-of-line call")

Yeah, that wasn't supposed to cause any performance regressions. No
core code should use __put_user() so much.

But:

> | testcase: change | will-it-scale: will-it-scale.per_process_ops -7.3% regression             |
> | test machine     | 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory |
> | test parameters  | cpufreq_governor=performance                                              |
> |                  | mode=process                                                              |
> |                  | nr_task=100%                                                              |
> |                  | test=poll2                                                                |
> |                  | ucode=0x16                                                                |

Ok, it's poll(), and it's definitely the __put_user() there:

>       0.00            +1.8        1.77 ą  3%  perf-profile.children.cycles-pp.__put_user_nocheck_2
>       0.00            +1.6        1.64 ą  3%  perf-profile.self.cycles-pp.__put_user_nocheck_2

And in fact, it's that final "write back the 16-bit revents field" at the end.

Which must have sucked before too, because it used to do a "stac/clac"
for every word - but now it does it out of line.

The fix is to convert that loop to use "unsafe_put_user()" with the
necessary accoutrements around it, and that should speed things up
quite nicely. The (double) loop itself is actually just 14
instructions, it's ridiculous how bad the code used to be, and how
much better it is with the nice unsafe_put_user(). The whole double
loop ends up being just

        lea    0x68(%rsp),%rsi
        mov    %rcx,%rax
  1:    mov    0x8(%rsi),%ecx
        lea    0xc(%rsi),%rdx
        test   %ecx,%ecx
        je     3f
        lea    (%rax,%rcx,8),%rdi
  2:    movzwl 0x6(%rdx),%ecx
        mov    %cx,0x6(%rax)
        add    $0x8,%rax
        add    $0x8,%rdx
        cmp    %rdi,%rax
        jne    2b
  3:    mov    (%rsi),%rsi
        test   %rsi,%rsi
        jne    1b

with the attached patch.

Before, it would do the whole CLAC/STAC dance inside that loop for
every entry (and with that commit d55564cfc22 it would be a function
call, of course).

Can you verify that this fixes the regression (and in fact I'd expect
it to improve that test-case)?

NOTE! The patch is entirely untested. I verified that the code
generation now looks sane, and it all looks ObviouslyCorrect(tm) to
me, but mistakes happen and maybe I missed some detail..

               Linus

Download attachment "patch" of type "application/octet-stream" (1078 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ