linux-kernel - Re: [PATCH] x86/smp: Use atomic_try_cmpxchg() to micro-optimize native_stop_other

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAFULd4ZA2w+rz==cw1e18yJ9sdLpAi8XLZ7JU104yyD0Maspdw@mail.gmail.com>
Date:   Fri, 17 Nov 2023 17:37:43 +0100
From:   Uros Bizjak <ubizjak@...il.com>
To:     Dave Hansen <dave.hansen@...el.com>
Cc:     x86@...nel.org, linux-kernel@...r.kernel.org,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>
Subject: Re: [PATCH] x86/smp: Use atomic_try_cmpxchg() to micro-optimize native_stop_other_cpus()

On Fri, Nov 17, 2023 at 5:18 PM Dave Hansen <dave.hansen@...el.com> wrote:
>
> On 11/14/23 08:43, Uros Bizjak wrote:
> > Use atomic_try_cmpxchg() instead of atomic_cmpxchg(*ptr, old, new) == old
> > in native_stop_other_cpus(). On x86 the CMPXCHG instruction returns success
> > in the ZF flag, so this change saves a compare after CMPXCHG. Together
> > with a small code reorder, the generated asm code improves from:
> >
> >   74: 8b 05 00 00 00 00       mov    0x0(%rip),%eax
> >   7a: 41 54                   push   %r12
> >   7c: 55                      push   %rbp
> >   7d: 65 8b 2d 00 00 00 00    mov    %gs:0x0(%rip),%ebp
> >   84: 53                      push   %rbx
> >   85: 85 c0                   test   %eax,%eax
> >   87: 75 71                   jne    fa <native_stop_other_cpus+0x8a>
> >   89: b8 ff ff ff ff          mov    $0xffffffff,%eax
> >   8e: f0 0f b1 2d 00 00 00    lock cmpxchg %ebp,0x0(%rip)
> >   95: 00
> >   96: 83 f8 ff                cmp    $0xffffffff,%eax
> >   99: 75 5f                   jne    fa <native_stop_other_cpus+0x8a>
> >
> > to:
> >
> >   74: 8b 05 00 00 00 00       mov    0x0(%rip),%eax
> >   7a: 85 c0                   test   %eax,%eax
> >   7c: 0f 85 84 00 00 00       jne    106 <native_stop_other_cpus+0x96>
> >   82: 41 54                   push   %r12
> >   84: b8 ff ff ff ff          mov    $0xffffffff,%eax
> >   89: 55                      push   %rbp
> >   8a: 53                      push   %rbx
> >   8b: 65 8b 1d 00 00 00 00    mov    %gs:0x0(%rip),%ebx
> >   92: f0 0f b1 1d 00 00 00    lock cmpxchg %ebx,0x0(%rip)
> >   99: 00
> >   9a: 75 5e                   jne    fa <native_stop_other_cpus+0x8a>
> >
> > Please note early exit and lack of CMP after CMPXCHG.
>
> Uros, I really do appreciate that you are trying to optimize these
> paths.  But the thing we have to balance is the _need_ for optimization
> with the chance that this will break something.
>
> This is about as much of a slow path as we have in the kernel.  It's
> only used at shutdown, right?  That means this is one of the places in
> the kernel that least needs optimization.  It can only possibly get run
> once per boot.
>
> So, the benefit is that it might make this code a few cycles faster.  In
> practice, it might not even be measurably faster.
>
> On the other hand, this is relatively untested and also makes the C code
> more complicated.
>
> Is there some other side benefit that I'm missing here?  Applying this
> patch doesn't seem to have a great risk/reward ratio.

Yes, in addition to better asm code, I think that the use of magic
constant (-1) is not descriptive at all. I tried to make this code
look like nmi_panic() from kernel/panic.c, which has similar
functionality, and describe that this constant belongs to old_cpu
(same as in nmi_panic() ). Also, from converting many cmpxchg to
try_cmpxchg, it becomes evident that in cases like this (usage in "if"
clauses) the correct locking primitive is try_cmpxchg. Additionally,
in this particular case, it is not the speed, but a little code save
that can be achieved with the same functionality.

Thanks,
Uros.