linux-kernel - Re: [PATCH 2/2] x86/percpu: Use raw_cpu_try_cmpxchg in preempt_count


Message-ID: <CAFULd4ZJDE5Hp9DFNpk+pFbCAC2=asEm1eLmQxy2uOWRbLkRwQ@mail.gmail.com>
Date:   Fri, 15 Sep 2023 14:01:59 +0200
From:   Uros Bizjak <ubizjak@...il.com>
To:     Ingo Molnar <mingo@...nel.org>
Cc:     x86@...nel.org, linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "H. Peter Anvin" <hpa@...or.com>
Subject: Re: [PATCH 2/2] x86/percpu: Use raw_cpu_try_cmpxchg in preempt_count_set

On Fri, Sep 15, 2023 at 11:47 AM Ingo Molnar <mingo@...nel.org> wrote:
>
>
> * Uros Bizjak <ubizjak@...il.com> wrote:
>
> > Use raw_cpu_try_cmpxchg instead of raw_cpu_cmpxchg (*ptr, old, new) == old.
> > x86 CMPXCHG instruction returns success in ZF flag, so this change saves a
> > compare after cmpxchg (and related move instruction in front of cmpxchg).
> >
> > Also, raw_cpu_try_cmpxchg implicitly assigns old *ptr value to "old" when
> > cmpxchg fails. There is no need to re-read the value in the loop.
> >
> > No functional change intended.
> >
> > Cc: Peter Zijlstra <peterz@...radead.org>
> > Cc: Thomas Gleixner <tglx@...utronix.de>
> > Cc: Ingo Molnar <mingo@...hat.com>
> > Cc: Borislav Petkov <bp@...en8.de>
> > Cc: Dave Hansen <dave.hansen@...ux.intel.com>
> > Cc: "H. Peter Anvin" <hpa@...or.com>
> > Signed-off-by: Uros Bizjak <ubizjak@...il.com>
> > ---
> >  arch/x86/include/asm/preempt.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
> > index 2d13f25b1bd8..4527e1430c6d 100644
> > --- a/arch/x86/include/asm/preempt.h
> > +++ b/arch/x86/include/asm/preempt.h
> > @@ -31,11 +31,11 @@ static __always_inline void preempt_count_set(int pc)
> >  {
> >       int old, new;
> >
> > +     old = raw_cpu_read_4(pcpu_hot.preempt_count);
> >       do {
> > -             old = raw_cpu_read_4(pcpu_hot.preempt_count);
> >               new = (old & PREEMPT_NEED_RESCHED) |
> >                       (pc & ~PREEMPT_NEED_RESCHED);
> > -     } while (raw_cpu_cmpxchg_4(pcpu_hot.preempt_count, old, new) != old);
> > +     } while (!raw_cpu_try_cmpxchg_4(pcpu_hot.preempt_count, &old, new));
>
> It would be really nice to have a before/after comparison of generated
> assembly code in the changelog, to demonstrate the effectiveness of this
> optimization.

The assembly code improvements are in line with other try_cmpxchg
conversions, but for reference, finish_task_switch() from
kernel/sched/core.c, which inlines preempt_count_set(), improves from:

    5bad:    65 8b 0d 00 00 00 00     mov    %gs:0x0(%rip),%ecx
    5bb4:    89 ca                    mov    %ecx,%edx
    5bb6:    89 c8                    mov    %ecx,%eax
    5bb8:    81 e2 00 00 00 80        and    $0x80000000,%edx
    5bbe:    83 ca 02                 or     $0x2,%edx
    5bc1:    65 0f b1 15 00 00 00     cmpxchg %edx,%gs:0x0(%rip)
    5bc8:    00
    5bc9:    39 c1                    cmp    %eax,%ecx
    5bcb:    75 e0                    jne    5bad <...>
    5bcd:    e9 5a fe ff ff           jmpq   5a2c <...>
    5bd2:

to:

    5bad:    65 8b 05 00 00 00 00     mov    %gs:0x0(%rip),%eax
    5bb4:    89 c2                    mov    %eax,%edx
    5bb6:    81 e2 00 00 00 80        and    $0x80000000,%edx
    5bbc:    83 ca 02                 or     $0x2,%edx
    5bbf:    65 0f b1 15 00 00 00     cmpxchg %edx,%gs:0x0(%rip)
    5bc6:    00
    5bc7:    0f 84 5f fe ff ff        je     5a2c <...>
    5bcd:    eb e5                    jmp    5bb4 <...>
    5bcf:

Please note the missing cmp (and mov), the loop without an extra memory
load from %gs:0x0(%rip), and the better-predicted jump in the latter
case. The improvements with {raw,this}_cpu_try_cmpxchg_128 in the third
patch are even more noticeable, because an __int128 value lives in a
register pair, so the comparison needs three separate machine
instructions, in addition to a move of the register pair.
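
For readers who want to try the two loop shapes outside the kernel, here
is a minimal user-space sketch. It is not the kernel code: the helper
names (cmpxchg_u32, try_cmpxchg_u32, count_set_*), the global "count"
and the NEED_RESCHED constant are hypothetical stand-ins, and it uses
the GCC/Clang __atomic builtins, which emit a locked cmpxchg on x86,
whereas the raw_cpu_* operations are per-CPU and unlocked. It only
illustrates the calling convention: the try variant returns success
directly and refreshes "old" on failure, so the loop needs neither a
separate compare nor a re-read.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PREEMPT_NEED_RESCHED (top bit). */
#define NEED_RESCHED 0x80000000u

/* Hypothetical stand-in for pcpu_hot.preempt_count. */
static unsigned int count;

/* cmpxchg style: returns the value found in *ptr; caller compares it. */
static unsigned int cmpxchg_u32(unsigned int *ptr, unsigned int old,
				unsigned int new_val)
{
	/* On failure the builtin stores the current *ptr into old. */
	__atomic_compare_exchange_n(ptr, &old, new_val, false,
				    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
	return old;
}

/* try_cmpxchg style: returns success, updates *old on failure. */
static bool try_cmpxchg_u32(unsigned int *ptr, unsigned int *old,
			    unsigned int new_val)
{
	return __atomic_compare_exchange_n(ptr, old, new_val, false,
					   __ATOMIC_RELAXED, __ATOMIC_RELAXED);
}

/* Loop shape before the patch: re-read and compare on every iteration. */
static void count_set_cmpxchg(unsigned int pc)
{
	unsigned int old, new_val;

	do {
		old = __atomic_load_n(&count, __ATOMIC_RELAXED);
		new_val = (old & NEED_RESCHED) | (pc & ~NEED_RESCHED);
	} while (cmpxchg_u32(&count, old, new_val) != old);
}

/* Loop shape after the patch: read once, let a failure refresh old. */
static void count_set_try_cmpxchg(unsigned int pc)
{
	unsigned int old, new_val;

	old = __atomic_load_n(&count, __ATOMIC_RELAXED);
	do {
		new_val = (old & NEED_RESCHED) | (pc & ~NEED_RESCHED);
	} while (!try_cmpxchg_u32(&count, &old, new_val));
}

/* Small self-check that both variants preserve the NEED_RESCHED bit. */
static void check_both_variants(void)
{
	count = NEED_RESCHED | 5;
	count_set_cmpxchg(2);
	assert(count == (NEED_RESCHED | 2));

	count = NEED_RESCHED | 5;
	count_set_try_cmpxchg(2);
	assert(count == (NEED_RESCHED | 2));
}
```

Compiling both count_set_* functions with -O2 and comparing the
disassembly should show the same kind of difference as above, modulo
the lock prefix and the absence of the %gs segment override.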

Thanks,
Uros.