Message-ID: <20230228160324.2a7c1012@gandalf.local.home>
Date: Tue, 28 Feb 2023 16:03:24 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: Joel Fernandes <joel@...lfernandes.org>
Cc: Uros Bizjak <ubizjak@...il.com>, rcu@...r.kernel.org,
linux-kernel@...r.kernel.org,
"Paul E. McKenney" <paulmck@...nel.org>,
Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <quic_neeraju@...cinc.com>,
Josh Triplett <josh@...htriplett.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Lai Jiangshan <jiangshanlai@...il.com>
Subject: Re: [PATCH] rcu: use try_cmpxchg in check_cpu_stall
On Tue, 28 Feb 2023 20:39:30 +0000
Joel Fernandes <joel@...lfernandes.org> wrote:
> On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> > check_cpu_stall. The x86 CMPXCHG instruction returns success in the ZF
> > flag, so this change saves a compare after the cmpxchg (and the related
> > move instruction in front of the cmpxchg).
>
> In my codegen, I am not seeing the mov instruction in front of the cmpxchg
> removed; how can that be? rax has to be populated with a mov before the
> cmpxchg, right?
>
> So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> Whereas cmpxchg gives: mov, cmpxchg, mov, jne
>
> So yeah, you got rid of the compare, but I am not seeing a reduction in
> moves. Either way, I think it is an improvement due to dropping the cmp, so:
Did you get the above backwards?
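For reference, the two source-level patterns look roughly like this (a
generic sketch with hypothetical ptr/old/new/FLAG, not the actual
check_cpu_stall code):

	unsigned long old, new;

	/* cmpxchg(): returns the value that was in *ptr, so the caller
	 * must compare it against "old" -- that is the extra cmp. */
	old = READ_ONCE(*ptr);
	new = old | FLAG;
	if (cmpxchg(ptr, old, new) == old) {
		/* we won the race */
	}

	/* try_cmpxchg(): returns success as a boolean, so on x86 the ZF
	 * set by CMPXCHG feeds the branch directly; on failure the
	 * current value is written back into "old". */
	old = READ_ONCE(*ptr);
	new = old | FLAG;
	if (try_cmpxchg(ptr, &old, new)) {
		/* we won the race */
	}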
Anyway, when looking at the conversion of cmpxchg() to try_cmpxchg() that
Uros sent to me for the ring buffer, the code went from:
0000000000000070 <ring_buffer_record_off>:
  70:   48 8d 4f 08             lea    0x8(%rdi),%rcx
  74:   8b 57 08                mov    0x8(%rdi),%edx
  77:   89 d6                   mov    %edx,%esi
  79:   89 d0                   mov    %edx,%eax
  7b:   81 ce 00 00 10 00       or     $0x100000,%esi
  81:   f0 0f b1 31             lock cmpxchg %esi,(%rcx)
  85:   39 d0                   cmp    %edx,%eax
  87:   75 eb                   jne    74 <ring_buffer_record_off+0x4>
  89:   e9 00 00 00 00          jmp    8e <ring_buffer_record_off+0x1e>
                        8a: R_X86_64_PLT32      __x86_return_thunk-0x4
  8e:   66 90                   xchg   %ax,%ax
To
00000000000001a0 <ring_buffer_record_off>:
 1a0:   8b 47 08                mov    0x8(%rdi),%eax
 1a3:   48 8d 4f 08             lea    0x8(%rdi),%rcx
 1a7:   89 c2                   mov    %eax,%edx
 1a9:   81 ca 00 00 10 00       or     $0x100000,%edx
 1af:   f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
 1b4:   75 05                   jne    1bb <ring_buffer_record_off+0x1b>
 1b6:   e9 00 00 00 00          jmp    1bb <ring_buffer_record_off+0x1b>
                        1b7: R_X86_64_PLT32     __x86_return_thunk-0x4
 1bb:   89 c2                   mov    %eax,%edx
 1bd:   81 ca 00 00 10 00       or     $0x100000,%edx
 1c3:   f0 0f b1 11             lock cmpxchg %edx,(%rcx)
 1c7:   75 f2                   jne    1bb <ring_buffer_record_off+0x1b>
 1c9:   e9 00 00 00 00          jmp    1ce <ring_buffer_record_off+0x2e>
                        1ca: R_X86_64_PLT32     __x86_return_thunk-0x4
 1ce:   66 90                   xchg   %ax,%ax
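For context, the converted loop looks roughly like this in C (a sketch
reconstructed from the assembly above, assuming RB_BUFFER_OFF is the
1 << 20 flag behind the or $0x100000):

void ring_buffer_record_off(struct trace_buffer *buffer)
{
	unsigned int rd;
	unsigned int new_rd;

	rd = atomic_read(&buffer->record_disabled);
	do {
		new_rd = rd | RB_BUFFER_OFF;
		/* on failure, try_cmpxchg() refreshes "rd" with the
		 * current value, so the loop never re-reads it */
	} while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
}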
It does add a bit more code, but the fast path seems better (where the
cmpxchg succeeds). That would be:
00000000000001a0 <ring_buffer_record_off>:
 1a0:   8b 47 08                mov    0x8(%rdi),%eax
 1a3:   48 8d 4f 08             lea    0x8(%rdi),%rcx
 1a7:   89 c2                   mov    %eax,%edx
 1a9:   81 ca 00 00 10 00       or     $0x100000,%edx
 1af:   f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
 1b4:   75 05                   jne    1bb <ring_buffer_record_off+0x1b>
 1b6:   e9 00 00 00 00          jmp    1bb <ring_buffer_record_off+0x1b>
                        1b7: R_X86_64_PLT32     __x86_return_thunk-0x4
Here the fast path has only two moves and no cmp, whereas the former version
has three moves and a cmp in its fast path.
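The retry loop also gets away without a reload because try_cmpxchg() hands
the current value back through its "old" pointer on failure, and on x86 the
CMPXCHG instruction has already left that value in %eax. Roughly, in terms
of plain cmpxchg (a sketch of the generic fallback, with a hypothetical
name):

static inline bool my_try_cmpxchg(atomic_t *v, int *old, int new)
{
	int cur = atomic_cmpxchg(v, *old, new);

	if (cur != *old) {
		*old = cur;	/* caller's "old" now holds the fresh value */
		return false;
	}
	return true;
}

On x86 both the compare and the write-back fall out of the CMPXCHG
instruction itself, which is why only the jne remains.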
-- Steve