Message-ID: <20230228160324.2a7c1012@gandalf.local.home>
Date: Tue, 28 Feb 2023 16:03:24 -0500
From: Steven Rostedt <rostedt@...dmis.org>
To: Joel Fernandes <joel@...lfernandes.org>
Cc: Uros Bizjak <ubizjak@...il.com>, rcu@...r.kernel.org,
linux-kernel@...r.kernel.org,
"Paul E. McKenney" <paulmck@...nel.org>,
Frederic Weisbecker <frederic@...nel.org>,
Neeraj Upadhyay <quic_neeraju@...cinc.com>,
Josh Triplett <josh@...htriplett.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Lai Jiangshan <jiangshanlai@...il.com>
Subject: Re: [PATCH] rcu: use try_cmpxchg in check_cpu_stall
On Tue, 28 Feb 2023 20:39:30 +0000
Joel Fernandes <joel@...lfernandes.org> wrote:
> On Tue, Feb 28, 2023 at 04:51:21PM +0100, Uros Bizjak wrote:
> > Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old in
> > check_cpu_stall. The x86 CMPXCHG instruction returns success in the ZF
> > flag, so this change saves a compare after the cmpxchg (and the related
> > move instruction in front of the cmpxchg).
>
> In my codegen, I am not seeing the mov instruction in front of the cmpxchg
> removed; how can that be? rax has to be populated with a mov before the
> cmpxchg, right?
>
> So try_cmpxchg gives: mov, cmpxchg, cmp, jne
> Whereas cmpxchg gives: mov, cmpxchg, mov, jne
>
> So yeah, you got rid of the compare, but I am not seeing a reduction in
> moves. Either way, I think it is an improvement due to dropping the cmp, so:
Did you get the above backwards?
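For reference, the two source-level patterns look roughly like this (a
generic sketch with hypothetical ptr/old/new/FLAG, not the actual
check_cpu_stall code):

	unsigned long old, new;

	/* cmpxchg(): returns the value that was in *ptr, so the caller
	 * must compare it against "old" -- that is the extra cmp. */
	old = READ_ONCE(*ptr);
	new = old | FLAG;
	if (cmpxchg(ptr, old, new) == old) {
		/* we won the race */
	}

	/* try_cmpxchg(): returns success as a boolean, so on x86 the ZF
	 * set by CMPXCHG feeds the branch directly; on failure the
	 * current value is written back into "old". */
	old = READ_ONCE(*ptr);
	new = old | FLAG;
	if (try_cmpxchg(ptr, &old, new)) {
		/* we won the race */
	}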
Anyway, when looking at the conversion of cmpxchg() to try_cmpxchg() that
Uros sent to me for the ring buffer, the code went from:
0000000000000070 <ring_buffer_record_off>:
  70:   48 8d 4f 08             lea    0x8(%rdi),%rcx
  74:   8b 57 08                mov    0x8(%rdi),%edx
  77:   89 d6                   mov    %edx,%esi
  79:   89 d0                   mov    %edx,%eax
  7b:   81 ce 00 00 10 00       or     $0x100000,%esi
  81:   f0 0f b1 31             lock cmpxchg %esi,(%rcx)
  85:   39 d0                   cmp    %edx,%eax
  87:   75 eb                   jne    74 <ring_buffer_record_off+0x4>
  89:   e9 00 00 00 00          jmp    8e <ring_buffer_record_off+0x1e>
                        8a: R_X86_64_PLT32      __x86_return_thunk-0x4
  8e:   66 90                   xchg   %ax,%ax
To
00000000000001a0 <ring_buffer_record_off>:
 1a0:   8b 47 08                mov    0x8(%rdi),%eax
 1a3:   48 8d 4f 08             lea    0x8(%rdi),%rcx
 1a7:   89 c2                   mov    %eax,%edx
 1a9:   81 ca 00 00 10 00       or     $0x100000,%edx
 1af:   f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
 1b4:   75 05                   jne    1bb <ring_buffer_record_off+0x1b>
 1b6:   e9 00 00 00 00          jmp    1bb <ring_buffer_record_off+0x1b>
                        1b7: R_X86_64_PLT32     __x86_return_thunk-0x4
 1bb:   89 c2                   mov    %eax,%edx
 1bd:   81 ca 00 00 10 00       or     $0x100000,%edx
 1c3:   f0 0f b1 11             lock cmpxchg %edx,(%rcx)
 1c7:   75 f2                   jne    1bb <ring_buffer_record_off+0x1b>
 1c9:   e9 00 00 00 00          jmp    1ce <ring_buffer_record_off+0x2e>
                        1ca: R_X86_64_PLT32     __x86_return_thunk-0x4
 1ce:   66 90                   xchg   %ax,%ax
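For context, the converted loop looks roughly like this in C (a sketch
reconstructed from the assembly above, assuming RB_BUFFER_OFF is the
1 << 20 flag behind the or $0x100000):

void ring_buffer_record_off(struct trace_buffer *buffer)
{
	unsigned int rd;
	unsigned int new_rd;

	rd = atomic_read(&buffer->record_disabled);
	do {
		new_rd = rd | RB_BUFFER_OFF;
		/* on failure, try_cmpxchg() refreshes "rd" with the
		 * current value, so the loop never re-reads it */
	} while (!atomic_try_cmpxchg(&buffer->record_disabled, &rd, new_rd));
}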
It does add a bit more code, but the fast path seems better (where the
cmpxchg succeeds). That would be:
00000000000001a0 <ring_buffer_record_off>:
 1a0:   8b 47 08                mov    0x8(%rdi),%eax
 1a3:   48 8d 4f 08             lea    0x8(%rdi),%rcx
 1a7:   89 c2                   mov    %eax,%edx
 1a9:   81 ca 00 00 10 00       or     $0x100000,%edx
 1af:   f0 0f b1 57 08          lock cmpxchg %edx,0x8(%rdi)
 1b4:   75 05                   jne    1bb <ring_buffer_record_off+0x1b>
 1b6:   e9 00 00 00 00          jmp    1bb <ring_buffer_record_off+0x1b>
                        1b7: R_X86_64_PLT32     __x86_return_thunk-0x4
Here the fast path has only two moves and no cmp, whereas the former version
has three moves and a cmp in its fast path.
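The retry loop also gets away without a reload because try_cmpxchg() hands
the current value back through its "old" pointer on failure, and on x86 the
CMPXCHG instruction has already left that value in %eax. Roughly, in terms
of plain cmpxchg (a sketch of the generic fallback, with a hypothetical
name):

static inline bool my_try_cmpxchg(atomic_t *v, int *old, int new)
{
	int cur = atomic_cmpxchg(v, *old, new);

	if (cur != *old) {
		*old = cur;	/* caller's "old" now holds the fresh value */
		return false;
	}
	return true;
}

On x86 both the compare and the write-back fall out of the CMPXCHG
instruction itself, which is why only the jne remains.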
-- Steve