Message-ID: <YZZNv3JflBYwRjdd@hirez.programming.kicks-ass.net>
Date: Thu, 18 Nov 2021 13:57:35 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Muchun Song <songmuchun@...edance.com>
Cc: mingo@...hat.com, will@...nel.org, longman@...hat.com,
boqun.feng@...il.com, linux-kernel@...r.kernel.org,
duanxiongchun@...edance.com, zhengqi.arch@...edance.com
Subject: Re: [PATCH] locking/rwsem: Optimize down_read_trylock() under highly
contended case
On Thu, Nov 18, 2021 at 05:44:55PM +0800, Muchun Song wrote:
> By using the above benchmark, the real execution times on an x86-64 system
> before and after the patch were:
What kind of x86_64?
>
>                    Before Patch    After Patch
>   # of Threads          real            real      reduced by
>   ------------        ------          ------      ----------
>         1             65,373          65,206         ~0.0%
>         4             15,467          15,378         ~0.5%
>        40              6,214           5,528        ~11.0%
>
> For the uncontended case, the new down_read_trylock() is the same as
> before. For the contended cases, the new down_read_trylock() is faster
> than before. The more contended the rwsem is, the bigger the win.
>
> Signed-off-by: Muchun Song <songmuchun@...edance.com>
> ---
> kernel/locking/rwsem.c | 11 ++++-------
> 1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index c51387a43265..ef2b2a3f508c 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1249,17 +1249,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
>
> DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
>
> - /*
> - * Optimize for the case when the rwsem is not locked at all.
> - */
> - tmp = RWSEM_UNLOCKED_VALUE;
> - do {
> + tmp = atomic_long_read(&sem->count);
> + while (!(tmp & RWSEM_READ_FAILED_MASK)) {
> if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> - tmp + RWSEM_READER_BIAS)) {
> + tmp + RWSEM_READER_BIAS)) {
> rwsem_set_reader_owned(sem);
> return 1;
> }
> - } while (!(tmp & RWSEM_READ_FAILED_MASK));
> + }
> return 0;
> }
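For reference, stripped of the rwsem plumbing, the two fast paths amount to
roughly the following (a simplified userspace sketch using C11 atomics
instead of atomic_long_*; READER_BIAS and READ_FAILED_MASK below are
placeholders for the real RWSEM_* constants, not their actual values):

#include <stdatomic.h>
#include <stdbool.h>

#define READER_BIAS		(1UL << 8)	/* stand-in for RWSEM_READER_BIAS */
#define READ_FAILED_MASK	0x7UL		/* placeholder; not the real kernel mask */

/* Old: assume the rwsem is unlocked and lead with the cmpxchg. */
static bool trylock_old(atomic_ulong *count)
{
	unsigned long tmp = 0;	/* RWSEM_UNLOCKED_VALUE */

	do {
		/* On failure, tmp is updated to the current value of *count. */
		if (atomic_compare_exchange_strong_explicit(count, &tmp,
				tmp + READER_BIAS,
				memory_order_acquire, memory_order_relaxed))
			return true;
	} while (!(tmp & READ_FAILED_MASK));

	return false;
}

/* New: read the current value first, only then attempt the cmpxchg. */
static bool trylock_new(atomic_ulong *count)
{
	unsigned long tmp = atomic_load_explicit(count, memory_order_relaxed);

	while (!(tmp & READ_FAILED_MASK)) {
		if (atomic_compare_exchange_strong_explicit(count, &tmp,
				tmp + READER_BIAS,
				memory_order_acquire, memory_order_relaxed))
			return true;
	}

	return false;
}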
This is weird... so the only difference is the leading load, but given
contention you'd expect that load to stall and, since it's a
non-exclusive load, the line to then get stolen by a competing CPU.
Whereas the old code starts with a cmpxchg, which will obviously also
stall, but does an exclusive load.
And the thinking is that the exclusive load and the presence of the
cmpxchg loop would keep the line on that CPU for a little while, so
progress is made.
Clearly this isn't working as expected. Also I suppose it would need
wider testing...
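FWIW, the benchmark itself got snipped from the quote above, so I'm only
guessing at its shape: N threads hammering trylock+unlock for a fixed
iteration count, timed with time(1). A userspace approximation using
pthread_rwlock_tryrdlock() as a stand-in for the kernel rwsem would look
something like this (build with gcc -O2 -pthread, run as 'time ./a.out <nthreads>'):

#include <pthread.h>
#include <stdlib.h>

#define ITERS	10000000UL	/* per-thread iteration count; arbitrary for this sketch */

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

static void *worker(void *arg)
{
	unsigned long i;

	for (i = 0; i < ITERS; i++) {
		/* Spin until the read-trylock succeeds, then drop it again. */
		while (pthread_rwlock_tryrdlock(&lock))
			;
		pthread_rwlock_unlock(&lock);
	}

	return NULL;
}

int main(int argc, char **argv)
{
	int i, nthreads = argc > 1 ? atoi(argv[1]) : 1;
	pthread_t *tids = calloc(nthreads, sizeof(*tids));

	if (!tids)
		return 1;

	for (i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, worker, NULL);
	for (i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	free(tids);
	return 0;
}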