Message-ID: <YZZNv3JflBYwRjdd@hirez.programming.kicks-ass.net>
Date: Thu, 18 Nov 2021 13:57:35 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Muchun Song <songmuchun@...edance.com>
Cc: mingo@...hat.com, will@...nel.org, longman@...hat.com,
boqun.feng@...il.com, linux-kernel@...r.kernel.org,
duanxiongchun@...edance.com, zhengqi.arch@...edance.com
Subject: Re: [PATCH] locking/rwsem: Optimize down_read_trylock() under highly
contended case
On Thu, Nov 18, 2021 at 05:44:55PM +0800, Muchun Song wrote:
> By using the above benchmark, the real execution times on an x86-64 system
> before and after the patch were:
What kind of x86_64?
>
>                    Before Patch    After Patch
>   # of Threads          real            real      reduced by
>   ------------        ------          ------      ----------
>         1             65,373          65,206         ~0.0%
>         4             15,467          15,378         ~0.5%
>        40              6,214           5,528        ~11.0%
>
> For the uncontended case, the new down_read_trylock() is the same as
> before. For the contended cases, the new down_read_trylock() is faster
> than before. The more contended the rwsem is, the bigger the win.
>
> Signed-off-by: Muchun Song <songmuchun@...edance.com>
> ---
> kernel/locking/rwsem.c | 11 ++++-------
> 1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index c51387a43265..ef2b2a3f508c 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -1249,17 +1249,14 @@ static inline int __down_read_trylock(struct rw_semaphore *sem)
>
> DEBUG_RWSEMS_WARN_ON(sem->magic != sem, sem);
>
> - /*
> - * Optimize for the case when the rwsem is not locked at all.
> - */
> - tmp = RWSEM_UNLOCKED_VALUE;
> - do {
> + tmp = atomic_long_read(&sem->count);
> + while (!(tmp & RWSEM_READ_FAILED_MASK)) {
> if (atomic_long_try_cmpxchg_acquire(&sem->count, &tmp,
> - tmp + RWSEM_READER_BIAS)) {
> + tmp + RWSEM_READER_BIAS)) {
> rwsem_set_reader_owned(sem);
> return 1;
> }
> - } while (!(tmp & RWSEM_READ_FAILED_MASK));
> + }
> return 0;
> }
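For reference, stripped of the rwsem plumbing, the two fast paths amount to
roughly the following (a simplified userspace sketch using C11 atomics
instead of atomic_long_*; READER_BIAS and READ_FAILED_MASK below are
placeholders for the real RWSEM_* constants, not their actual values):

#include <stdatomic.h>
#include <stdbool.h>

#define READER_BIAS		(1UL << 8)	/* stand-in for RWSEM_READER_BIAS */
#define READ_FAILED_MASK	0x7UL		/* placeholder; not the real kernel mask */

/* Old: assume the rwsem is unlocked and lead with the cmpxchg. */
static bool trylock_old(atomic_ulong *count)
{
	unsigned long tmp = 0;	/* RWSEM_UNLOCKED_VALUE */

	do {
		/* On failure, tmp is updated to the current value of *count. */
		if (atomic_compare_exchange_strong_explicit(count, &tmp,
				tmp + READER_BIAS,
				memory_order_acquire, memory_order_relaxed))
			return true;
	} while (!(tmp & READ_FAILED_MASK));

	return false;
}

/* New: read the current value first, only then attempt the cmpxchg. */
static bool trylock_new(atomic_ulong *count)
{
	unsigned long tmp = atomic_load_explicit(count, memory_order_relaxed);

	while (!(tmp & READ_FAILED_MASK)) {
		if (atomic_compare_exchange_strong_explicit(count, &tmp,
				tmp + READER_BIAS,
				memory_order_acquire, memory_order_relaxed))
			return true;
	}

	return false;
}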
This is weird... so the only difference is the leading load, but given
contention you'd expect that load to stall and, since it's a
non-exclusive load, the line to then get stolen by a competing CPU.
Whereas the old code starts with a cmpxchg, which will obviously also
stall, but does an exclusive load.
And the thinking is that the exclusive load and the presence of the
cmpxchg loop would keep the line on that CPU for a little while, so
progress is made.
Clearly this isn't working as expected. Also I suppose it would need
wider testing...
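FWIW, the benchmark itself got snipped from the quote above, so I'm only
guessing at its shape: N threads hammering trylock+unlock for a fixed
iteration count, timed with time(1). A userspace approximation using
pthread_rwlock_tryrdlock() as a stand-in for the kernel rwsem would look
something like this (build with gcc -O2 -pthread, run as 'time ./a.out <nthreads>'):

#include <pthread.h>
#include <stdlib.h>

#define ITERS	10000000UL	/* per-thread iteration count; arbitrary for this sketch */

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

static void *worker(void *arg)
{
	unsigned long i;

	for (i = 0; i < ITERS; i++) {
		/* Spin until the read-trylock succeeds, then drop it again. */
		while (pthread_rwlock_tryrdlock(&lock))
			;
		pthread_rwlock_unlock(&lock);
	}

	return NULL;
}

int main(int argc, char **argv)
{
	int i, nthreads = argc > 1 ? atoi(argv[1]) : 1;
	pthread_t *tids = calloc(nthreads, sizeof(*tids));

	if (!tids)
		return 1;

	for (i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, worker, NULL);
	for (i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	free(tids);
	return 0;
}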