Message-ID: <20150715181507.GK3717@linux.vnet.ibm.com>
Date:	Wed, 15 Jul 2015 11:15:07 -0700
From:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To:	Oleg Nesterov <oleg@...hat.com>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Daniel Wagner <daniel.wagner@...-carit.de>,
	Davidlohr Bueso <dave@...olabs.net>,
	Ingo Molnar <mingo@...hat.com>, Tejun Heo <tj@...nel.org>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 5/7] percpu-rwsem: change it to rely on rss_sync
 infrastructure

On Sun, Jul 12, 2015 at 01:36:01AM +0200, Oleg Nesterov wrote:
> Currently down_write/up_write call synchronize_sched_expedited()
> twice, which is evil. Change this code to rely on the rcu-sync
> primitives instead. This avoids the _expedited "big hammer", and can
> be faster in the contended case, or even when a single thread does
> down_write/up_write in a loop.

But "evil" is such a strong word!  ;-)

More seriously, introducing a read-side smp_mb() should allow the
write-release synchronize_sched_expedited() to be demoted to an
smp_mb() as well.  But yes, you would still have the write-acquire
synchronize_sched_expedited().  That of course could be eliminated
in the writer-only common case in the same way you eliminated the
write-acquisition grace-period delay in rss_sync.
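
Roughly what I have in mind, as a sketch only (this is against the old
->write_ctr code rather than your rcu_sync version, and the exact
barrier placement would of course need the usual careful audit):

	static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
	{
		bool success = false;

		preempt_disable();
		if (likely(!atomic_read(&brw->write_ctr))) {
			__this_cpu_add(*brw->fast_read_ctr, val);
			smp_mb(); /* new: pairs with the writer's release barrier */
			success = true;
		}
		preempt_enable();

		return success;
	}

	void percpu_up_write(struct percpu_rw_semaphore *brw)
	{
		up_write(&brw->rw_sem);
		smp_mb(); /* demoted from synchronize_sched_expedited() */
		/* the last writer unblocks update_fast_ctr() */
		atomic_dec(&brw->write_ctr);
	}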

My main concern would be the introduction of an exclusive lock for the
writer, but if you are hitting the write-side acquisition that hard,
perhaps you have other, bigger problems.

Another concern would be performance in a workload where there are way
more readers than writers, but enough writers so that there is not much
more than one grace period's worth of spacing between them.  In that case,
the readers have to hit the global lock more frequently with this change
than with the original code.  Cue the debate over the tradeoff between
disturbing non-idle non-nohz CPUs on the one hand and forcing readers to
hit the global lock more frequently on the other.  ;-)
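
For concreteness, the global lock in question is the down_read() in the
reader slow path, which this patch leaves alone -- roughly:

	void percpu_down_read(struct percpu_rw_semaphore *brw)
	{
		might_sleep();
		if (likely(update_fast_ctr(brw, +1))) {
			rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 0, _RET_IP_);
			return;
		}

		/* a writer is (or recently was) active: readers serialize here */
		down_read(&brw->rw_sem);
		atomic_inc(&brw->slow_read_ctr);
		/* avoid up_read()->rwsem_release() */
		__up_read(&brw->rw_sem);
	}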

							Thanx, Paul

> Of course, a single down_write() will take more time, but on the other
> hand it will be much more friendly to the whole system.
> 
> To simplify review, this patch doesn't update the comments; they are
> fixed by the next change.
> 
> Signed-off-by: Oleg Nesterov <oleg@...hat.com>
> ---
>  include/linux/percpu-rwsem.h  |    3 ++-
>  kernel/locking/percpu-rwsem.c |   18 +++++++-----------
>  2 files changed, 9 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/percpu-rwsem.h b/include/linux/percpu-rwsem.h
> index 3e88c9a..3e58226 100644
> --- a/include/linux/percpu-rwsem.h
> +++ b/include/linux/percpu-rwsem.h
> @@ -5,11 +5,12 @@
>  #include <linux/rwsem.h>
>  #include <linux/percpu.h>
>  #include <linux/wait.h>
> +#include <linux/rcusync.h>
>  #include <linux/lockdep.h>
> 
>  struct percpu_rw_semaphore {
> +	struct rcu_sync_struct	rss;
>  	unsigned int __percpu	*fast_read_ctr;
> -	atomic_t		write_ctr;
>  	struct rw_semaphore	rw_sem;
>  	atomic_t		slow_read_ctr;
>  	wait_queue_head_t	write_waitq;
> diff --git a/kernel/locking/percpu-rwsem.c b/kernel/locking/percpu-rwsem.c
> index 652a8ee..69a7314 100644
> --- a/kernel/locking/percpu-rwsem.c
> +++ b/kernel/locking/percpu-rwsem.c
> @@ -17,7 +17,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
> 
>  	/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
>  	__init_rwsem(&brw->rw_sem, name, rwsem_key);
> -	atomic_set(&brw->write_ctr, 0);
> +	rcu_sync_init(&brw->rss, RCU_SCHED_SYNC);
>  	atomic_set(&brw->slow_read_ctr, 0);
>  	init_waitqueue_head(&brw->write_waitq);
>  	return 0;
> @@ -25,6 +25,7 @@ int __percpu_init_rwsem(struct percpu_rw_semaphore *brw,
> 
>  void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
>  {
> +	rcu_sync_dtor(&brw->rss);
>  	free_percpu(brw->fast_read_ctr);
>  	brw->fast_read_ctr = NULL; /* catch use after free bugs */
>  }
> @@ -54,13 +55,12 @@ void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
>   */
>  static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
>  {
> -	bool success = false;
> +	bool success;
> 
>  	preempt_disable();
> -	if (likely(!atomic_read(&brw->write_ctr))) {
> +	success = rcu_sync_is_idle(&brw->rss);
> +	if (likely(success))
>  		__this_cpu_add(*brw->fast_read_ctr, val);
> -		success = true;
> -	}
>  	preempt_enable();
> 
>  	return success;
> @@ -126,8 +126,6 @@ static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
>   */
>  void percpu_down_write(struct percpu_rw_semaphore *brw)
>  {
> -	/* tell update_fast_ctr() there is a pending writer */
> -	atomic_inc(&brw->write_ctr);
>  	/*
>  	 * 1. Ensures that write_ctr != 0 is visible to any down_read/up_read
>  	 *    so that update_fast_ctr() can't succeed.
> @@ -139,7 +137,7 @@ void percpu_down_write(struct percpu_rw_semaphore *brw)
>  	 *    fast-path, it executes a full memory barrier before we return.
>  	 *    See R_W case in the comment above update_fast_ctr().
>  	 */
> -	synchronize_sched_expedited();
> +	rcu_sync_enter(&brw->rss);
> 
>  	/* exclude other writers, and block the new readers completely */
>  	down_write(&brw->rw_sem);
> @@ -159,7 +157,5 @@ void percpu_up_write(struct percpu_rw_semaphore *brw)
>  	 * Insert the barrier before the next fast-path in down_read,
>  	 * see W_R case in the comment above update_fast_ctr().
>  	 */
> -	synchronize_sched_expedited();
> -	/* the last writer unblocks update_fast_ctr() */
> -	atomic_dec(&brw->write_ctr);
> +	rcu_sync_exit(&brw->rss);
>  }
> -- 
> 1.5.5.1
> 
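
P.S.  For anyone following along, here is the rcu_sync contract that
this patch relies on, as I understand the rss_sync series -- a sketch of
the intended semantics, not the exact code:

	struct rcu_sync_struct rss;

	rcu_sync_init(&rss, RCU_SCHED_SYNC);	/* starts out idle */

	/*
	 * Writer side.  rcu_sync_enter() waits for a grace period only
	 * when leaving the idle state, so back-to-back writers skip the
	 * wait.  rcu_sync_exit() never waits; the return to the idle
	 * state happens via an asynchronous callback.
	 */
	rcu_sync_enter(&rss);
	/* ... exclusive work ... */
	rcu_sync_exit(&rss);

	/* Reader side, with preemption disabled for RCU_SCHED_SYNC. */
	if (rcu_sync_is_idle(&rss)) {
		/* No writer is active: the fast path is safe. */
	}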

