Date:	Thu, 8 Nov 2012 18:41:10 -0500 (EST)
From:	Mikulas Patocka <mpatocka@...hat.com>
To:	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Oleg Nesterov <oleg@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...e.hu>,
	Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
	Ananth N Mavinakayanahalli <ananth@...ibm.com>,
	Anton Arapov <anton@...hat.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not
 block the readers unnecessarily



On Thu, 8 Nov 2012, Paul E. McKenney wrote:

> On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > On Thu, 8 Nov 2012 14:48:49 +0100
> > Oleg Nesterov <oleg@...hat.com> wrote:
> > 
> > > Currently the writer does msleep() plus synchronize_sched() 3 times
> > > to acquire/release the semaphore, and during this time the readers
> > > are blocked completely, even if the "write" section has not actually
> > > started or has already finished.
> > > 
> > > With this patch down_write/up_write does synchronize_sched() twice,
> > > and down_read/up_read are still possible during this time; they just
> > > use the slow path.
> > > 
> > > percpu_down_write() first forces the readers to use the rw_semaphore
> > > and increment the "slow" counter to take the lock for reading, then
> > > it takes that rw_semaphore for writing and blocks the readers.
> > > 
> > > Also, with this patch the code relies on the documented behaviour of
> > > synchronize_sched(); it doesn't try to pair synchronize_sched() with
> > > a barrier.
> > > 
> > > ...
> > >
> > >  include/linux/percpu-rwsem.h |   83 +++++------------------------
> > >  lib/Makefile                 |    2 +-
> > >  lib/percpu-rwsem.c           |  123 ++++++++++++++++++++++++++++++++++++++++++
> > 
> > The patch also uninlines everything.
> > 
> > And it didn't export the resulting symbols to modules, so it isn't an
> > equivalent.  We can export things later if needed, I guess.
> > 
> > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > avoid including the code altogether, methinks?
> > 
> > >
> > > ...
> > >
> > > --- /dev/null
> > > +++ b/lib/percpu-rwsem.c
> > > @@ -0,0 +1,123 @@
> > 
> > That was nice and terse ;)
> > 
> > > +#include <linux/percpu-rwsem.h>
> > > +#include <linux/rcupdate.h>
> > > +#include <linux/sched.h>
> > 
> > This list is nowhere near sufficient to support this file's
> > requirements.  atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > more.  IOW, if it compiles, it was sheer luck.
> > 
> > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	brw->fast_read_ctr = alloc_percpu(int);
> > > +	if (unlikely(!brw->fast_read_ctr))
> > > +		return -ENOMEM;
> > > +
> > > +	mutex_init(&brw->writer_mutex);
> > > +	init_rwsem(&brw->rw_sem);
> > > +	atomic_set(&brw->slow_read_ctr, 0);
> > > +	init_waitqueue_head(&brw->write_waitq);
> > > +	return 0;
> > > +}
> > > +
> > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	free_percpu(brw->fast_read_ctr);
> > > +	brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > > +}
> > > +
> > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
> > > +{
> > > +	bool success = false;
> > > +
> > > +	preempt_disable();
> > > +	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > > +		__this_cpu_add(*brw->fast_read_ctr, val);
> > > +		success = true;
> > > +	}
> > > +	preempt_enable();
> > > +
> > > +	return success;
> > > +}
> > > +
> > > +/*
> > > + * Like the normal down_read() this is not recursive, the writer can
> > > + * come after the first percpu_down_read() and create the deadlock.
> > > + */
> > > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	if (likely(update_fast_ctr(brw, +1)))
> > > +		return;
> > > +
> > > +	down_read(&brw->rw_sem);
> > > +	atomic_inc(&brw->slow_read_ctr);
> > > +	up_read(&brw->rw_sem);
> > > +}
> > > +
> > > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	if (likely(update_fast_ctr(brw, -1)))
> > > +		return;
> > > +
> > > +	/* false-positive is possible but harmless */
> > > +	if (atomic_dec_and_test(&brw->slow_read_ctr))
> > > +		wake_up_all(&brw->write_waitq);
> > > +}
> > > +
> > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	unsigned int sum = 0;
> > > +	int cpu;
> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		sum += per_cpu(*brw->fast_read_ctr, cpu);
> > > +		per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > > +	}
> > > +
> > > +	return sum;
> > > +}
> > > +
> > > +/*
> > > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > > + * update_fast_ctr().
> > > + *
> > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > > + * counter it represents the number of active readers.
> > > + *
> > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> > > + * then waits until the slow counter becomes zero.
> > > + */
> > 
> > Some overview of how fast/slow_read_ctr are supposed to work would be
> > useful.  This comment seems to assume that the reader already knows
> > that.
> > 
> > > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > > +{
> > > +	/* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > > +	mutex_lock(&brw->writer_mutex);
> > > +
> > > +	/*
> > > +	 * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > > +	 *    so that update_fast_ctr() can't succeed.
> > > +	 *
> > > +	 * 2. Ensures we see the result of every previous this_cpu_add() in
> > > +	 *    update_fast_ctr().
> > > +	 *
> > > +	 * 3. Ensures that if any reader has exited its critical section via
> > > +	 *    fast-path, it executes a full memory barrier before we return.
> > > +	 */
> > > +	synchronize_sched();
> > 
> > Here's where I get horridly confused.  Your patch completely deRCUifies
> > this code, yes?  Yet here we're using an RCU primitive.  And we seem to
> > be using it not as an RCU primitive but as a handy thing which happens
> > to have desirable side-effects.  But the implementation of
> > synchronize_sched() differs considerably according to which rcu
> > flavor-of-the-minute you're using.
> 
> The trick is that the preempt_disable() call in update_fast_ctr()
> acts as an RCU read-side critical section WRT synchronize_sched().
> 
> The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
> synchronize_rcu() in place of preempt_disable()/preempt_enable() and
> synchronize_sched().  The real-time guys would prefer the change
> to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
> you mention it.
> 
> Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock()
> and synchronize_rcu()?
> 
> 							Thanx, Paul

preempt_disable/preempt_enable is faster than 
rcu_read_lock/rcu_read_unlock for preemptive kernels.
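
To illustrate what Paul is suggesting, the fast path would look roughly
like this.  This is only a sketch (the update_fast_ctr_rcu() name is made
up, it is not in the patch); note this_cpu_add() instead of
__this_cpu_add(), because with rcu_read_lock() preemption stays enabled:

/* Sketch of Paul's variant; assumes the same includes as lib/percpu-rwsem.c above */
static bool update_fast_ctr_rcu(struct percpu_rw_semaphore *brw, unsigned int val)
{
	bool success = false;

	rcu_read_lock();	/* read-side critical section w.r.t. synchronize_rcu() */
	if (likely(!mutex_is_locked(&brw->writer_mutex))) {
		/* preemption may happen here, so use the preempt-safe percpu op */
		this_cpu_add(*brw->fast_read_ctr, val);
		success = true;
	}
	rcu_read_unlock();

	return success;
}

percpu_down_write() would then have to use synchronize_rcu() instead of
synchronize_sched(), as Paul says.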

Regarding real-time response: the region covered by
preempt_disable/preempt_enable contains only a few instructions (one test
of mutex_is_locked and one increment of a percpu variable), so it is no
threat to real-time response. There are plenty of longer regions in the
kernel that are executed with interrupts or preemption disabled.
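
For reference, the entire preempt-disabled window in update_fast_ctr()
above is just:

	preempt_disable();
	if (likely(!mutex_is_locked(&brw->writer_mutex))) {	/* one test */
		__this_cpu_add(*brw->fast_read_ctr, val);	/* one per-cpu add */
		success = true;
	}
	preempt_enable();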

Mikulas
