Message-Id: <20170413165019.GH3956@linux.vnet.ibm.com>
Date: Thu, 13 Apr 2017 09:50:19 -0700
From: "Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: linux-kernel@...r.kernel.org, mingo@...nel.org,
jiangshanlai@...il.com, dipankar@...ibm.com,
akpm@...ux-foundation.org, mathieu.desnoyers@...icios.com,
josh@...htriplett.org, tglx@...utronix.de, rostedt@...dmis.org,
dhowells@...hat.com, edumazet@...gle.com, fweisbec@...il.com,
oleg@...hat.com, bobby.prani@...il.com
Subject: Re: [PATCH tip/core/rcu 40/40] srcu: Parallelize callback handling
On Thu, Apr 13, 2017 at 11:54:20AM +0200, Peter Zijlstra wrote:
> On Wed, Apr 12, 2017 at 10:40:25AM -0700, Paul E. McKenney wrote:
> > Peter Zijlstra proposed using SRCU to reduce mmap_sem contention [1];
> > however, there are workloads that could result in a high volume of
> > concurrent invocations of call_srcu(), which with current SRCU would
> > result in excessive lock contention on the srcu_struct structure's
> > ->queue_lock, which protects SRCU's callback lists. This commit therefore
> > moves SRCU to per-CPU callback lists, thus greatly reducing contention.
> >
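
To make the contention point concrete, below is a minimal sketch of the
before/after locking structure.  The field names follow the commit text,
but these definitions are illustrative rather than the actual kernel ones:

#include <linux/spinlock.h>
#include <linux/rcu_segcblist.h>

/* Old scheme: every call_srcu() on every CPU takes this one lock. */
struct srcu_struct_old_sketch {
	spinlock_t queue_lock;		/* Global point of contention. */
	struct rcu_head *queue;		/* Single centralized callback list. */
};

/* New scheme: call_srcu() takes only the invoking CPU's lock. */
struct srcu_data_sketch {
	spinlock_t lock;			/* Per-CPU, rarely contended. */
	struct rcu_segcblist srcu_cblist;	/* Segmented callback list. */
};

struct srcu_struct_new_sketch {
	struct srcu_data_sketch __percpu *sda;	/* Per-CPU state. */
};
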
> > Because a given SRCU instance no longer has a single centralized callback
> > list, starting grace periods and invoking callbacks each require a bit
> > more work. These are handled using an srcu_node tree that is in some ways
> > similar to the rcu_node trees used by RCU-bh, RCU-preempt, and RCU-sched
> > (for example, the srcu_node tree shape is controlled by exactly the
> > same Kconfig options and boot parameters that control the shape of the
> > rcu_node tree).
> >
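
For a concrete picture of the tree shape, assume (illustratively) eight
CPUs and a leaf fanout of four via CONFIG_RCU_FANOUT_LEAF.  The combining
tree then looks like the diagram below, each leaf srcu_node fanning out
to per-CPU srcu_data structures; the node definition is a sketch, not
the verbatim kernel structure:

/*
 *                      root srcu_node
 *                     /              \
 *           leaf srcu_node        leaf srcu_node
 *           /  |    |   \          /   |   |   \
 *        sdp0 sdp1 sdp2 sdp3    sdp4 sdp5 sdp6 sdp7   (per-CPU srcu_data)
 */
struct srcu_node_sketch {
	spinlock_t lock;
	unsigned long srcu_have_cbs[4];		/* GP seqs with queued CBs. */
	struct srcu_node_sketch *srcu_parent;	/* NULL at the root. */
};
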
> > In addition, the old per-CPU srcu_array structure is now named srcu_data
> > and contains an rcu_segcblist structure named ->srcu_cblist for its
> > callbacks (and a spinlock to protect this). The srcu_struct gets
> > an srcu_gp_seq that is used to associate callback segments with the
> > corresponding completion-time grace-period number. These completion-time
> > grace-period numbers are propagated up the srcu_node tree so that the
> > grace-period workqueue handler can determine both whether additional
> > grace periods are needed and where to look for callbacks that are
> > ready to be invoked.
> >
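
Roughly, the enqueue path then looks like the sketch below, simplified
from the real code (grace-period kickoff elided), though the
rcu_segcblist and rcu_seq helpers named here are the real APIs:

static void call_srcu_sketch(struct srcu_struct *sp, struct rcu_head *rhp,
			     rcu_callback_t func)
{
	struct srcu_data *sdp;

	rhp->func = func;
	local_irq_disable();
	sdp = this_cpu_ptr(sp->sda);
	spin_lock(&sdp->lock);
	/* Enqueue on this CPU's list... */
	rcu_segcblist_enqueue(&sdp->srcu_cblist, rhp, false);
	/* ...and tag the new segment with the GP seq it must wait for. */
	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
				       rcu_seq_snap(&sp->srcu_gp_seq));
	spin_unlock_irq(&sdp->lock);
	/* Then ensure that a grace period is in flight (not shown). */
}
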
> > The srcu_barrier() function must now wait on all instances of the
> > per-CPU ->srcu_cblist. Because each ->srcu_cblist is protected
> > by ->lock, srcu_barrier() can remotely add the needed callbacks.
> > In theory, it could also remotely start grace periods, but this gets
> > complex and racy. And interestingly enough, it is never necessary to
> > start a grace period in this case because srcu_barrier() only enqueues
> > a callback when a callback is already present. And a grace period has
> > to have already been started for this pre-existing callback. And it is
> > only the callback that srcu_barrier() needs to wait on, not any particular
> > grace period. Therefore, a new rcu_segcblist_entrain() function enqueues
> > the srcu_barrier() function's callback into the same segment occupied by
> > the pre-existing callback. The special case where all the pre-existing
> > callbacks are on a different list being invoked is handled by enqueuing
> > srcu_barrier()'s callback into the RCU_DONE_TAIL segment, relying on
> > the done-callbacks check that takes place after all callbacks are invoked.
> >
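
The per-CPU step of srcu_barrier() then reduces to roughly the sketch
below, simplified from the real function (which also handles debug
objects); srcu_barrier_cpu_cnt counts callbacks still outstanding:

static void srcu_barrier_one_cpu_sketch(struct srcu_struct *sp,
					struct srcu_data *sdp)
{
	spin_lock_irq(&sdp->lock);
	atomic_inc(&sp->srcu_barrier_cpu_cnt);
	sdp->srcu_barrier_head.func = srcu_barrier_cb;
	if (!rcu_segcblist_entrain(&sdp->srcu_cblist,
				   &sdp->srcu_barrier_head, 0))
		atomic_dec(&sp->srcu_barrier_cpu_cnt); /* List was empty. */
	spin_unlock_irq(&sdp->lock);
}
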
> > Note that the readers use the same algorithm as before, and that there
> > is a separate srcu_idx that tells the readers which counter to increment.
> > This srcu_idx unfortunately cannot be combined with srcu_gp_seq because
> > the two need to be incremented at different times.
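
For reference, the (unchanged) reader fastpath is roughly as follows,
simplified from the actual __srcu_read_lock(): srcu_idx selects which of
the two counter pairs this reader increments, and the grace-period code
flips srcu_idx and then waits for the previous pair to drain:

static int srcu_read_lock_sketch(struct srcu_struct *sp)
{
	int idx = READ_ONCE(sp->srcu_idx) & 0x1;

	this_cpu_inc(sp->sda->srcu_lock_count[idx]);
	smp_mb(); /* Order counter increment before critical section. */
	return idx;
}
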
>
> So one thing I've asked before I think, would it not be possible to
> abstract PREEMPT_RCU and use the exact same code for PREEMPT_RCU and
> SRCU ?
I took a hard look at that some time ago, and it gets pretty ugly
pretty quickly. Much of the PREEMPT_RCU code has the assumption that
there is exactly one global PREEMPT_RCU instance baked deeply into it.
For but one example, the handling of an arbitrarily large number of
->blkd_tasks lists at context-switch time would not be pretty, especially
if the task in question blocked while in both a PREEMPT_RCU and in an
SRCU read-side critical section. Or, worse yet, if it blocked while in
several different SRCU read-side critical sections.
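
To see why, consider the context-switch bookkeeping.  Today a single
list_head in the task structure suffices; something like the following
purely hypothetical structure (all names here are invented for
illustration) would be needed once a task can block in several
instances at once:

/* Today: one global flavor, so one linkage per task is enough. */
struct task_struct_fragment {
	struct list_head rcu_node_entry;  /* On at most one ->blkd_tasks. */
};

/* Hypothetical per-instance variant: one linkage per instance that the
 * task was inside of when it was preempted, found or allocated at
 * context-switch time -- that is, in the scheduler fastpath.
 */
struct blkd_entry_hypothetical {
	struct list_head entry;		/* That instance's ->blkd_tasks. */
	struct srcu_struct *ssp;	/* Which instance is being blocked. */
	struct list_head task_entry;	/* Chained off the task. */
};
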
It might be easier to go the other way and implement PREEMPT_RCU in terms
of SRCU, but I don't believe that the read-side smp_mb() calls would make
people happy. Plus there are use cases that would not be well-served
by idle no longer being an extended quiescent state. And SRCU currently
has inconvenient restrictions on use in interrupt and NMI handlers.
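
The barrier gap is visible by comparing read-lock fastpaths; roughly,
and simplified from the actual implementations:

/* PREEMPT_RCU read lock: no memory barrier at all, just a per-task
 * nesting count and a compiler barrier.
 */
static void rcu_read_lock_sketch(void)
{
	current->rcu_read_lock_nesting++;
	barrier();
}

/* SRCU's read lock (see the earlier sketch) instead needs a full
 * smp_mb() on entry and another on exit, paid by every reader.
 */
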
It might well be that there is a global solution for all this, but
in the meantime I am instead sharing common code and doing a bit of
consolidation.
Thanx, Paul