[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200526210947.GP2869@paulmck-ThinkPad-P72>
Date:   Tue, 26 May 2020 14:09:47 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     Joel Fernandes <joel@...lfernandes.org>
Cc:     Frederic Weisbecker <frederic@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Steven Rostedt <rostedt@...dmis.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Lai Jiangshan <jiangshanlai@...il.com>,
        Josh Triplett <josh@...htriplett.org>
Subject: Re: [PATCH 01/10] rcu: Directly lock rdp->nocb_lock on nocb code
 entrypoints
On Tue, May 26, 2020 at 04:18:40PM -0400, Joel Fernandes wrote:
> On Tue, May 26, 2020 at 09:29:46AM -0700, Paul E. McKenney wrote:
> > On Tue, May 26, 2020 at 11:21:37AM -0400, Joel Fernandes wrote:
> > > Hi Paul,
> > > 
> > > On Fri, May 22, 2020 at 10:57:39AM -0700, Paul E. McKenney wrote:
> > > > On Wed, May 20, 2020 at 08:29:49AM -0400, Joel Fernandes wrote:
> > > > > On Wed, May 13, 2020 at 06:47:05PM +0200, Frederic Weisbecker wrote:
> > > > > > Pure NOCB code entrypoints (nocb_cb kthread, nocb_gp kthread, nocb
> > > > > > timers) can unconditionally lock rdp->nocb_lock as they always execute
> > > > > > in the context of an offloaded rdp.
> > > > > > 
> > > > > > This also prepare for toggling CPUs to/from callback's offloaded mode
> > > > > > where the offloaded state will possibly change when rdp->nocb_lock
> > > > > > isn't taken. We'll still want the entrypoints to lock the rdp in any
> > > > > > case.
> > > > > 
> > > > > Suggested rewrite for change log:
> > > > > <wordsmithing>
> > > > > Make pure NOCB code entrypoints (nocb_cb kthread, nocb_gp kthread, nocb
> > > > > timers) unconditionally lock rdp->nocb_lock as they always execute in the
> > > > > context of an offloaded rdp.
> > > > > 
> > > > > This prepares for future toggling of CPUs to/from callback's offloaded mode
> > > > > where the offloaded state can change when rdp->nocb_lock is not held. We'll
> > > > > still want the entrypoints to lock the rdp in any case.
> > > > > </wordsmithing>
> > > > > 
> > > > > Also, can we inline rcu_nocb_lock_irqsave() into
> > > > > do_nocb_deferred_wakeup_common() since that's the only user, and then delete
> > > > > rcu_nocb_lock_irqsave() and the corresponding unlock? That would also remove
> > > > > confusion about which API to use for nocb locking (i.e. whether to directly
> > > > > acquire lock or call rcu_nocb_lock_irqsave()).
> > > > > 
> > > > > Reviewed-by: Joel Fernandes (Google) <joel@...lfernandes.org>
> > > > 
> > > > Thank you for looking this over, Joel!
> > > > 
> > > > Is it feasible to make rcu_nocb_lock*() and rcu_nocb_unlock*() "do the
> > > > right thing", even when things are changing?
> > > 
> > > One way to prevent things from changing could be to employ Steven's
> > > poor-man's RCU to mediate the "changing state" itself assuming the switch
> > > from the fast-path to the slow-path does not happen often. :) So whichever
> > > path is changing the state needs to wait for that poor-man's GP.
> > 
> > That should not be needed, given that acquiring ->nocb_lock on the CPU
> > corresponding to that lock suffices in both cases.  The trick is making
> > sure that the release matches the acquire.
> 
> Revisiting what you meant by "when things are changing", I'm assuming you
> meant a CPU is dynamically switched from non-offloaded mode to
> offloaded-mode. Please correct me if I'm wrong.
Both.  It does no good to acquire by disabling IRQs and then release by
releasing a lock that you do not hold.  Nor does it help to acquire the
lock and then fail to release it, instead merely re-enabling IRQs.
Aside from the case where two locks are held, a per-CPU variable (as
in another field in the rcu_data structure would work.  And that case
could probably be reworked to hold only one lock at a time.
> Assuming that's true, you asked how do we "do the right thing" in the
> lock/unlock APIs. I was also suggesting getting rid of them and directly
> acquiring/releasing the spinlock, like Frederic does. It sounds like
> that is not good enough and you want an API that can do conditional locking
> (and the said complexity is hidden behind the API). Allow me to read more
> code and see if I can understand that / how to do that.
Indeed, the non-nohz_full code should not acquire the spinlock.
> > > Any case, are you concerned with the performance issues with the
> > > unconditional locking and that's why you suggest still keeping it conditional?
> > 
> > My concerns are more about maintainability.
> 
> Ok.
> 
> > > Also, coming back to my point of inline the helper into the last caller -
> > > do_nocb_deferred_wakeup_common(). I think since we don't need to check if the
> > > rdp is offloaded and do any conditional locking. The timer can be called only
> > > with offloaded rdp. So we can directly do the unconditional spinlock instead
> > > of using the rcu_nocb_lock helper there.
> > 
> > Indeed we can.  But should we?
> 
> Yeah may be not, in the event that we could do conditional locking and
> benefit.
Underneath an API, as current mainline does.
> > > > would prevent any number of "interesting" copy-pasta and "just now became
> > > > common code" bugs down the road.  And because irqs are disabled while
> > > > holding the lock, it should be possible to keep state on a per-CPU basis.
> > > 
> > > Agreed, that would be nice. Though if we could keep simple, that'd be nice
> > > too.
> > 
> > Having one set of functions/macros that are always used to protect
> > the ->cblist, no matter what the context, is a very attractive form of
> > simple.  ;-)
> 
> I was thinking that API is already raw_spin_lock/unlock() but I'll revisit
> everything.
It is the function/macro family that includes rcu_nocb_lock() and
rcu_nocb_unlock(), as you can see from Frederic's first patch.
> > > > The ugliest scenario is callback adoption, where there are two ->cblist
> > > > structures in need of being locked.  In that case, changes are excluded
> > > > (because that is in CPU hotplug code), but is it possible to take
> > > > advantage of that reasonably?
> > > 
> > > Could you describe this a bit more? Thanks.
> > 
> > Right now, callbacks are merged directly from the outgoing CPU's ->cblist
> > to the surviving CPU's ->cblist.  This means that both ->cblist structures
> > must be locked at the same time, which would require additional state.
> > After all, if only the one ->cblist were to be locked at a given time,
> > a per-CPU variable could be used to track what method should be used to
> > unlock the ->cblist.
> 
> So you do mean conditional locking behind an API, and everyone call the API
> whether they want do the conditional locking or not. Ok.
Yes, as the code in mainline is now.
> > This could be restructured to use an intermediate private ->cblist,
> > but care would be required with the counts, as it is a very bad idea
> > for a large group of callbacks to become invisible.  (This is not a
> > problem for rcu_barrier(), which excludes CPU hotplug, but it could
> > be a serious problem for callback-flood-mitigation mechanisms.  Yes,
> > they are heuristic, but...)
> 
> Ok.
> 
> > > > Maybe these changes are the best we can do, but it would be good to
> > > > if the same primitive locked a ->cblist regardless of context.
> > > 
> > > Here you are comparing 2 primitives. Do you mean that just IRQs being
> > > disabled is another primitive, and rcu_nocb_lock is another one?
> > 
> > I am not sure what this question means, but I am advocating looking
> > into retaining a single wrapper that decides instead of direct use of
> > the underlying primitives.
> 
> Yep.
> 
> > Or are you instead asking why there are two different methods of
> > protecting the ->cblist structures?  (If so, because call_rcu() happens
> > often enough that we don't want lock-acquisition overhead unless we
> > absolutely need it, which we do on nohz_full CPUs but not otherwise.)
> 
> Yeah that's what I was asking. About lock-acquisition overhead, I think its
> still uncontended overhead though because even if the nocb lock is taken when
> it was not needed, it is still to lock the local ->cblist. Correct me if I'm
> wrong though!
It is uncontended, but it is unnecessary overhead.  And call_rcu() can
be invoked rather frequently.
> > > BTW, I'm really itching to give it a try to make the scheduler more deadlock
> > > resilient (that is, if the scheduler wake up path detects a deadlock, then it
> > > defers the wake up using timers, or irq_work on its own instead of passing
> > > the burden of doing so to the callers). Thoughts?
> > 
> > I have used similar approaches within RCU, but on the other hand the
> > scheduler often has tighter latency constraints than RCU does.	So I
> > think that is a better question for the scheduler maintainers than it
> > is for me.  ;-)
> 
> Ok, it definitely keeps coming up in my radar first with the
> rcu_read_unlock_special() stuff, and now the nocb ;-). Perhaps it could also
> be good for a conference discussion!
Again, please understand that RCU has way looser latency constraints
than the scheduler does.  Adding half a jiffy to wakeup latency might
not go over well, especially in the real-time application area.
But what did the scheduler maintainers say about this idea?
							Thanx, Paul
> thanks,
> 
>  - Joel
> 
> > 							Thanx, Paul
> > 
> > > thanks,
> > > 
> > >  - Joel
> > > 
> > > 
> > > > Can that be made to work reasonably?
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > thanks,
> > > > > 
> > > > >  - Joel
> > > > > 
> > > > > 
> > > > > > 
> > > > > > Signed-off-by: Frederic Weisbecker <frederic@...nel.org>
> > > > > > Cc: Paul E. McKenney <paulmck@...nel.org>
> > > > > > Cc: Josh Triplett <josh@...htriplett.org>
> > > > > > Cc: Steven Rostedt <rostedt@...dmis.org>
> > > > > > Cc: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
> > > > > > Cc: Lai Jiangshan <jiangshanlai@...il.com>
> > > > > > Cc: Joel Fernandes <joel@...lfernandes.org>
> > > > > > ---
> > > > > >  kernel/rcu/tree_plugin.h | 14 +++++++-------
> > > > > >  1 file changed, 7 insertions(+), 7 deletions(-)
> > > > > > 
> > > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > > index 097635c41135..523570469864 100644
> > > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > > @@ -1909,7 +1909,7 @@ static void do_nocb_bypass_wakeup_timer(struct timer_list *t)
> > > > > >  	struct rcu_data *rdp = from_timer(rdp, t, nocb_bypass_timer);
> > > > > >  
> > > > > >  	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Timer"));
> > > > > > -	rcu_nocb_lock_irqsave(rdp, flags);
> > > > > > +	raw_spin_lock_irqsave(&rdp->nocb_lock, flags);
> > > > > >  	smp_mb__after_spinlock(); /* Timer expire before wakeup. */
> > > > > >  	__call_rcu_nocb_wake(rdp, true, flags);
> > > > > >  }
> > > > > > @@ -1942,7 +1942,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> > > > > >  	 */
> > > > > >  	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_cb_rdp) {
> > > > > >  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
> > > > > > -		rcu_nocb_lock_irqsave(rdp, flags);
> > > > > > +		raw_spin_lock_irqsave(&rdp->nocb_lock, flags);
> > > > > >  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > > > > >  		if (bypass_ncbs &&
> > > > > >  		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
> > > > > > @@ -1951,7 +1951,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> > > > > >  			(void)rcu_nocb_try_flush_bypass(rdp, j);
> > > > > >  			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > > > > >  		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
> > > > > > -			rcu_nocb_unlock_irqrestore(rdp, flags);
> > > > > > +			raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
> > > > > >  			continue; /* No callbacks here, try next. */
> > > > > >  		}
> > > > > >  		if (bypass_ncbs) {
> > > > > > @@ -1996,7 +1996,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> > > > > >  		} else {
> > > > > >  			needwake = false;
> > > > > >  		}
> > > > > > -		rcu_nocb_unlock_irqrestore(rdp, flags);
> > > > > > +		raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
> > > > > >  		if (needwake) {
> > > > > >  			swake_up_one(&rdp->nocb_cb_wq);
> > > > > >  			gotcbs = true;
> > > > > > @@ -2084,7 +2084,7 @@ static void nocb_cb_wait(struct rcu_data *rdp)
> > > > > >  	rcu_do_batch(rdp);
> > > > > >  	local_bh_enable();
> > > > > >  	lockdep_assert_irqs_enabled();
> > > > > > -	rcu_nocb_lock_irqsave(rdp, flags);
> > > > > > +	raw_spin_lock_irqsave(&rdp->nocb_lock, flags);
> > > > > >  	if (rcu_segcblist_nextgp(&rdp->cblist, &cur_gp_seq) &&
> > > > > >  	    rcu_seq_done(&rnp->gp_seq, cur_gp_seq) &&
> > > > > >  	    raw_spin_trylock_rcu_node(rnp)) { /* irqs already disabled. */
> > > > > > @@ -2092,7 +2092,7 @@ static void nocb_cb_wait(struct rcu_data *rdp)
> > > > > >  		raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
> > > > > >  	}
> > > > > >  	if (rcu_segcblist_ready_cbs(&rdp->cblist)) {
> > > > > > -		rcu_nocb_unlock_irqrestore(rdp, flags);
> > > > > > +		raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
> > > > > >  		if (needwake_gp)
> > > > > >  			rcu_gp_kthread_wake();
> > > > > >  		return;
> > > > > > @@ -2100,7 +2100,7 @@ static void nocb_cb_wait(struct rcu_data *rdp)
> > > > > >  
> > > > > >  	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("CBSleep"));
> > > > > >  	WRITE_ONCE(rdp->nocb_cb_sleep, true);
> > > > > > -	rcu_nocb_unlock_irqrestore(rdp, flags);
> > > > > > +	raw_spin_unlock_irqrestore(&rdp->nocb_lock, flags);
> > > > > >  	if (needwake_gp)
> > > > > >  		rcu_gp_kthread_wake();
> > > > > >  	swait_event_interruptible_exclusive(rdp->nocb_cb_wq,
> > > > > > -- 
> > > > > > 2.25.0
> > > > > > 
Powered by blists - more mailing lists
 
