Message-Id: <20190401172257.GN4102@linux.ibm.com>
Date: Mon, 1 Apr 2019 10:22:57 -0700
From: "Paul E. McKenney" <paulmck@...ux.ibm.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: rcu@...r.kernel.org, linux-kernel@...r.kernel.org,
mingo@...nel.org, jiangshanlai@...il.com, dipankar@...ibm.com,
akpm@...ux-foundation.org, mathieu.desnoyers@...icios.com,
josh@...htriplett.org, tglx@...utronix.de, rostedt@...dmis.org,
dhowells@...hat.com, edumazet@...gle.com, fweisbec@...il.com,
oleg@...hat.com, joel@...lfernandes.org,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH tip/core/rcu 2/2] rcu: Check for wakeup-safe conditions
in rcu_read_unlock_special()
On Mon, Apr 01, 2019 at 10:32:11AM +0200, Peter Zijlstra wrote:
> On Fri, Mar 29, 2019 at 11:26:34AM -0700, Paul E. McKenney wrote:
> > When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
> > kthreads, a full and unconditional wakeup is required to initiate RCU
> > core processing. In contrast, when RCU core processing is carried
> > out by RCU_SOFTIRQ, a raise_softirq() suffices. Of course, there are
> > situations where raise_softirq() does a full wakeup, but these do not
> > occur with normal usage of rcu_read_unlock().
>
> Do we have a comment somewhere explaining why?
First, thank you for reviewing this!
The "why" is because people normally don't do things like the code
sequence shown below, but where the scheduler holds locks across the
second RCU read-side critical section. (If they did, lockdep would
complain. Nevertheless, it is good to avoid this potential problem.)
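For concreteness, here is a purely hypothetical sketch of such a
sequence, with rq->lock standing in for whichever scheduler rq or pi
lock might be involved (nothing below is taken from actual kernel code):

	rcu_read_lock();
	/* preempted */
	local_irq_disable();
	rcu_read_unlock();		/* Wakeup safe, sets ->deferred_qs. */
	raw_spin_lock(&rq->lock);	/* Scheduler lock, irqs already off. */
	rcu_read_lock();
	rcu_read_unlock();		/* A wakeup here would want rq->lock. */
	raw_spin_unlock(&rq->lock);
	local_irq_enable();

If the second rcu_read_unlock() attempted a wakeup, it could deadlock
on rq->lock, which is the hazard this patch is defending against.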
> > The initial solution to this problem was to use set_tsk_need_resched() and
> > set_preempt_need_resched() to force a future context switch, which allows
> > rcu_preempt_note_context_switch() to report the deferred quiescent state
> > to RCU's core processing. Unfortunately for expedited grace periods,
> > there can be a significant delay between the call for a context switch
> > and the actual context switch.
>
> This is all PREEMPT=y kernels, right? Where is the latency coming from?
> Because PREEMPT=y _should_ react quite quickly.
Yes, PREEMPT=y. It happens like this:
1. rcu_read_lock() with everything enabled.
2. Preemption then resumption.
3. local_irq_disable().
4. rcu_read_unlock().
5. local_irq_enable().
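In code form, that sequence looks like this (an illustrative sketch,
with comments keyed to the step numbers above):

	rcu_read_lock();	/* 1: Everything enabled. */
				/* 2: Preempted here, later resumed. */
	local_irq_disable();	/* 3 */
	rcu_read_unlock();	/* 4: QS deferred, irqs still disabled. */
	local_irq_enable();	/* 5: Scheduler still not informed. */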
From what I know, the scheduler doesn't see anything until the next
interrupt, local_bh_enable(), return to userspace, etc.  Because this
is PREEMPT=y, cond_resched() is a no-op, and the sequence above contains
no preempt_enable() to trigger preemption.  So it could be some time
(milliseconds, depending on HZ, NO_HZ_FULL, and so on) until the
scheduler responds.  With NO_HZ_FULL, last I knew, the delay can be
extremely long.
Or am I missing something that gets the scheduler on the job faster?
Hmmm... If your point is that this amount of delay matters only for
expedited grace periods, you are quite right. So perhaps I shouldn't be
doing any of the expensive stuff unless there is an expedited grace period
in flight. Or if NO_HZ_FULL. See below for an updated (and untested)
patch to this effect.
> > This commit therefore introduces a ->deferred_qs flag to the task_struct
> > structure's rcu_special structure. This flag is initially false, and
> > is set to true by the first call to rcu_read_unlock() requiring special
> > attention, then finally reset back to false when the quiescent state is
> > finally reported. Then rcu_read_unlock() attempts full wakeups only when
> > ->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
> > special attention. Note that a chain of RCU readers linked by some other
> > sort of reader may find that a later rcu_read_unlock() is once again able
> > to do a full wakeup, courtesy of an intervening preemption:
> >
> > 	rcu_read_lock();
> > 	/* preempted */
> > 	local_irq_disable();
> > 	rcu_read_unlock();	/* Can do full wakeup, sets ->deferred_qs. */
> > 	rcu_read_lock();
> > 	local_irq_enable();
> > 	preempt_disable();
> > 	rcu_read_unlock();	/* Cannot do full wakeup, ->deferred_qs set. */
> > 	rcu_read_lock();
> > 	preempt_enable();
> > 	/* preempted, ->deferred_qs reset. */
>
> As it would have without ->deferred_qs and just having done the above,
> which was deemed insufficient.
>
> So I'm really puzzled by the need for all this.
On the first round, with ->deferred_qs not yet set, we know the
scheduler cannot be holding any of its pi or rq locks: if it were, it
would have disabled interrupts across the entire RCU read-side critical
section.  Wakeups are therefore safe in this case, whether via
raise_softirq() or via a direct wakeup.  Afterwards, we don't have that
guarantee, because an earlier critical section might have been preempted
and the scheduler might have held one of its locks across the entire
just-ended critical section.
And I believe you are right that we should avoid the wakeups unless
there is an expedited grace period in flight or the kernel is built
with NO_HZ_FULL.  Hence the patch shown below.
Thanx, Paul
-----------------------------------------------------------------------
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 2e52a77af6be..582c6d88aaa0 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -606,20 +606,26 @@ static void rcu_read_unlock_special(struct task_struct *t)
 	local_irq_save(flags);
 	irqs_were_disabled = irqs_disabled_flags(flags);
 	if (preempt_bh_were_disabled || irqs_were_disabled) {
+		bool exp;
+
 		t->rcu_read_unlock_special.b.exp_hint = false;
+		exp = !!READ_ONCE(this_cpu_ptr(&rcu_data)->mynode->exp_tasks);
 		// Need to defer quiescent state until everything is enabled.
-		if (irqs_were_disabled && use_softirq &&
+		if ((exp || IS_ENABLED(CONFIG_NO_HZ_FULL)) &&
+		    irqs_were_disabled && use_softirq &&
 		    (in_irq() || !t->rcu_read_unlock_special.b.deferred_qs)) {
 			// Using softirq, safe to awaken, and we get
 			// no help from enabling irqs, unlike bh/preempt.
 			raise_softirq_irqoff(RCU_SOFTIRQ);
-		} else if (irqs_were_disabled && !use_softirq &&
+		} else if ((exp || IS_ENABLED(CONFIG_NO_HZ_FULL)) &&
+			   irqs_were_disabled && !use_softirq &&
 			   !t->rcu_read_unlock_special.b.deferred_qs) {
 			// Safe to awaken and we get no help from enabling
 			// irqs, unlike bh/preempt.
 			invoke_rcu_core();
 		} else {
 			// Enabling BH or preempt does reschedule, so...
+			// Also if no expediting or NO_HZ_FULL, slow is OK.
 			set_tsk_need_resched(current);
 			set_preempt_need_resched();
 		}