linux-kernel - Re: [PATCH] sched: Use RCU-sched in core-scheduling balancing logic

Message-ID: <20200324151223.GA19722@paulmck-ThinkPad-P72>
Date:   Tue, 24 Mar 2020 08:12:23 -0700
From:   "Paul E. McKenney" <paulmck@...nel.org>
To:     "Li, Aubrey" <aubrey.li@...ux.intel.com>
Cc:     Joel Fernandes <joel@...lfernandes.org>,
        linux-kernel@...r.kernel.org, vpillai <vpillai@...italocean.com>,
        Aaron Lu <aaron.lwe@...il.com>,
        Aubrey Li <aubrey.intel@...il.com>, peterz@...radead.org,
        Ben Segall <bsegall@...gle.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Ingo Molnar <mingo@...hat.com>,
        Juri Lelli <juri.lelli@...hat.com>,
        Mel Gorman <mgorman@...e.de>,
        Steven Rostedt <rostedt@...dmis.org>,
        Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [PATCH] sched: Use RCU-sched in core-scheduling balancing logic

On Tue, Mar 24, 2020 at 06:30:12AM -0700, Paul E. McKenney wrote:
> On Tue, Mar 24, 2020 at 11:01:27AM +0800, Li, Aubrey wrote:
> > On 2020/3/23 23:21, Joel Fernandes wrote:
> > > On Mon, Mar 23, 2020 at 02:58:18PM +0800, Li, Aubrey wrote:
> > >> On 2020/3/14 8:30, Paul E. McKenney wrote:
> > >>> On Fri, Mar 13, 2020 at 07:29:18PM -0400, Joel Fernandes (Google) wrote:
> > >>>> rcu_read_unlock() can incur an infrequent deadlock in
> > >>>> sched_core_balance(). Fix this by using the RCU-sched flavor instead.
> > >>>>
> > >>>> This fixes the following spinlock recursion observed when testing the
> > >>>> core scheduling patches on PREEMPT=y kernel on ChromeOS:
> > >>>>
> > >>>> [   14.998590] watchdog: BUG: soft lockup - CPU#0 stuck for 11s! [kworker/0:10:965]
> > >>>>
> > >>>
> > >>> The original could indeed deadlock, and this would avoid that deadlock.
> > >>> (The commit to solve this deadlock is sadly not yet in mainline.)
> > >>>
> > >>> Acked-by: Paul E. McKenney <paulmck@...nel.org>
> > >>
> > >> I saw this in dmesg with this patch, is it expected?
> > >>
> > >> [  117.000905] =============================
> > >> [  117.000907] WARNING: suspicious RCU usage
> > >> [  117.000911] 5.5.7+ #160 Not tainted
> > >> [  117.000913] -----------------------------
> > >> [  117.000916] kernel/sched/core.c:4747 suspicious rcu_dereference_check() usage!
> > >> [  117.000918] 
> > >>                other info that might help us debug this:
> > > 
> > > Sigh, this is because for_each_domain() expects rcu_read_lock(). From an RCU
> > > PoV, the code is correct (warning doesn't cause any issue).
> > > 
> > > To silence warning, we could replace the rcu_read_lock_sched() in my patch with:
> > > preempt_disable();
> > > rcu_read_lock();
> > > 
> > > and replace the unlock with:
> > > 
> > > rcu_read_unlock();
> > > preempt_enable();
> > > 
> > > That should take care of both the warning and the scheduler-related
> > > deadlock. Thoughts?
> > > 
> > 
> > How about this?
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a01df3e..7ff694e 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4743,7 +4743,6 @@ static void sched_core_balance(struct rq *rq)
> >  	int cpu = cpu_of(rq);
> >  
> >  	rcu_read_lock();
> > -	raw_spin_unlock_irq(rq_lockp(rq));
> >  	for_each_domain(cpu, sd) {
> >  		if (!(sd->flags & SD_LOAD_BALANCE))
> >  			break;
> > @@ -4754,7 +4753,6 @@ static void sched_core_balance(struct rq *rq)
> >  		if (steal_cookie_task(cpu, sd))
> >  			break;
> >  	}
> > -	raw_spin_lock_irq(rq_lockp(rq));
> >  	rcu_read_unlock();
> >  }
> 
> As an alternative, I am backporting the -rcu commit 2b5e19e20fc2 ("rcu:
> Make rcu_read_unlock_special() safe for rq/pi locks") to v5.6-rc6 and
> testing it.  The other support for this is already in mainline.  I just
> now started testing it, and will let you know how it goes.
> 
> If that works for you, and if the bug you are looking to fix is also
> in v5.5 or earlier, please let me know so that we can work out how to
> deal with the older releases.

And the backported patch below passes moderate rcutorture testing.
But the big question...  Does it work for you?

							Thanx, Paul
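
For readers following the thread, the change to the wakeup test in the patch below can be approximated in userspace. This is only a sketch under stated assumptions, not kernel code: the struct unlock_ctx type and the three helper functions are hypothetical names, and each kernel predicate (in_interrupt(), in_irq(), use_softirq, and so on) is replaced by a plain boolean. It models the case discussed above, where the scheduler holds an rq lock with interrupts disabled while an expedited grace period is pending.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: every kernel predicate becomes a plain boolean. */
struct unlock_ctx {
	bool use_softirq;        /* RCU core work runs in softirq context */
	bool irqs_were_disabled; /* irqs were off at rcu_read_unlock() time */
	bool in_interrupt;       /* stands in for in_interrupt() (old test) */
	bool in_irq;             /* stands in for in_irq() (new test) */
	bool exp;                /* expedited GP is waiting on this task or CPU;
	                          * the old code also set this for nohz_full CPUs */
	bool deferred_qs;        /* a deferred quiescent state is already pending */
};

/* Condition removed by the patch: true means RCU_SOFTIRQ gets raised. */
static bool old_raises_softirq(const struct unlock_ctx *c)
{
	return c->irqs_were_disabled && c->use_softirq &&
	       (c->in_interrupt || (c->exp && !c->deferred_qs));
}

/* Condition added by the patch: true means RCU_SOFTIRQ gets raised. */
static bool new_raises_softirq(const struct unlock_ctx *c)
{
	return c->use_softirq &&
	       (c->in_irq || (c->exp && !c->irqs_were_disabled));
}

/* Scheduler case: rq lock held, irqs off, not in a handler, expedited GP. */
static void check_rq_lock_case(void)
{
	struct unlock_ctx c = {
		.use_softirq = true,
		.irqs_were_disabled = true,
		.in_interrupt = false,
		.in_irq = false,
		.exp = true,
		.deferred_qs = false,
	};

	assert(old_raises_softirq(&c));  /* old code raised the softirq here */
	assert(!new_raises_softirq(&c)); /* new code takes the deferral path */
}
```

Calling check_rq_lock_case() from a small main() exercises both conditions. In the modeled case the old test raises the softirq, which outside of interrupt context can wake ksoftirqd and so re-enter the scheduler, while the new test falls through to the need-resched deferral path.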

------------------------------------------------------------------------

commit 0ed0648e23f6aca2a6543fee011d74ab674e08e8
Author: Paul E. McKenney <paulmck@...nel.org>
Date:   Tue Mar 24 06:20:14 2020 -0700

    rcu: Make rcu_read_unlock_special() safe for rq/pi locks
    
    The scheduler is currently required to hold rq/pi locks across the entire
    RCU read-side critical section or not at all.  This is inconvenient and
    leaves traps for the unwary, including the author of this commit.
    
    But now that excessively long grace periods enable scheduling-clock
    interrupts for holdout nohz_full CPUs, the nohz_full rescue logic in
    rcu_read_unlock_special() can be dispensed with.  In other words, the
    rcu_read_unlock_special() function can refrain from doing wakeups unless
    such wakeups are guaranteed safe.
    
    This commit therefore avoids unsafe wakeups, freeing the scheduler to
    hold rq/pi locks across rcu_read_unlock() even if the corresponding RCU
    read-side critical section might have been preempted.
    
    This commit is inspired by a patch from Lai Jiangshan:
    https://lore.kernel.org/lkml/20191102124559.1135-2-laijs@linux.alibaba.com
    This commit is further intended to be a step towards his goal of permitting
    the inlining of RCU-preempt's rcu_read_lock() and rcu_read_unlock().
    
    Cc: Lai Jiangshan <laijs@...ux.alibaba.com>
    [ paulmck: Backported to v5.6-rc6. ]
    Signed-off-by: Paul E. McKenney <paulmck@...nel.org>

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index c6ea81cd4189..2e1bfa0cc393 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -614,18 +614,16 @@ static void rcu_read_unlock_special(struct task_struct *t)
 		struct rcu_node *rnp = rdp->mynode;
 
 		exp = (t->rcu_blocked_node && t->rcu_blocked_node->exp_tasks) ||
-		      (rdp->grpmask & READ_ONCE(rnp->expmask)) ||
-		      tick_nohz_full_cpu(rdp->cpu);
+		      (rdp->grpmask & READ_ONCE(rnp->expmask));
 		// Need to defer quiescent state until everything is enabled.
-		if (irqs_were_disabled && use_softirq &&
-		    (in_interrupt() ||
-		     (exp && !t->rcu_read_unlock_special.b.deferred_qs))) {
-			// Using softirq, safe to awaken, and we get
-			// no help from enabling irqs, unlike bh/preempt.
+		if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) {
+			// Using softirq, safe to awaken, and either the
+			// wakeup is free or there is an expedited GP.
 			raise_softirq_irqoff(RCU_SOFTIRQ);
 		} else {
 			// Enabling BH or preempt does reschedule, so...
-			// Also if no expediting or NO_HZ_FULL, slow is OK.
+			// Also if no expediting, slow is OK.
+			// Plus nohz_full CPUs eventually get tick enabled.
 			set_tsk_need_resched(current);
 			set_preempt_need_resched();
 			if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled &&