linux-kernel - Re: [RFC PATCH 06/16] sched: Add core wide task selection and scheduling.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200701232847.GA439212@google.com>
Date:   Wed, 1 Jul 2020 19:28:47 -0400
From:   Joel Fernandes <joel@...lfernandes.org>
To:     Vineeth Remanan Pillai <vpillai@...italocean.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Tim Chen <tim.c.chen@...ux.intel.com>
Cc:     Nishanth Aravamudan <naravamudan@...italocean.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        tglx@...utronix.de, pjt@...gle.com, torvalds@...ux-foundation.org,
        linux-kernel@...r.kernel.org, subhra.mazumdar@...cle.com,
        mingo@...nel.org, fweisbec@...il.com, keescook@...omium.org,
        kerrnel@...gle.com, Phil Auld <pauld@...hat.com>,
        Aaron Lu <aaron.lwe@...il.com>,
        Aubrey Li <aubrey.intel@...il.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>, vineethrp@...il.com,
        Chen Yu <yu.c.chen@...el.com>,
        Christian Brauner <christian.brauner@...ntu.com>,
        Aaron Lu <aaron.lu@...ux.alibaba.com>, paulmck@...nel.org
Subject: Re: [RFC PATCH 06/16] sched: Add core wide task selection and
 scheduling.

On Tue, Jun 30, 2020 at 09:32:27PM +0000, Vineeth Remanan Pillai wrote:
> From: Peter Zijlstra <peterz@...radead.org>
> 
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).
> 
> There could be races in core scheduler where a CPU is trying to pick
> a task for its sibling in core scheduler, when that CPU has just been
> offlined.  We should not schedule any tasks on the CPU in this case.
> Return an idle task in pick_next_task for this situation.
> 
> NOTE: there is still potential for siblings rivalry.
> NOTE: this is far too complicated; but thus far I've failed to
>       simplify it further.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Signed-off-by: Julien Desfossez <jdesfossez@...italocean.com>
> Signed-off-by: Vineeth Remanan Pillai <vpillai@...italocean.com>
> Signed-off-by: Aaron Lu <aaron.lu@...ux.alibaba.com>
> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>

Hi Peter, Tim, all, the below patch fixes the hotplug issue described in the
below patch's Link tag. Patch description below describes the issues fixed
and it applies on top of this patch.

------8<----------

From: "Joel Fernandes (Google)" <joel@...lfernandes.org>
Subject: [PATCH] sched: Fix CPU hotplug causing crashes in task selection logic

The selection logic does not run correctly if the current CPU is not in the
cpu_smt_mask (which it is not because the CPU is offlined when the stopper
finishes running and needs to switch to idle).  There are also other issues
fixed by the patch I think such as: if some other sibling set core_pick to
something, however the selection logic on current cpu resets it before
selecting. In this case, we need to run the task selection logic again to
make sure it picks something if there is something to run. It might end up
picking the wrong task.  Yet another issue was, if the stopper thread is an
unconstrained pick, then rq->core_pick is set. The next time task selection
logic runs when stopper needs to switch to idle, the current CPU is not in
the smt_mask. This causes the previous ->core_pick to be picked again which
happens to be the unconstrained task! so the stopper keeps getting selected
forever.

That and there are a few more safe guards and checks around checking/setting
rq->core_pick. To test it, I ran rcutorture and made it tag all torture
threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit the
issue. Now it runs for an hour or so without issue. (Torture testing debug
changes: https://bit.ly/38htfqK ).

Various fixes were tried causing varying degrees of crashes.  Finally I found
that it is easiest to just add current CPU to the smt_mask's copy always.
This is so that task selection logic always runs on the current CPU which
called schedule().

Link: lore.kernel.org/r/20200414133559.GT20730@...ez.programming.kicks-ass.net
Reported-by: Tim Chen <tim.c.chen@...ux.intel.com>
Reported-by: Vineeth Pillai <vpillai@...italocean.com>
Signed-off-by: Joel Fernandes (Google) <joel@...lfernandes.org>
---
 kernel/sched/core.c | 35 +++++++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ede86fb37b4e8..a5604aa292e66 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4307,7 +4307,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
-	const struct cpumask *smt_mask;
+	struct cpumask select_mask;
 	int i, j, cpu, occ = 0;
 	bool need_sync;
 
@@ -4334,7 +4334,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	finish_prev_task(rq, prev, rf);
 
 	cpu = cpu_of(rq);
-	smt_mask = cpu_smt_mask(cpu);
+	/* Make a copy of cpu_smt_mask as we should not set that. */
+	cpumask_copy(&select_mask, cpu_smt_mask(cpu));
+
+	/*
+	 * Always make sure current CPU is added to smt_mask so that below
+	 * selection logic runs on it.
+	 */
+	cpumask_set_cpu(cpu, &select_mask);
 
 	/*
 	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
@@ -4351,7 +4358,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	/* reset state */
 	rq->core->core_cookie = 0UL;
-	for_each_cpu(i, smt_mask) {
+	for_each_cpu(i, &select_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
 		rq_i->core_pick = NULL;
@@ -4371,7 +4378,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 */
 	for_each_class(class) {
 again:
-		for_each_cpu_wrap(i, smt_mask, cpu) {
+		for_each_cpu_wrap(i, &select_mask, cpu) {
 			struct rq *rq_i = cpu_rq(i);
 			struct task_struct *p;
 
@@ -4402,6 +4409,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 */
 			if (i == cpu && !need_sync && !p->core_cookie) {
 				next = p;
+				rq_i->core_pick = next;
+				rq_i->core_sched_seq = rq_i->core->core_pick_seq;
 				goto done;
 			}
 
@@ -4427,7 +4436,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				max = p;
 
 				if (old_max) {
-					for_each_cpu(j, smt_mask) {
+					for_each_cpu(j, &select_mask) {
 						if (j == i)
 							continue;
 
@@ -4452,6 +4461,10 @@ next_class:;
 
 	rq->core->core_pick_seq = rq->core->core_task_seq;
 	next = rq->core_pick;
+
+	/* Something should have been selected for current CPU */
+	WARN_ON_ONCE(!next);
+
 	rq->core_sched_seq = rq->core->core_pick_seq;
 
 	/*
@@ -4462,7 +4475,7 @@ next_class:;
 	 * their task. This ensures there is no inter-sibling overlap between
 	 * non-matching user state.
 	 */
-	for_each_cpu(i, smt_mask) {
+	for_each_cpu(i, &select_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
 		WARN_ON_ONCE(!rq_i->core_pick);
@@ -4483,6 +4496,16 @@ next_class:;
 	}
 
 done:
+	/*
+	 * If we reset a sibling's core_pick, make sure that we picked a task
+	 * for it, this is because we might have reset it though it was set to
+	 * something by another selector. In this case we cannot leave it as
+	 * NULL and should have found something for it.
+	 */
+	for_each_cpu(i, &select_mask) {
+		WARN_ON_ONCE(!cpu_rq(i)->core_pick);
+	}
+
 	set_next_task(rq, next);
 	return next;
 }
-- 
2.27.0.212.ge8ba1cc988-goog