linux-kernel - Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and scheduling.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200901033842.GA1952605@google.com>
Date:   Mon, 31 Aug 2020 23:38:42 -0400
From:   Joel Fernandes <joel@...lfernandes.org>
To:     peterz@...radead.org
Cc:     Vineeth Pillai <viremana@...ux.microsoft.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        Tim Chen <tim.c.chen@...ux.intel.com>,
        Aaron Lu <aaron.lwe@...il.com>,
        Aubrey Li <aubrey.intel@...il.com>,
        Dhaval Giani <dhaval.giani@...cle.com>,
        Chris Hyser <chris.hyser@...cle.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        mingo@...nel.org, tglx@...utronix.de, pjt@...gle.com,
        torvalds@...ux-foundation.org, linux-kernel@...r.kernel.org,
        fweisbec@...il.com, keescook@...omium.org, kerrnel@...gle.com,
        Phil Auld <pauld@...hat.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>, vineeth@...byteword.org,
        Chen Yu <yu.c.chen@...el.com>,
        Christian Brauner <christian.brauner@...ntu.com>,
        Agata Gruza <agata.gruza@...el.com>,
        Antonio Gomez Iglesias <antonio.gomez.iglesias@...el.com>,
        graf@...zon.com, konrad.wilk@...cle.com, dfaggioli@...e.com,
        rostedt@...dmis.org, derkling@...gle.com, benbjiang@...cent.com,
        Vineeth Remanan Pillai <vpillai@...italocean.com>,
        Aaron Lu <aaron.lu@...ux.alibaba.com>
Subject: Re: [RFC PATCH v7 08/23] sched: Add core wide task selection and
 scheduling.

On Sat, Aug 29, 2020 at 09:47:19AM +0200, peterz@...radead.org wrote:
> On Fri, Aug 28, 2020 at 06:02:25PM -0400, Vineeth Pillai wrote:
> > On 8/28/20 4:51 PM, Peter Zijlstra wrote:
> 
> > > So where do things go side-ways?
> 
> > During hotplug stress test, we have noticed that while a sibling is in
> > pick_next_task, another sibling can go offline or come online. What
> > we have observed is smt_mask get updated underneath us even if
> > we hold the lock. From reading the code, looks like we don't hold the
> > rq lock when the mask is updated. This extra logic was to take care of that.
> 
> Sure, the mask is updated async, but _where_ is the actual problem with
> that?
> 
> On Fri, Aug 28, 2020 at 06:23:55PM -0400, Joel Fernandes wrote:
> > Thanks Vineeth. Peter, also the "v6+" series (which were some addons on v6)
> > detail the individual hotplug changes squashed into this patch:
> > https://lore.kernel.org/lkml/20200815031908.1015049-9-joel@joelfernandes.org/
> > https://lore.kernel.org/lkml/20200815031908.1015049-11-joel@joelfernandes.org/
> 
> That one looks fishy, the pick is core wide, making that pick_seq per rq
> just doesn't make sense.
> 
> > https://lore.kernel.org/lkml/20200815031908.1015049-12-joel@joelfernandes.org/
> 
> This one reads like tinkering, there is no description of the actual
> problem just some code that makes a symptom go away.
> 
> Sure, on hotplug the smt mask can change, but only for a CPU that isn't
> actually scheduling, so who cares.
> 
> /me re-reads the hotplug code...
> 
> ..ooOO is the problem that we clear the cpumasks on take_cpu_down()
> instead of play_dead() ?! That should be fixable.

That is indeed the problem.

I was wondering, is there any harm in just selecting idle task if the CPU
calling schedule() is missing from cpu_smt_mask? Does it need to do a
core-wide selection?

That would be best, and avoid any unnecessary surgery of the already
complicated function. This is sort of what Tim was doing in v4 and v5.

Also what do we do if cpu_smt_mask changing while this function is running? I
tried something like the following and it solves the issues but the overhead
probably really sucks. I was thinking of doing a variation of the below that
just stored the cpu_smt_mask's rq pointers in an array of size SMTS_PER_CORE
on the stack, instead of a cpumask but I am not sure if that will keep the
code clean while still having similar storage overhead.

---8<-----------------------

>From 5e905e7e620177075a9bcf78fb0dc89a136434bb Mon Sep 17 00:00:00 2001
From: Joel Fernandes <joelaf@...gle.com>
Date: Tue, 30 Jun 2020 19:39:45 -0400
Subject: [PATCH] Fix CPU hotplug causing crashes in task selection logic

Signed-off-by: Joel Fernandes <joelaf@...gle.com>
---
 kernel/sched/core.c | 34 ++++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0362102fa3d2..47a21013ba0d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4464,7 +4464,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
-	const struct cpumask *smt_mask;
+	struct cpumask select_mask;
 	int i, j, cpu, occ = 0;
 	bool need_sync;
 
@@ -4499,7 +4499,13 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	finish_prev_task(rq, prev, rf);
 
 	cpu = cpu_of(rq);
-	smt_mask = cpu_smt_mask(cpu);
+	cpumask_copy(&select_mask, cpu_smt_mask(cpu));
+
+	/*
+	 * Always make sure current CPU is added to smt_mask so that below
+	 * selection logic runs on it.
+	 */
+	cpumask_set_cpu(cpu, &select_mask);
 
 	/*
 	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
@@ -4516,7 +4522,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 	/* reset state */
 	rq->core->core_cookie = 0UL;
-	for_each_cpu(i, smt_mask) {
+	for_each_cpu(i, &select_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
 		rq_i->core_pick = NULL;
@@ -4536,7 +4542,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 */
 	for_each_class(class) {
 again:
-		for_each_cpu_wrap(i, smt_mask, cpu) {
+		for_each_cpu_wrap(i, &select_mask, cpu) {
 			struct rq *rq_i = cpu_rq(i);
 			struct task_struct *p;
 
@@ -4600,7 +4608,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				trace_printk("max: %s/%d %lx\n", max->comm, max->pid, max->core_cookie);
 
 				if (old_max) {
-					for_each_cpu(j, smt_mask) {
+					for_each_cpu(j, &select_mask) {
 						if (j == i)
 							continue;
 
@@ -4625,6 +4633,10 @@ next_class:;
 
 	rq->core->core_pick_seq = rq->core->core_task_seq;
 	next = rq->core_pick;
+
+	/* Something should have been selected for current CPU*/
+	WARN_ON_ONCE(!next);
+
 	rq->core_sched_seq = rq->core->core_pick_seq;
 	trace_printk("picked: %s/%d %lx\n", next->comm, next->pid, next->core_cookie);
 
@@ -4636,7 +4648,7 @@ next_class:;
 	 * their task. This ensures there is no inter-sibling overlap between
 	 * non-matching user state.
 	 */
-	for_each_cpu(i, smt_mask) {
+	for_each_cpu(i, &select_mask) {
 		struct rq *rq_i = cpu_rq(i);
 
 		WARN_ON_ONCE(!rq_i->core_pick);
-- 
2.28.0.402.g5ffc5be6b7-goog