linux-kernel - Re: [RFC PATCH 07/13] sched: Add core wide task selection and scheduling.

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200416033804.GA5712@HP-G1>
Date:   Thu, 16 Apr 2020 11:39:05 +0800
From:   Chen Yu <yu.chen.surf@...il.com>
To:     vpillai <vpillai@...italocean.com>
Cc:     Nishanth Aravamudan <naravamudan@...italocean.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Tim Chen <tim.c.chen@...ux.intel.com>, mingo@...nel.org,
        tglx@...utronix.de, pjt@...gle.com, torvalds@...ux-foundation.org,
        linux-kernel@...r.kernel.org, fweisbec@...il.com,
        keescook@...omium.org, kerrnel@...gle.com,
        Phil Auld <pauld@...hat.com>, Aaron Lu <aaron.lwe@...il.com>,
        Aubrey Li <aubrey.intel@...il.com>, aubrey.li@...ux.intel.com,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Joel Fernandes <joelaf@...gle.com>, joel@...lfernandes.org,
        Aaron Lu <aaron.lu@...ux.alibaba.com>,
        Long Cui <long.cui@...el.com>
Subject: Re: [RFC PATCH 07/13] sched: Add core wide task selection and
 scheduling.

On Wed, Mar 04, 2020 at 04:59:57PM +0000, vpillai wrote:
> From: Peter Zijlstra <peterz@...radead.org>
> 
> Instead of only selecting a local task, select a task for all SMT
> siblings for every reschedule on the core (irrespective which logical
> CPU does the reschedule).
> 
> There could be races in core scheduler where a CPU is trying to pick
> a task for its sibling in core scheduler, when that CPU has just been
> offlined.  We should not schedule any tasks on the CPU in this case.
> Return an idle task in pick_next_task for this situation.
> 
> NOTE: there is still potential for siblings rivalry.
> NOTE: this is far too complicated; but thus far I've failed to
>       simplify it further.
> 
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Signed-off-by: Julien Desfossez <jdesfossez@...italocean.com>
> Signed-off-by: Vineeth Remanan Pillai <vpillai@...italocean.com>
> Signed-off-by: Aaron Lu <aaron.lu@...ux.alibaba.com>
> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
> ---
[cut]

Hi Vineeth,
An NULL pointer exception was found when testing V5 on top of stable
v5.6.2. And we tried the patch Peter suggested, the NULL pointer
was not found so far. We don't know if this change would help mitigate
the symptom, but it should do no harm to test with this fix applied.

Thanks,
Chenyu

>From 6828eaf4611eeb3e1bad3b9a0d4ec53c6fa01fe3 Mon Sep 17 00:00:00 2001
From: Chen Yu <yu.c.chen@...el.com>
Date: Thu, 16 Apr 2020 10:51:07 +0800
Subject: [PATCH] sched: Fix pick_next_task() race condition in core scheduling

As Perter mentioned that Commit 6e2df0581f56 ("sched: Fix pick_next_task()
vs 'change' pattern race") has fixed a race condition due to rq->lock
improperly released after put_prev_task(), backport this fix to core
scheduling's pick_next_task() as well.

Without this fix, Aubrey, Long and I found an NULL exception point
triggered within one hour when running RDT MBA(Intel Resource Directory
Technolodge Memory Bandwidth Allocation) benchmarks on a 36 Core(72 HTs)
platform, which tries to dereference a NULL sched_entity:

[ 3618.429053] BUG: kernel NULL pointer dereference, address: 0000000000000160
[ 3618.429039] RIP: 0010:pick_task_fair+0x2e/0xa0
[ 3618.429042] RSP: 0018:ffffc90000317da8 EFLAGS: 00010046
[ 3618.429044] RAX: 0000000000000000 RBX: ffff88afdf4ad100 RCX: 0000000000000001
[ 3618.429045] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88afdf4ad100
[ 3618.429045] RBP: ffffc90000317dc0 R08: 0000000000000048 R09: 0100000000100000
[ 3618.429046] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[ 3618.429047] R13: 000000000002d080 R14: ffff88afdf4ad080 R15: 0000000000000014
[ 3618.429048]  ? pick_task_fair+0x48/0xa0
[ 3618.429048]  pick_next_task+0x34c/0x7e0
[ 3618.429049]  ? tick_program_event+0x44/0x70
[ 3618.429049]  __schedule+0xee/0x5d0
[ 3618.429050]  schedule_idle+0x2c/0x40
[ 3618.429051]  do_idle+0x175/0x280
[ 3618.429051]  cpu_startup_entry+0x1d/0x30
[ 3618.429052]  start_secondary+0x169/0x1c0
[ 3618.429052]  secondary_startup_64+0xa4/0xb0

While with this patch applied, no NULL pointer exception was found within
14 hours for now. Although there's no direct evidence this fix would solve
the issue, it does fix a potential race condition.

Signed-off-by: Chen Yu <yu.c.chen@...el.com>
---
 kernel/sched/core.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 02495d44870f..ef101a3ef583 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4477,9 +4477,14 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 		return next;
 	}
 
-	prev->sched_class->put_prev_task(rq, prev);
-	if (!rq->nr_running)
-		newidle_balance(rq, rf);
+
+#ifdef CONFIG_SMP
+	for_class_range(class, prev->sched_class, &idle_sched_class) {
+		if (class->balance(rq, prev, rf))
+			break;
+	}
+#endif
+	put_prev_task(rq, prev);
 
 	smt_mask = cpu_smt_mask(cpu);
 
-- 
2.20.1