Message-Id: <20190515213615.3387-1-vpillai@digitalocean.com>
Date:   Wed, 15 May 2019 21:36:15 +0000
From:   Vineeth Remanan Pillai <vpillai@...italocean.com>
To:     Aaron Lu <aaron.lu@...ux.alibaba.com>
Cc:     Vineeth Remanan Pillai <vpillai@...italocean.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        Julien Desfossez <jdesfossez@...italocean.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Tim Chen <tim.c.chen@...ux.intel.com>, mingo@...nel.org,
        tglx@...utronix.de, pjt@...gle.com, torvalds@...ux-foundation.org,
        linux-kernel@...r.kernel.org, subhra.mazumdar@...cle.com,
        fweisbec@...il.com, keescook@...omium.org, kerrnel@...gle.com,
        Aubrey Li <aubrey.intel@...il.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Phil Auld <pauld@...hat.com>
Subject: Re: [RFC PATCH v2 00/17] Core scheduling v2

> It's clear now, thanks.
> I don't immediately see how my isolation fix would make your fix stop
> working, will need to check. But I'm busy with other stuffs so it will
> take a while.
>
We have identified the issue and have a fix for it. The issue is the
same as before: the forced-idle sibling has a runnable process which
is starved due to an unconstrained-pick bug.

One sample scenario is as follows:
cpu0 and cpu1 are siblings. cpu0 selects an untagged process 'a',
which forces idle on cpu1 even though cpu1 has a runnable tagged
process 'b', which the code determines to be of lesser priority.
cpu1 can then go into deep idle.

During the next schedule on cpu0, the following could happen:
 - cpu0 selects swapper as there is nothing else to run, and since
   prev_cookie is 0, it does an unconstrained pick of swapper.
   So both cpu0 and cpu1 are idling, and cpu1 might be in deep idle.
 - cpu0 schedules again and selects 'a', which is runnable now.
   Since prev_cookie is 0, 'a' is an unconstrained pick and
   'b' on cpu1 is forgotten again.

This continues with swapper and process 'a' taking turns, without
the sibling ever being considered, until a tagged process becomes
runnable on cpu0, at which point we stop doing unconstrained picks.

The above is one of a couple of scenarios we have seen; each has a
slightly different path, but all of them ultimately lead to an
unconstrained pick that starves the sibling's runnable thread.
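
For reference, here is a distilled sketch of the pre-fix fast path in
pick_next_task() that these scenarios funnel into (simplified for
illustration; not the verbatim code):

	prev_cookie = rq->core->core_cookie;	/* 0 once swapper has run */

	/*
	 * Unconstrained pick: with prev_cookie == 0 and an untagged
	 * candidate, the local CPU takes 'p' without consulting the
	 * sibling, so the tagged task 'b' queued on cpu1 is never
	 * considered and cpu1 can stay in deep idle.
	 */
	if (i == cpu && !prev_cookie && !p->core_cookie) {
		next = p;
		goto done;
	}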

The fix is to mark whether a core went forced-idle while it had a
runnable process, and to not do an unconstrained pick if a forced
idle happened during the last pick.
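
Conceptually, the change looks like this (a simplified sketch distilled
from the patch below, not the verbatim hunks): the prev_cookie check
becomes a need_sync flag that is also raised when any sibling was
forced idle by the previous pick:

	need_sync = !!rq->core->core_cookie;

	for_each_cpu(i, smt_mask) {
		struct rq *rq_i = cpu_rq(i);

		/*
		 * A sibling was forced idle by the previous pick;
		 * disallow the unconstrained fast path this time.
		 */
		if (rq_i->core_forceidle) {
			need_sync = true;
			rq_i->core_forceidle = false;
		}
	}

	/* Fast path only when nothing on the core needs syncing. */
	if (i == cpu && !need_sync && !p->core_cookie) {
		next = p;
		goto done;
	}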

I am attaching herewith the patch that fixes the above issue. The
patch is on top of Peter's fix and your correctness fix that we
modified for v2. We have a public repository with all the changes,
including this fix:
https://github.com/digitalocean/linux-coresched/tree/coresched

We are working on a v3 in which the last 3 commits will be squashed
into their related patches from v2. We hope to post v3 next week with
all the suggestions and fixes from the v2 discussion incorporated.

Thanks,
Vineeth

---
 kernel/sched/core.c  | 26 ++++++++++++++++++++++----
 kernel/sched/sched.h |  1 +
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 413d46bde17d..3aba0f8fe384 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3653,8 +3653,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	struct task_struct *next, *max = NULL;
 	const struct sched_class *class;
 	const struct cpumask *smt_mask;
-	unsigned long prev_cookie;
 	int i, j, cpu, occ = 0;
+	bool need_sync = false;
 
 	if (!sched_core_enabled(rq))
 		return __pick_next_task(rq, prev, rf);
@@ -3702,7 +3702,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	 * 'Fix' this by also increasing @task_seq for every pick.
 	 */
 	rq->core->core_task_seq++;
-	prev_cookie = rq->core->core_cookie;
+	need_sync = !!rq->core->core_cookie;
 
 	/* reset state */
 	rq->core->core_cookie = 0UL;
@@ -3711,6 +3711,11 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 
 		rq_i->core_pick = NULL;
 
+		if (rq_i->core_forceidle) {
+			need_sync = true;
+			rq_i->core_forceidle = false;
+		}
+
 		if (i != cpu)
 			update_rq_clock(rq_i);
 	}
@@ -3743,7 +3748,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 				 * If there weren't no cookies; we don't need
 				 * to bother with the other siblings.
 				 */
-				if (i == cpu && !prev_cookie)
+				if (i == cpu && !need_sync)
 					goto next_class;
 
 				continue;
@@ -3753,7 +3758,7 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 			 * Optimize the 'normal' case where there aren't any
 			 * cookies and we don't need to sync up.
 			 */
-			if (i == cpu && !prev_cookie && !p->core_cookie) {
+			if (i == cpu && !need_sync && !p->core_cookie) {
 				next = p;
 				rq->core_pick = NULL;
 				rq->core->core_cookie = 0UL;
@@ -3816,7 +3821,16 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 					}
 					occ = 1;
 					goto again;
+				} else {
+					/*
+					 * Once we select a task for a cpu, we
+					 * should not be doing an unconstrained
+					 * pick because it might starve a task
+					 * on a forced idle cpu.
+					 */
+					need_sync = true;
 				}
+
 			}
 		}
 next_class:;
@@ -3843,6 +3857,9 @@ next_class:;
 
 		WARN_ON_ONCE(!rq_i->core_pick);
 
+		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+			rq_i->core_forceidle = true;
+
 		rq_i->core_pick->core_occupation = occ;
 
 		if (i == cpu)
@@ -6746,6 +6763,7 @@ void __init sched_init(void)
 		rq->core_pick = NULL;
 		rq->core_enabled = 0;
 		rq->core_tree = RB_ROOT;
+		rq->core_forceidle = false;
 
 		rq->core_cookie = 0UL;
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f38d4149443b..74c29afa0f32 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -964,6 +964,7 @@ struct rq {
 	unsigned int		core_enabled;
 	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	bool			core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
-- 
2.17.1
