lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 26 Jul 2019 14:29:55 -0700
From:   Tim Chen <tim.c.chen@...ux.intel.com>
To:     Julien Desfossez <jdesfossez@...italocean.com>,
        Aaron Lu <aaron.lu@...ux.alibaba.com>
Cc:     Aubrey Li <aubrey.intel@...il.com>,
        Subhra Mazumdar <subhra.mazumdar@...cle.com>,
        Vineeth Remanan Pillai <vpillai@...italocean.com>,
        Nishanth Aravamudan <naravamudan@...italocean.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Paul Turner <pjt@...gle.com>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>,
        Frédéric Weisbecker <fweisbec@...il.com>,
        Kees Cook <keescook@...omium.org>,
        Greg Kerr <kerrnel@...gle.com>, Phil Auld <pauld@...hat.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Pawan Gupta <pawan.kumar.gupta@...ux.intel.com>,
        Paolo Bonzini <pbonzini@...hat.com>
Subject: Re: [RFC PATCH v3 00/16] Core scheduling v3

On 7/26/19 8:21 AM, Julien Desfossez wrote:
> On 25-Jul-2019 10:30:03 PM, Aaron Lu wrote:
>>
>> I tried a different approach based on vruntime with 3 patches following.
> [...]
> 
> We have experimented with this new patchset and indeed the fairness is
> now much better. Interactive tasks with v3 were completely starved when
> there were cpu-intensive tasks running, now they can run consistently.
> With my initial test of TPC-C running in large VMs with a lot of
> background noise VMs, the results are pretty similar to v3, I will run
> more thorough tests and report the results back here.

Aaron's patch inspired me to experiment with another approach to tackle
fairness.  The root problem with v3 was that we didn't account for the forced
idle time when we compare the priority of tasks between two threads.

So what I did here is to account the forced idle time in the top cfs_rq's
min_vruntime when we update the runqueue clock.  When we are comparing
between two cfs runqueues, the task in cpu getting forced idle will now
be credited with the forced idle time. The effect should be similar to
Aaron's patches. Logic is a bit simpler and we don't need to use
one of the sibling's cfs_rq min_vruntime as a time base.

In really limited testing, it seems to have balanced fairness between two
tagged cgroups.

Tim

-------patch 1----------
From: Tim Chen <tim.c.chen@...ux.intel.com>
Date: Wed, 24 Jul 2019 13:58:18 -0700
Subject: [PATCH 1/2] sched: move sched fair prio comparison to fair.c

Move the priority comparison of two tasks in fair class to fair.c.
There is no functional change.

Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
---
 kernel/sched/core.c  | 21 ++-------------------
 kernel/sched/fair.c  | 21 +++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 3 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8ea87be56a1e..f78b8fdfd47c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -105,25 +105,8 @@ static inline bool prio_less(struct task_struct *a, struct task_struct *b)
 	if (pa == -1) /* dl_prio() doesn't work because of stop_class above */
 		return !dl_time_before(a->dl.deadline, b->dl.deadline);
 
-	if (pa == MAX_RT_PRIO + MAX_NICE)  { /* fair */
-		u64 a_vruntime = a->se.vruntime;
-		u64 b_vruntime = b->se.vruntime;
-
-		/*
-		 * Normalize the vruntime if tasks are in different cpus.
-		 */
-		if (task_cpu(a) != task_cpu(b)) {
-			b_vruntime -= task_cfs_rq(b)->min_vruntime;
-			b_vruntime += task_cfs_rq(a)->min_vruntime;
-
-			trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
-				     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
-				     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
-
-		}
-
-		return !((s64)(a_vruntime - b_vruntime) <= 0);
-	}
+	if (pa == MAX_RT_PRIO + MAX_NICE) /* fair */
+		return prio_less_fair(a, b);
 
 	return false;
 }
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02bff10237d4..e289b6e1545b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -602,6 +602,27 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 	return delta;
 }
 
+bool prio_less_fair(struct task_struct *a, struct task_struct *b)
+{
+	u64 a_vruntime = a->se.vruntime;
+	u64 b_vruntime = b->se.vruntime;
+
+	/*
+	 * Normalize the vruntime if tasks are in different cpus.
+	 */
+	if (task_cpu(a) != task_cpu(b)) {
+		b_vruntime -= task_cfs_rq(b)->min_vruntime;
+		b_vruntime += task_cfs_rq(a)->min_vruntime;
+
+		trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
+			     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
+			     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
+
+	}
+
+	return !((s64)(a_vruntime - b_vruntime) <= 0);
+}
+
 /*
  * The idea is to set a period in which each task runs once.
  *
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e91c188a452c..bdabe7ce1152 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1015,6 +1015,7 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq)
 }
 
 extern void queue_core_balance(struct rq *rq);
+extern bool prio_less_fair(struct task_struct *a, struct task_struct *b);
 
 #else /* !CONFIG_SCHED_CORE */
 
-- 
2.20.1


--------------patch 2------------------
From: Tim Chen <tim.c.chen@...ux.intel.com>
Date: Thu, 25 Jul 2019 13:09:21 -0700
Subject: [PATCH 2/2] sched: Account the forced idle time

We did not account for the forced idle time when comparing two tasks
from different SMT threads in the same core.

Account it in root cfs_rq min_vruntime when we update the rq clock.  This will
allow for fair comparison of which task has higher priority from
two different SMT threads.

Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
---
 kernel/sched/core.c |  6 ++++++
 kernel/sched/fair.c | 22 ++++++++++++++++++----
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f78b8fdfd47c..d8fa74810126 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -393,6 +393,12 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 	rq->clock_task += delta;
 
+#ifdef CONFIG_SCHED_CORE
+	/* Account the forced idle time by sibling */
+	if (rq->core_forceidle)
+		rq->cfs.min_vruntime += delta;
+#endif
+
 #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
 	if ((irq_delta + steal) && sched_feat(NONTASK_CAPACITY))
 		update_irq_load_avg(rq, irq_delta + steal);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e289b6e1545b..1b2fd1271c51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -604,20 +604,34 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 
 bool prio_less_fair(struct task_struct *a, struct task_struct *b)
 {
-	u64 a_vruntime = a->se.vruntime;
-	u64 b_vruntime = b->se.vruntime;
+	u64 a_vruntime;
+	u64 b_vruntime;
 
 	/*
 	 * Normalize the vruntime if tasks are in different cpus.
 	 */
 	if (task_cpu(a) != task_cpu(b)) {
-		b_vruntime -= task_cfs_rq(b)->min_vruntime;
-		b_vruntime += task_cfs_rq(a)->min_vruntime;
+		struct sched_entity *sea = &a->se;
+		struct sched_entity *seb = &b->se;
+
+		while (sea->parent)
+			sea = sea->parent;
+		while (seb->parent)
+			seb = seb->parent;
+
+		a_vruntime = sea->vruntime;
+		b_vruntime = seb->vruntime;
+
+		b_vruntime -= task_rq(b)->cfs.min_vruntime;
+		b_vruntime += task_rq(a)->cfs.min_vruntime;
 
 		trace_printk("(%d:%Lu,%Lu,%Lu) <> (%d:%Lu,%Lu,%Lu)\n",
 			     a->pid, a_vruntime, a->se.vruntime, task_cfs_rq(a)->min_vruntime,
 			     b->pid, b_vruntime, b->se.vruntime, task_cfs_rq(b)->min_vruntime);
 
+	} else {
+		a_vruntime = a->se.vruntime;
+		b_vruntime = b->se.vruntime;
 	}
 
 	return !((s64)(a_vruntime - b_vruntime) <= 0);
-- 
2.20.1

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ