linux-kernel - [PATCH 1/2] sched/fair: Record the average duration of a task

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <338ec61022d4b5242e4af6d156beac53f20eacf2.1719295669.git.yu.c.chen@intel.com>
Date: Tue, 25 Jun 2024 15:22:09 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>
Cc: Mike Galbraith <efault@....de>,
	Tim Chen <tim.c.chen@...el.com>,
	Yujie Liu <yujie.liu@...el.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	"Gautham R . Shenoy" <gautham.shenoy@....com>,
	Chen Yu <yu.chen.surf@...il.com>,
	linux-kernel@...r.kernel.org,
	Chen Yu <yu.c.chen@...el.com>
Subject: [PATCH 1/2] sched/fair: Record the average duration of a task

Record the average duration of a task, as there is a requirement
to leverage this information for better task placement.

At first thought the (p->se.sum_exec_runtime / p->nvcsw)
can be used to measure the task duration. However, the
history long past was factored too heavily in such a formula.
Ideally, the old activity should decay and not affect
the current status too much.

Although something based on PELT can be used, se.util_avg might
not be appropriate to describe the task duration:
Task p1 and task p2 are doing frequent ping-pong scheduling on
one CPU, both p1 and p2 have a short duration, but the util_avg
of each task can be up to 50%, which is inconsistent with the
short task duration.

Here's an example to show what the average duration is. Suppose
on CPUx, task p1 and p2 run alternatively:

 --------------------> time

 | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
               ^               ^                                     ^
 |_____________|               |_____________________________________|
                                                                     ^
                                                                     |
                                                                  p1 dequeued

p1's duration is (1 + 0.5)ms. Because if p2 does not preempt p1, p1 can run 1.5ms.
This reflects the nature of a task: how long it wishes to run at most.

Suggested-by: Tim Chen <tim.c.chen@...el.com>
Signed-off-by: Chen Yu <yu.c.chen@...el.com>
---
 include/linux/sched.h |  3 +++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 12 ++++++++++++
 3 files changed, 17 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90691d99027e..78747d3954fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1339,6 +1339,9 @@ struct task_struct {
 	struct callback_head		cid_work;
 #endif
 
+	u64				prev_sleep_sum_runtime;
+	u64				duration_avg;
+
 	struct tlbflush_unmap_batch	tlb_ubc;
 
 	/* Cache last used pipe for splice(): */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..7399c4143528 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->migration_pending = NULL;
 #endif
 	init_sched_mm_cid(p);
+	p->prev_sleep_sum_runtime = 0;
+	p->duration_avg = 0;
 }
 
 DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..445877069fbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6833,6 +6833,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 static void set_next_buddy(struct sched_entity *se);
 
+static inline void dur_avg_update(struct task_struct *p)
+{
+	u64 dur;
+
+	dur = p->se.sum_exec_runtime - p->prev_sleep_sum_runtime;
+	p->prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+	update_avg(&p->duration_avg, dur);
+}
+
 /*
  * The dequeue_task method is called before nr_running is
  * decreased. We remove the task from the rbtree and
@@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	if (task_sleep)
+		dur_avg_update(p);
+
 	hrtick_update(rq);
 }
 
-- 
2.25.1