Message-Id: <338ec61022d4b5242e4af6d156beac53f20eacf2.1719295669.git.yu.c.chen@intel.com>
Date: Tue, 25 Jun 2024 15:22:09 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>
Cc: Mike Galbraith <efault@....de>,
Tim Chen <tim.c.chen@...el.com>,
Yujie Liu <yujie.liu@...el.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
Chen Yu <yu.chen.surf@...il.com>,
linux-kernel@...r.kernel.org,
Chen Yu <yu.c.chen@...el.com>
Subject: [PATCH 1/2] sched/fair: Record the average duration of a task
Record the average duration of a task, so that this information can
be leveraged for better task placement.

At first glance, (p->se.sum_exec_runtime / p->nvcsw) could be used
to measure the task duration. However, this formula weighs the
distant past too heavily. Ideally, old activity should decay and
not affect the current status too much.
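
To illustrate the difference, here is a minimal userspace sketch, not
kernel code. It compares the lifetime mean with a decaying mean. The
1/8 weight is an assumption about what update_avg() does, and all names
below (lifetime_avg, decaying_avg_update, simulate, struct avg_pair)
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Lifetime mean, like sum_exec_runtime / nvcsw: every past run counts equally. */
static uint64_t lifetime_avg(uint64_t sum_exec_runtime, uint64_t nvcsw)
{
	assert(nvcsw > 0);
	return sum_exec_runtime / nvcsw;
}

/*
 * Decaying mean: each new sample pulls the average 1/8 of the way
 * toward itself. The 1/8 weight is an assumption for this sketch.
 */
static void decaying_avg_update(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff / 8;
}

struct avg_pair {
	uint64_t lifetime;	/* sum of all runs / number of runs */
	uint64_t decaying;	/* decaying mean after the last run */
};

/* Feed long_runs runs of long_us, then short_runs runs of short_us. */
static struct avg_pair simulate(unsigned int long_runs, uint64_t long_us,
				unsigned int short_runs, uint64_t short_us)
{
	uint64_t sum = 0, nr = 0, avg = 0;
	struct avg_pair res;
	unsigned int i;

	for (i = 0; i < long_runs; i++) {
		sum += long_us;
		nr++;
		decaying_avg_update(&avg, long_us);
	}
	for (i = 0; i < short_runs; i++) {
		sum += short_us;
		nr++;
		decaying_avg_update(&avg, short_us);
	}

	res.lifetime = lifetime_avg(sum, nr);
	res.decaying = avg;
	return res;
}
```

Calling simulate(1000, 10000, 50, 100) models a task that ran 10ms at
a time for a long while and then switched to 100us runs. The lifetime
mean stays above 9ms, while the decaying mean ends up close to 100us.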

A PELT-based metric could be used, but se.util_avg is not a good
description of task duration. Suppose task p1 and task p2 are doing
frequent ping-pong scheduling on one CPU. Both p1 and p2 have a short
duration, yet the util_avg of each task can be up to 50%, which is
inconsistent with the short task duration.

Here's an example to show what the average duration is. Suppose
on CPUx, task p1 and p2 run alternately:

 --------------------> time

 | p1 runs 1ms | p2 preempts p1 | p1 switches in, runs 0.5ms and blocks |
               ^                ^                                       ^
 |_____________|                |_______________________________________|
                                                                        ^
                                                                        |
                                                                   p1 dequeued

p1's duration is (1 + 0.5)ms, because if p2 had not preempted p1, p1
would have run for 1.5ms. This reflects the nature of a task: how long
it wishes to run at most.
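
The timeline above can be replayed with a small userspace model of the
bookkeeping, shown here only as a sketch. struct demo_task and the
demo_*() helpers are hypothetical stand-ins for task_struct and the
scheduler paths, and the 1/8 weight again assumes the behaviour of
update_avg().

```c
#include <stdint.h>

/* Hypothetical stand-in for the three task_struct fields involved. */
struct demo_task {
	uint64_t sum_exec_runtime;	/* total runtime so far, in ns */
	uint64_t prev_sleep_sum_runtime;/* runtime snapshot at last sleep */
	uint64_t duration_avg;		/* decaying average duration, in ns */
};

/* The task runs for ns nanoseconds; a preemption afterwards records nothing. */
static void demo_run(struct demo_task *p, uint64_t ns)
{
	p->sum_exec_runtime += ns;
}

/*
 * The task blocks: the sample is everything it ran since the previous
 * sleep, regardless of how many times it was preempted in between.
 * Returns the sample that was folded into the average.
 */
static uint64_t demo_sleep(struct demo_task *p)
{
	uint64_t dur = p->sum_exec_runtime - p->prev_sleep_sum_runtime;
	int64_t diff = (int64_t)(dur - p->duration_avg);

	p->prev_sleep_sum_runtime = p->sum_exec_runtime;
	p->duration_avg += diff / 8;	/* assumed 1/8 weight */
	return dur;
}

/* Replay the example: 1ms, preempted by p2, 0.5ms more, then block. */
static uint64_t demo_timeline(struct demo_task *p1)
{
	demo_run(p1, 1000000);	/* p1 runs 1ms, then p2 preempts it */
	demo_run(p1, 500000);	/* p1 switches back in and runs 0.5ms */
	return demo_sleep(p1);	/* p1 blocks and is dequeued */
}
```

For a zero-initialized demo_task, demo_timeline() returns a single
sample of 1500000ns. The preemption by p2 does not split it into two
samples, because only the sleep path takes a snapshot.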
Suggested-by: Tim Chen <tim.c.chen@...el.com>
Signed-off-by: Chen Yu <yu.c.chen@...el.com>
---
include/linux/sched.h | 3 +++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 12 ++++++++++++
3 files changed, 17 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90691d99027e..78747d3954fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1339,6 +1339,9 @@ struct task_struct {
struct callback_head cid_work;
#endif
+ u64 prev_sleep_sum_runtime;
+ u64 duration_avg;
+
struct tlbflush_unmap_batch tlb_ubc;
/* Cache last used pipe for splice(): */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..7399c4143528 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4359,6 +4359,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->migration_pending = NULL;
#endif
init_sched_mm_cid(p);
+ p->prev_sleep_sum_runtime = 0;
+ p->duration_avg = 0;
}
DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41b58387023d..445877069fbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6833,6 +6833,15 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
static void set_next_buddy(struct sched_entity *se);
+static inline void dur_avg_update(struct task_struct *p)
+{
+ u64 dur;
+
+ dur = p->se.sum_exec_runtime - p->prev_sleep_sum_runtime;
+ p->prev_sleep_sum_runtime = p->se.sum_exec_runtime;
+ update_avg(&p->duration_avg, dur);
+}
+
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
@@ -6905,6 +6914,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
+ if (task_sleep)
+ dur_avg_update(p);
+
hrtick_update(rq);
}
--
2.25.1