Message-Id: <20260121-v8-patch-series-v8-1-b7f1cbee5055@os.amperecomputing.com>
Date: Wed, 21 Jan 2026 01:31:53 -0800
From: Shubhang Kaushik <shubhang@...amperecomputing.com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Shubhang Kaushik <sh@...two.org>,
Valentin Schneider <vschneid@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>
Cc: Huang Shijie <shijie8@...il.com>, linux-kernel@...r.kernel.org,
Shubhang Kaushik <shubhang@...amperecomputing.com>
Subject: [PATCH v8] sched: update rq->avg_idle when a task is moved to an
idle CPU
Currently, rq->idle_stamp is only used to calculate avg_idle during
wakeups. Other paths that move a task to an idle CPU, such as fork/clone,
execve, or load-balancing migrations, therefore never end the CPU's idle
period from the scheduler's point of view, leaving avg_idle inaccurate.
Introduce update_rq_avg_idle() to provide a more accurate measurement of
CPU idle duration. Calling this helper from put_prev_task_idle() ensures
avg_idle is updated whenever a CPU stops being idle, regardless of how
the new task arrived.
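For reference, the smoothing performed here is a simple EWMA with a
clamp. The standalone userspace sketch below illustrates the behaviour,
assuming update_avg() keeps its usual form (avg += (sample - avg) / 8);
the 500 us balance cost and the idle samples are made-up placeholders:

    #include <stdio.h>
    #include <stdint.h>

    /* Mirrors the kernel's update_avg(): EWMA with a 1/8 weight. */
    static void update_avg(uint64_t *avg, uint64_t sample)
    {
            int64_t diff = (int64_t)(sample - *avg);

            *avg += diff / 8;
    }

    int main(void)
    {
            /* Placeholder for 2 * rq->max_idle_balance_cost, in ns. */
            const uint64_t max = 2 * 500000ULL;
            uint64_t avg_idle = 0;
            uint64_t idle_ns[] = { 100000, 2000000, 50000, 800000 };

            for (int i = 0; i < 4; i++) {
                    update_avg(&avg_idle, idle_ns[i]);
                    /* Clamp, as update_rq_avg_idle() does. */
                    if (avg_idle > max)
                            avg_idle = max;
                    printf("idle %7llu ns -> avg_idle %7llu ns\n",
                           (unsigned long long)idle_ns[i],
                           (unsigned long long)avg_idle);
            }
            return 0;
    }

Each sample nudges avg_idle by 1/8 of the difference, and the clamp keeps
one long idle period from pushing the estimate past twice the measured
balance cost.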
Changes in v8:
- Removed the 'if (rq->idle_stamp)' check: based on reviewer feedback,
  tracking any idle duration (not just fair-class-specific ones) provides
  a more universal view of core availability.
Testing on an 80-core Ampere Altra (ARMv8) with 6.19-rc5 baseline:
- Hackbench: +7.2% performance gain at 16 threads.
- Schbench: Reduced p99.9 tail latencies at high concurrency.
Tested-by: Shubhang Kaushik <shubhang@...amperecomputing.com>
Signed-off-by: Shubhang Kaushik <shubhang@...amperecomputing.com>
---
This patch improves the accuracy of rq->avg_idle by ensuring the CPU's idle
duration is updated whenever a task moves to an idle CPU.
rq->idle_stamp is currently only cleared during wakeups. This leaves other
paths that move a task to an idle CPU, such as fork, exec, or load-balancing
migrations, unable to end the CPU's idle status in the scheduler's view.
This architectural gap produces stale avg_idle values, misleading the
newidle balancer into incorrectly skipping task migrations and degrading
overall throughput on high-core-count systems.
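To make the impact concrete, here is a simplified standalone model of the
gate the newidle balancer applies (paraphrased from sched_balance_newidle()
in kernel/sched/fair.c; the function name, parameters and costs below are
illustrative, not the exact kernel code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * A newly idle CPU only attempts to pull work when its expected idle
     * time (avg_idle) exceeds the cost of balancing. A stale, too-small
     * avg_idle makes these checks fail and leaves runnable tasks queued
     * on busy CPUs.
     */
    static bool worth_newidle_balance(uint64_t avg_idle,
                                      uint64_t migration_cost,
                                      uint64_t curr_cost,
                                      uint64_t max_newidle_lb_cost)
    {
            /* Global gate: expected idle time shorter than a migration. */
            if (avg_idle < migration_cost)
                    return false;

            /* Per-domain gate: not enough idle time left to pay for this
             * domain's balance pass. */
            if (avg_idle < curr_cost + max_newidle_lb_cost)
                    return false;

            return true;
    }

    int main(void)
    {
            /* Placeholder costs in nanoseconds. */
            const uint64_t migration_cost = 500000, lb_cost = 200000;

            printf("fresh avg_idle (1 ms):  %d\n",
                   worth_newidle_balance(1000000, migration_cost, 0, lb_cost));
            printf("stale avg_idle (50 us): %d\n",
                   worth_newidle_balance(50000, migration_cost, 0, lb_cost));
            return 0;
    }

When avg_idle is not refreshed on fork/exec/migration arrivals, the
balancer keeps evaluating an out-of-date estimate, which is how the
skipped migrations described above arise.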
v7 --> v8:
  - Remove the 'if (rq->idle_stamp)' condition check in
    update_rq_avg_idle().
  -- v7: https://lkml.org/lkml/2025/12/26/90
v6 --> v7:
  - Call update_rq_avg_idle() in put_prev_task_idle().
  - Remove patch 1 of the original patch set.
  -- v6: https://lkml.org/lkml/2025/12/9/377
v5 --> v6:
  - Remove "this_rq->idle_stamp = 0;" in patch 1.
  - Update the test results with Specjbb.
  -- v5: https://lkml.org/lkml/2025/12/3/179
v4 --> v5:
  - Modify the changelog.
  -- v4: https://lkml.org/lkml/2025/11/28/300
v3 --> v4:
  - Remove the code for delayed tasks.
  -- v3: https://lkml.org/lkml/2025/11/27/456
v2 --> v3:
  - Merge patch 3 into patch 2: move update_rq_avg_idle() to
    enqueue_task().
  -- v2: https://lkml.org/lkml/2025/11/27/214
v1 --> v2:
  - Put update_rq_avg_idle() in activate_task().
  - Add a delay-dequeue task check.
  -- v1: https://lkml.org/lkml/2025/11/24/97
kernel/sched/core.c | 24 ++++++++++++------------
kernel/sched/idle.c | 1 +
kernel/sched/sched.h | 1 +
3 files changed, 14 insertions(+), 12 deletions(-)
--
2.52.0
sched/core: update rq->avg_idle when a task is moved to an idle CPU
---
kernel/sched/core.c | 24 ++++++++++++------------
kernel/sched/idle.c | 1 +
kernel/sched/sched.h | 1 +
3 files changed, 14 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 045f83ad261e25283d290fd064ad47cd2399dc79..81a841e22c961ff04ad291eeeed81147fd464324 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3607,6 +3607,18 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
trace_sched_wakeup(p);
}
+void update_rq_avg_idle(struct rq *rq)
+{
+ u64 delta = rq_clock(rq) - rq->idle_stamp;
+ u64 max = 2*rq->max_idle_balance_cost;
+
+ update_avg(&rq->avg_idle, delta);
+
+ if (rq->avg_idle > max)
+ rq->avg_idle = max;
+ rq->idle_stamp = 0;
+}
+
static void
ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
struct rq_flags *rf)
@@ -3642,18 +3654,6 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
p->sched_class->task_woken(rq, p);
rq_repin_lock(rq, rf);
}
-
- if (rq->idle_stamp) {
- u64 delta = rq_clock(rq) - rq->idle_stamp;
- u64 max = 2*rq->max_idle_balance_cost;
-
- update_avg(&rq->avg_idle, delta);
-
- if (rq->avg_idle > max)
- rq->avg_idle = max;
-
- rq->idle_stamp = 0;
- }
}
/*
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c174afe1dd177a22535417be0de1fc1b690c0368..36ddc5bcfa0383bd4d07d3c8b732ee5b8567d194 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -460,6 +460,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct task_struct *next)
{
update_curr_idle(rq);
scx_update_idle(rq, false, true);
+ update_rq_avg_idle(rq);
}
static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 93fce4bbff5eac1d4719394e89dfae886b74d865..7edf8600f2c3f45afa32bc73db2155ea6e0067f0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1676,6 +1676,7 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
#endif /* !CONFIG_FAIR_GROUP_SCHED */
+extern void update_rq_avg_idle(struct rq *rq);
extern void update_rq_clock(struct rq *rq);
/*
---
base-commit: 24d479d26b25bce5faea3ddd9fa8f3a6c3129ea7
change-id: 20260116-v8-patch-series-5ff91b821cd4
Best regards,
--
Shubhang Kaushik <shubhang@...amperecomputing.com>