Message-ID: <CAKfTPtCv6zsDK=WF5io3oMAaTivQMTPTwSO8cZO8mdsMCQS-iQ@mail.gmail.com>
Date: Wed, 21 Jan 2026 17:40:29 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Shubhang Kaushik <shubhang@...amperecomputing.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, 
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Shubhang Kaushik <sh@...two.org>, Valentin Schneider <vschneid@...hat.com>, 
	K Prateek Nayak <kprateek.nayak@....com>, Huang Shijie <shijie8@...il.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8] sched: update rq->avg_idle when a task is moved to an
 idle CPU

On Wed, 21 Jan 2026 at 10:33, Shubhang Kaushik
<shubhang@...amperecomputing.com> wrote:
>
> Currently, rq->idle_stamp is only used to calculate avg_idle during
> wakeups. This means other paths that move a task to an idle CPU, such
> as fork/clone, execve, or migrations, do not end the CPU's idle status
> in the scheduler's eyes, leading to an inaccurate avg_idle.
>
> This patch introduces update_rq_avg_idle() to provide a more accurate
> measurement of CPU idle duration. By invoking this helper in
> put_prev_task_idle(), we ensure avg_idle is updated whenever a CPU
> stops being idle, regardless of how the new task arrived.
>
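For orientation: put_prev_task_idle() runs exactly when the idle task is
switched out during __schedule(), so every path that makes an idle CPU
busy passes through it. A simplified call-chain sketch (condensed, not
the exact code):

	__schedule()
	  -> pick_next_task()
	       -> put_prev_task_idle(rq, prev /* == rq->idle */, next)
	            -> update_rq_avg_idle(rq)
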
> Changes in v8:
> - Removed the 'if (rq->idle_stamp)' check: Based on reviewer feedback,
>   tracking any idle duration (not just fair-class specific) provides a
>   more universal view of core availability.
>
> Testing on an 80-core Ampere Altra (ARMv8) with 6.19-rc5 baseline:
> - Hackbench: +7.2% performance gain at 16 threads.
> - Schbench: Reduced p99.9 tail latencies at high concurrency.
>
> Tested-by: Shubhang Kaushik <shubhang@...amperecomputing.com>
> Signed-off-by: Shubhang Kaushik <shubhang@...amperecomputing.com>

Reviewed-by: Vincent Guittot <vincent.guittot@...aro.org>


> ---
> This series improves the accuracy of rq->avg_idle by ensuring the CPU's idle
> duration is updated whenever a task moves to an idle CPU.
>
> rq->idle_stamp is only cleared during wakeups. This leaves other paths
> that move a task to an idle CPU, such as fork, exec, or load-balancing
> migrations, unable to end the CPU's idle status in the scheduler's view.
> This architectural gap produces stale avg_idle values, misleading the
> newidle balancer into incorrectly skipping task migrations and degrading
> overall throughput on high-core-count systems.
>
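To make the "skipping" concrete: the newidle-balance path bails out when
the expected idle time (avg_idle) is shorter than the cost of balancing.
Roughly, condensed from mainline sched_balance_newidle() (an illustrative
sketch, not the exact code):

	/* Don't even try if we expect to be busy again almost immediately. */
	if (this_rq->avg_idle < sysctl_sched_migration_cost)
		goto out;

	for_each_domain(this_cpu, sd) {
		/*
		 * Stop once the accumulated balancing cost exceeds the
		 * time we expect to stay idle. A stale, too-small
		 * avg_idle trips this check prematurely.
		 */
		if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost)
			break;
		/* ... attempt load balancing in this domain ... */
	}

so a stale, undersized avg_idle makes an actually-idle CPU refuse work.
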
> v7 --> v8:
>     Remove the 'if (rq->idle_stamp)' condition check in
>     update_rq_avg_idle().
>     v7: https://lkml.org/lkml/2025/12/26/90
>
> v6 --> v7:
>     Call update_rq_avg_idle() from put_prev_task_idle().
>     Drop patch 1 of the original patch set.
>     v6: https://lkml.org/lkml/2025/12/9/377
>
> v5 --> v6:
>     Remove "this_rq->idle_stamp = 0;" in patch 1.
>     Update the test results with SPECjbb.
>     v5: https://lkml.org/lkml/2025/12/3/179
>
> v4 --> v5:
>     Modify the changelog.
>     v4: https://lkml.org/lkml/2025/11/28/300
>
> v3 --> v4:
>     Remove the code for the delayed task.
>     v3: https://lkml.org/lkml/2025/11/27/456
>
> v2 --> v3:
>     Merge patch 3 into patch 2:
>     move update_rq_avg_idle() to enqueue_task().
>     v2: https://lkml.org/lkml/2025/11/27/214
>
> v1 --> v2:
>     Move update_rq_avg_idle() into activate_task().
>     Add a delayed-dequeue task check.
>     v1: https://lkml.org/lkml/2025/11/24/97
>
> kernel/sched/core.c  | 24 ++++++++++++------------
> kernel/sched/idle.c  |  1 +
> kernel/sched/sched.h |  1 +
> 3 files changed, 14 insertions(+), 12 deletions(-)
> --
> 2.52.0
>
> sched/core: update rq->avg_idle when a task is moved to an idle CPU
> ---
>  kernel/sched/core.c  | 24 ++++++++++++------------
>  kernel/sched/idle.c  |  1 +
>  kernel/sched/sched.h |  1 +
>  3 files changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 045f83ad261e25283d290fd064ad47cd2399dc79..81a841e22c961ff04ad291eeeed81147fd464324 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3607,6 +3607,18 @@ static inline void ttwu_do_wakeup(struct task_struct *p)
>         trace_sched_wakeup(p);
>  }
>
> +void update_rq_avg_idle(struct rq *rq)
> +{
> +       u64 delta = rq_clock(rq) - rq->idle_stamp;
> +       u64 max = 2*rq->max_idle_balance_cost;
> +
> +       update_avg(&rq->avg_idle, delta);
> +
> +       if (rq->avg_idle > max)
> +               rq->avg_idle = max;
> +       rq->idle_stamp = 0;
> +}
> +
>  static void
>  ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>                  struct rq_flags *rf)
> @@ -3642,18 +3654,6 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
>                 p->sched_class->task_woken(rq, p);
>                 rq_repin_lock(rq, rf);
>         }
> -
> -       if (rq->idle_stamp) {
> -               u64 delta = rq_clock(rq) - rq->idle_stamp;
> -               u64 max = 2*rq->max_idle_balance_cost;
> -
> -               update_avg(&rq->avg_idle, delta);
> -
> -               if (rq->avg_idle > max)
> -                       rq->avg_idle = max;
> -
> -               rq->idle_stamp = 0;
> -       }
>  }
>
>  /*
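For anyone reading along: update_avg() is the scheduler's small EWMA
helper; in current mainline it is, roughly:

	static void update_avg(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;

		*avg += diff / 8;
	}

so avg_idle converges toward recent idle durations with a 1/8 weight per
sample, and the clamp above bounds it to twice the max balance cost.
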
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index c174afe1dd177a22535417be0de1fc1b690c0368..36ddc5bcfa0383bd4d07d3c8b732ee5b8567d194 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -460,6 +460,7 @@ static void put_prev_task_idle(struct rq *rq, struct task_struct *prev, struct t
>  {
>         update_curr_idle(rq);
>         scx_update_idle(rq, false, true);
> +       update_rq_avg_idle(rq);
>  }
>
>  static void set_next_task_idle(struct rq *rq, struct task_struct *next, bool first)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 93fce4bbff5eac1d4719394e89dfae886b74d865..7edf8600f2c3f45afa32bc73db2155ea6e0067f0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1676,6 +1676,7 @@ static inline struct cfs_rq *group_cfs_rq(struct sched_entity *grp)
>
>  #endif /* !CONFIG_FAIR_GROUP_SCHED */
>
> +extern void update_rq_avg_idle(struct rq *rq);
>  extern void update_rq_clock(struct rq *rq);
>
>  /*
>
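If it helps to see the numerics of the EWMA plus the clamp, here is a
tiny stand-alone user-space sketch (plain C with made-up values,
deliberately not kernel code):

	#include <stdio.h>
	#include <stdint.h>

	/* Same shape as the kernel helper: 1/8-weight moving average. */
	static void update_avg(uint64_t *avg, uint64_t sample)
	{
		int64_t diff = (int64_t)(sample - *avg);

		*avg += diff / 8;
	}

	int main(void)
	{
		uint64_t avg_idle = 0;
		const uint64_t max_idle_balance_cost = 250000;	/* 250us, hypothetical */
		const uint64_t samples[] = { 2000000, 1800000, 2200000, 100000 };

		for (unsigned int i = 0; i < 4; i++) {
			update_avg(&avg_idle, samples[i]);
			if (avg_idle > 2 * max_idle_balance_cost)
				avg_idle = 2 * max_idle_balance_cost;	/* the clamp */
			printf("sample=%llu ns -> avg_idle=%llu ns\n",
			       (unsigned long long)samples[i],
			       (unsigned long long)avg_idle);
		}
		return 0;
	}

With these numbers the average climbs, is capped at 500000 ns
(2 * max_idle_balance_cost) on the third sample, then decays after the
short 100us sample.
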
> ---
> base-commit: 24d479d26b25bce5faea3ddd9fa8f3a6c3129ea7
> change-id: 20260116-v8-patch-series-5ff91b821cd4
>
> Best regards,
> --
> Shubhang Kaushik <shubhang@...amperecomputing.com>
>
