Message-ID: <953b0294-312e-4f58-a459-43623a0f5128@amd.com>
Date: Thu, 22 Jan 2026 12:10:20 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Shubhang Kaushik <shubhang@...amperecomputing.com>, Ingo Molnar
	<mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, Juri Lelli
	<juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
	<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
	<mgorman@...e.de>, Shubhang Kaushik <sh@...two.org>, Valentin Schneider
	<vschneid@...hat.com>
CC: Huang Shijie <shijie8@...il.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v8] sched: update rq->avg_idle when a task is moved to an
 idle CPU

Hello Shubhang,

On 1/21/2026 3:01 PM, Shubhang Kaushik wrote:
> Currently, rq->idle_stamp is only used to calculate avg_idle during
> wakeups. This means other paths that move a task to an idle CPU, such as
> fork/clone, execve, or migrations, do not end the CPU's idle status in
> the scheduler's eyes, leading to an inaccurate avg_idle.
> 
> This patch introduces update_rq_avg_idle() to provide a more accurate
> measurement of CPU idle duration. By invoking this helper in
> put_prev_task_idle(), we ensure avg_idle is updated whenever a CPU
> stops being idle, regardless of how the new task arrived.
> 
> Changes in v8:
> - Removed the 'if (rq->idle_stamp)' check: Based on reviewer feedback,
>   tracking any idle duration (not just fair-class specific) provides a
>   more universal view of core availability.
> 
> Testing on an 80-core Ampere Altra (ARMv8) with 6.19-rc5 baseline:
> - Hackbench : +7.2% performance gain at 16 threads.
> - Schbench: Reduced p99.9 tail latencies at high concurrency.
> 
> Tested-by: Shubhang Kaushik <shubhang@...amperecomputing.com>
> Signed-off-by: Shubhang Kaushik <shubhang@...amperecomputing.com>
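
For anyone following along, my mental model of the helper is roughly the
below - a sketch only, paraphrasing the avg_idle bookkeeping the wakeup
path has done historically (the idle_stamp handling around ttwu and
update_avg() in kernel/sched/core.c); the actual body in the patch may
differ, and per the v8 changelog the idle_stamp check itself has been
dropped there:

  static void update_rq_avg_idle(struct rq *rq)
  {
          u64 delta, max;

          /*
           * The v8 changelog says this check was removed in the patch;
           * kept here only to mirror the historical wakeup-path logic.
           */
          if (!rq->idle_stamp)
                  return;

          delta = rq_clock(rq) - rq->idle_stamp;
          max = 2 * rq->max_idle_balance_cost;

          /* EWMA with 1/8 weight, same as update_avg() */
          rq->avg_idle += ((s64)(delta - rq->avg_idle)) / 8;

          /* Clamp so a single long idle stretch doesn't dominate */
          if (rq->avg_idle > max)
                  rq->avg_idle = max;

          rq->idle_stamp = 0;
  }

With this invoked from put_prev_task_idle(), every path that puts a task
on an idle CPU refreshes avg_idle, not just the wakeup path - which is
also why the newidle balance behaviour shifts in the numbers below.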

For the most part, I haven't observed any regressions. The one that shows
up most consistently is tbench with 256 clients, which seems to be super
sensitive to any newidle balance changes on my setup:

  =====================================
  Test          : tbench
  Units         : Normalized throughput
  Interpretation: Higher is better
  Statistic     : %diff in AMean
  =====================================

    Clients     %diff
          1       -3%
          2       -1%
          4        0%
          8        0%
         16        0%
         32       -1%
         64       -2%
        128       -1%
        256       -5% *
        512        0%
       1024        0%

Note: During reruns with profiling, I've seen the results on tip come
closer to the patched kernel (~2%, i.e. within the margin of error).

Looking at the schedstats shows a lot more idle and newidle balance
attempts within the lower domains (SMT, MC) in the bad case:

  ----------------------------------------------------------------------------------------------------
  CPU: <ALL CPUS SUMMARY> | DOMAIN: SMT
  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2   PCT_CHANGE     AVG_JIFFIES1 AVG_JIFFIES2
  ----------------------------------------- <Category idle> ------------------------------------------
  idle_lb_count                                                    :         721,       1080  |    49.79% |  $       28.30,       18.89 $
  idle_lb_balanced                                                 :         638,        835  |    30.88% |  $       31.98,       24.43 $
  idle_lb_failed                                                   :          60,        173  |   188.33% |  $      340.03,      117.90 $
  ...
  *idle_lb_success_count                                           :          23,         72  |   213.04% |
  *idle_lb_avg_pulled                                              :        1.13,       1.08  |    -4.17% |
  ---------------------------------------- <Category newidle> ----------------------------------------
  newidle_lb_count                                                 :        3964,      17961  |   353.10% |  $        5.15,        1.14 $
  newidle_lb_balanced                                              :        3235,      13723  |   324.20% |  $        6.31,        1.49 $
  newidle_lb_failed                                                :         540,       3227  |   497.59% |  $       37.78,        6.32 $
  ...
  *newidle_lb_success_count                                        :         189,       1011  |   434.92% |
  *newidle_lb_avg_pulled                                           :        0.99,       1.00  |     0.43% |
  --------------------------------- <Category active_load_balance()> ---------------------------------
  
  ----------------------------------------------------------------------------------------------------
  CPU: <ALL CPUS SUMMARY> | DOMAIN: MC
  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2   PCT_CHANGE     AVG_JIFFIES1 AVG_JIFFIES2
  ----------------------------------------- <Category idle> ------------------------------------------
  idle_lb_count                                                    :         301,        527  |    75.08% |  $       67.78,       38.70 $
  idle_lb_balanced                                                 :          97,        128  |    31.96% |  $      210.33,      159.34 $
  idle_lb_failed                                                   :         179,        354  |    97.77% |  $      113.98,       57.62 $
  ...
  *idle_lb_success_count                                           :          25,         45  |    80.00% |
  *idle_lb_avg_pulled                                              :        1.52,       1.40  |    -7.89% |
  ---------------------------------------- <Category newidle> ----------------------------------------
  newidle_lb_count                                                 :        1917,       7022  |   266.30% |  $       10.64,        2.90 $
  newidle_lb_balanced                                              :         380,        793  |   108.68% |  $       53.69,       25.72 $
  newidle_lb_failed                                                :        1481,       6011  |   305.87% |  $       13.78,        3.39 $
  ...
  *newidle_lb_success_count                                        :          56,        218  |   289.29% |
  *newidle_lb_avg_pulled                                           :        0.98,       1.00  |     1.35% |
  --------------------------------- <Category active_load_balance()> ---------------------------------

  (Full schedstats diff attached below)

For PKG and above, the difference isn't significant. The success count
also increases in proportion to the attempts, but it seems the sheer
number of additional attempts isn't sitting too well with this
particular benchmark.
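
For context on why the extra attempts show up at all: sched_balance_newidle()
gates how far it searches on rq->avg_idle, roughly along the lines below (a
paraphrase of the per-domain check in kernel/sched/fair.c, not the literal
code, and the helper name here is mine). A more frequently refreshed avg_idle
therefore lets the CPU keep attempting pulls at the SMT/MC levels:

  /*
   * Paraphrase of the per-domain gating in sched_balance_newidle():
   * a domain is only worth balancing from if the CPU's average idle
   * time exceeds the cost accumulated by the balance passes so far
   * plus the worst-case cost seen at this domain.
   */
  static bool newidle_balance_worthwhile(struct rq *this_rq,
                                         struct sched_domain *sd,
                                         u64 cost_so_far)
  {
          return this_rq->avg_idle >= cost_so_far + sd->max_newidle_lb_cost;
  }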

tbench has very short sleep durations and benefits from running the client
and server on the same LLC domain. schbench latencies for similar configs
don't show any difference, so I wouldn't worry too much about this specific
regression.

Feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@....com>

-- 
Thanks and Regards,
Prateek

View attachment "perf.sched_stats.diff" of type "text/plain" (29933 bytes)
