Message-ID: <5b99dd88-94ae-4469-a34a-24c32aa3af81@gmail.com>
Date: Thu, 29 Jan 2026 20:44:50 -0500
From: Mario Roy <marioeroy@...il.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Chris Mason <clm@...a.com>, Joseph Salisbury
<joseph.salisbury@...cle.com>, Adam Li <adamli@...amperecomputing.com>,
Hazem Mohamed Abuelfotoh <abuehaze@...zon.com>, Josh Don
<joshdon@...gle.com>, mingo@...hat.com, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com, rostedt@...dmis.org,
bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com,
linux-kernel@...r.kernel.org, kprateek.nayak@....com,
shubhang@...amperecomputing.com, arighi@...dia.com
Subject: Re: [PATCH 4/4] sched/fair: Proportional newidle balance
Peter, thank you for your fix to improve EEVDF.
Cc'd Andrea Righi: thank you for the is_idle_core() function and help. [0]

Cc'd Shubhang Kaushik: your patch inspired me to perform trial-and-error
testing, which has now become the 0280 patch in the CachyMod GitHub
repo. [0]

Together with the help of CachyOS community members, we concluded that
prefcore + prefer-idle-core is surreal. I enjoy the EEVDF scheduler a
lot more, since it favors the SMT siblings less.
For comparison, I added results for sched-ext cosmos.
Limited CPU saturation can reveal potential scheduler issues.
Testing includes 100%, 50%, 31.25%, and 25% CPU saturation.
All kernels were built with GCC to factor out Clang/AutoFDO.
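For reference, a minimal sketch of how one pass over the four saturation
levels might look on this 24c/48t box, using the algorithm3 command from
footnote [1] below:

    for N in 48 24 15 12; do
        ./algorithm3.pl 1e12 --threads=$N
    done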
A) 6.18.8-rc1
with sched/fair: Proportional newidle balance
               48cpus(100%)   24cpus(50%)  15cpus(31.25%)   12cpus(25%)
algorithm3 [1]       9.462s       14.181s         20.311s       24.498s
darktable [2]        2.811s        3.715s          5.315s        6.434s
easywave [3]        19.747s       10.804s         20.207s       21.571s
stress-ng [4]      37632.06      56220.21        41694.50      34740.58
B) 6.18.8-rc1
Peter Z's fix for sched/fair: Proportional newidle balance
               48cpus(100%)   24cpus(50%)  15cpus(31.25%)   12cpus(25%)
algorithm3 [1]       9.340s       14.733s         21.339s       25.069s
darktable [2]        2.493s        3.616s          5.148s        5.968s
easywave [3]        11.357s     13.312s *         18.483s       20.741s
stress-ng [4]      37533.24      55419.85        39452.17      32217.55
algorithm3 and stress-ng regressed, possibly a limited-CPU-saturation
anomaly
easywave (*): weird result, repeatable yet all over the place
C) 6.18.8-rc1
Revert sched/fair: Proportional newidle balance
               48cpus(100%)   24cpus(50%)  15cpus(31.25%)   12cpus(25%)
algorithm3 [1]       9.286s       15.101s         21.417s       25.126s
darktable [2]        2.484s        3.531s          5.185s        6.002s
easywave [3]        11.517s       12.300s         18.466s       20.428s
stress-ng [4]      42231.92    47306.18 *      32438.03 *    28820.83 *
stress-ng (*): lackluster with limited CPU saturation
D) 6.18.8-rc1
Revert sched/fair: Proportional newidle balance
Plus apply the prefer-idle-core patch [0]
               48cpus(100%)   24cpus(50%)  15cpus(31.25%)   12cpus(25%)
algorithm3 [1]       9.312s       11.292s         17.243s       21.811s
darktable [2]        2.418s      3.711s *        5.499s *      6.510s *
easywave [3]        10.035s        9.832s         15.738s       18.805s
stress-ng [4]      44837.41      63364.56        55646.26      48202.58
darktable (*): lower performance with limited CPU saturation;
noticeably better performance otherwise
E) scx_cosmos -m 0-5 -s 800 -l 8000 -f -c 1 -p 0 [5]
               48cpus(100%)   24cpus(50%)  15cpus(31.25%)   12cpus(25%)
algorithm3 [1]       9.218s       11.188s         17.045s       21.130s
darktable [2]        2.365s        3.900s          4.626s        5.664s
easywave [3]         9.187s     16.528s *         15.933s       16.991s
stress-ng [4]      21065.70      36417.65        27185.95      23141.87
easywave (*): sched-ext cosmos appears to favor SMT siblings
---
[0] https://github.com/marioroy/cachymod
prefer-idle-core is 0280-prefer-prevcpu-for-wakeup.patch
shared more for mindfulness about limited CPU saturation than to push
for acceptance of the patch
"surreal" is prefcore + prefer-idle-core, improving many workloads
[1] https://github.com/marioroy/mce-sandbox
./algorithm3.pl 1e12 --threads=N
algorithm3.pl is akin to a server/client application; chatty
primesieve.pl is more CPU-bound; less chatty
optionally, compare with the primesieve binary (fully CPU-bound, not chatty)
https://github.com/kimwalisch/primesieve
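e.g., a hypothetical comparison run (primesieve accepts a thread count):
primesieve 1e12 --threads=N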
[2] https://math.dartmouth.edu/~sarunas/darktable_bench.html
OMP_NUM_THREADS=N darktable-cli setubal.orf setubal.orf.xmp test.jpg \
--core --disable-opencl -d perf
result: pixel pipeline processing took {...} secs
[3] https://openbenchmarking.org/test/pts/easywave
OMP_NUM_THREADS=N ./src/easywave \
-grid examples/e2Asean.grd -source examples/BengkuluSept2007.flt \
-time 600
result: Model time = 10:00:00, elapsed: {...} msec
[4] https://openbenchmarking.org/test/pts/stress-ng
stress-ng -t 30 --metrics-brief --sock N --no-rand-seed --sock-zerocopy
result: bogo ops  real time  usr time  sys time  bogo ops/s   bogo ops/s
                  (secs)     (secs)    (secs)    (real time)  (usr+sys time)
        {...}
this involves 2 x N threads due to { writer, reader } threads per sock,
hence the added 12cpus result (12 x 2 = 24 <= 50% saturation)
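e.g., the 12cpus column, assuming the command above with N=12:
stress-ng -t 30 --metrics-brief --sock 12 --no-rand-seed --sock-zerocopy
i.e. 12 x 2 = 24 threads on 48 CPUs, or 50% saturation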
[5] https://github.com/sched-ext/scx
cargo build --release -p scx_cosmos
On 1/27/26 10:17 AM, Peter Zijlstra wrote:
> On Tue, Jan 27, 2026 at 11:40:41AM +0100, Peter Zijlstra wrote:
>> On Fri, Jan 23, 2026 at 12:03:06PM +0100, Peter Zijlstra wrote:
>>> On Fri, Jan 23, 2026 at 11:50:46AM +0100, Peter Zijlstra wrote:
>>>> On Sun, Jan 18, 2026 at 03:46:22PM -0500, Mario Roy wrote:
>>>>> The patch "Proportional newidle balance" introduced a regression
>>>>> with Linux 6.12.65 and 6.18.5. There is noticeable regression with
>>>>> easyWave testing. [1]
>>>>>
>>>>> The CPU is an AMD Threadripper 9960X (24/48). I followed the source
>>>>> to install easyWave [2], i.e. fetching the two tar.gz archives.
>>>> What is the actual configuration of that chip? Is it like 3*8 or 4*6
>>>> (CCX wise). A quick google couldn't find me the answer :/
>>> Obviously I found it right after sending this. It's a 4x6 config.
>>> Meaning it needs newidle to balance between those 4 domains.
>> So with the below patch on top of my Xeon w7-2495X (which is 24-core
>> 48-thread) I too have 4 LLC :-)
>>
>> And I think I can see a slight difference, but nowhere near as terrible.
>>
>> Let me go stick some tracing on.
> Does this help some?
>
> Turns out, this easywave thing has a very low newidle rate, but then
> also a fairly low success rate. But since it doesn't do it that often,
> the cost isn't that significant so we might as well always do it etc..
>
> This adds a second term to the ratio computation that takes time into
> account. For low-rate newidle this term will dominate, while for higher
> rates the success ratio is more important.
>
> Chris, afaict this still DTRT for schbench, but if this works for Mario,
> could you also re-run things at your end?
>
> [ the 4 'second' thing is a bit random, but looking at the timings
> between easywave and schbench this seems to be a reasonable middle
> ground. Although I think 8 'seconds' -- 23 shift -- would also work.
>
> That would give:
>
> 1024 - 8 s - 64 Hz
> 512 - 4 s - 128 Hz
> 256 - 2 s - 256 Hz
> 128 - 1 s - 512 Hz
> 64 - .5 s - 1024 Hz
> 32 - .25 s - 2048 Hz
> ]
>
> ---
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 45c0022b91ce..a1e1032426dc 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -95,6 +95,7 @@ struct sched_domain {
> unsigned int newidle_call;
> unsigned int newidle_success;
> unsigned int newidle_ratio;
> + u64 newidle_stamp;
> u64 max_newidle_lb_cost;
> unsigned long last_decay_max_lb_cost;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eca642295c4b..ab9cf06c6a76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12224,8 +12224,31 @@ static inline void update_newidle_stats(struct sched_domain *sd, unsigned int su
> sd->newidle_call++;
> sd->newidle_success += success;
>
> if (sd->newidle_call >= 1024) {
> - sd->newidle_ratio = sd->newidle_success;
> + u64 now = sched_clock();
> + s64 delta = now - sd->newidle_stamp;
> + sd->newidle_stamp = now;
> + int ratio = 0;
> +
> + if (delta < 0)
> + delta = 0;
> +
> + if (sched_feat(NI_RATE)) {
> + /*
> + * ratio delta freq
> + *
> + * 1024 - 4 s - 128 Hz
> + * 512 - 2 s - 256 Hz
> + * 256 - 1 s - 512 Hz
> + * 128 - .5 s - 1024 Hz
> + * 64 - .25 s - 2048 Hz
> + */
> + ratio = delta >> 22;
> + }
> +
> + ratio += sd->newidle_success;
> +
> + sd->newidle_ratio = min(1024, ratio);
> sd->newidle_call /= 2;
> sd->newidle_success /= 2;
> }
> @@ -12932,7 +12959,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
> if (sd->flags & SD_BALANCE_NEWIDLE) {
> unsigned int weight = 1;
>
> - if (sched_feat(NI_RANDOM)) {
> + if (sched_feat(NI_RANDOM) && sd->newidle_ratio < 1024) {
> /*
> * Throw a 1k sided dice; and only run
> * newidle_balance according to the success
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 980d92bab8ab..7aba7523c6c1 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -126,3 +126,4 @@ SCHED_FEAT(LATENCY_WARN, false)
> * Do newidle balancing proportional to its success rate using randomization.
> */
> SCHED_FEAT(NI_RANDOM, true)
> +SCHED_FEAT(NI_RATE, true)
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index cf643a5ddedd..05741f18f334 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -4,6 +4,7 @@
> */
>
> #include <linux/sched/isolation.h>
> +#include <linux/sched/clock.h>
> #include <linux/bsearch.h>
> #include "sched.h"
>
> @@ -1637,6 +1638,7 @@ sd_init(struct sched_domain_topology_level *tl,
> struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
> int sd_id, sd_weight, sd_flags = 0;
> struct cpumask *sd_span;
> + u64 now = sched_clock();
>
> sd_weight = cpumask_weight(tl->mask(tl, cpu));
>
> @@ -1674,6 +1676,7 @@ sd_init(struct sched_domain_topology_level *tl,
> .newidle_call = 512,
> .newidle_success = 256,
> .newidle_ratio = 512,
> + .newidle_stamp = now,
>
> .max_newidle_lb_cost = 0,
> .last_decay_max_lb_cost = jiffies,
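
[ Note on the quoted shift, as a sanity check (my addition, not part of
  the patch): sched_clock() counts nanoseconds, so delta >> 22 divides
  by 2^22 ns (~4.19 ms):

      echo $(( 4000000000 >> 22 ))   # 4 s -> 953, i.e. ~1024
      echo $(( 2000000000 >> 22 ))   # 2 s -> 476, i.e. ~512

  matching the ratio/delta table in Peter's note above. ]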