[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20260127151748.GA1079264@noisy.programming.kicks-ass.net>
Date: Tue, 27 Jan 2026 16:17:48 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Mario Roy <marioeroy@...il.com>
Cc: Chris Mason <clm@...a.com>,
Joseph Salisbury <joseph.salisbury@...cle.com>,
Adam Li <adamli@...amperecomputing.com>,
Hazem Mohamed Abuelfotoh <abuehaze@...zon.com>,
Josh Don <joshdon@...gle.com>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
kprateek.nayak@....com
Subject: Re: [PATCH 4/4] sched/fair: Proportional newidle balance
On Tue, Jan 27, 2026 at 11:40:41AM +0100, Peter Zijlstra wrote:
> On Fri, Jan 23, 2026 at 12:03:06PM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 23, 2026 at 11:50:46AM +0100, Peter Zijlstra wrote:
> > > On Sun, Jan 18, 2026 at 03:46:22PM -0500, Mario Roy wrote:
> > > > The patch "Proportional newidle balance" introduced a regression
> > > > with Linux 6.12.65 and 6.18.5. There is noticeable regression with
> > > > easyWave testing. [1]
> > > >
> > > > The CPU is AMD Threadripper 9960X CPU (24/48). I followed the source
> > > > to install easyWave [2]. That is fetching the two tar.gz archives.
> > >
> > > What is the actual configuration of that chip? Is it like 3*8 or 4*6
> > > (CCX wise). A quick google couldn't find me the answer :/
> >
> > Obviously I found it right after sending this. It's a 4x6 config.
> > Meaning it needs newidle to balance between those 4 domains.
>
> So with the below patch on top of my Xeon w7-2495X (which is 24-core
> 48-thread) I too have 4 LLC :-)
>
> And I think I can see a slight difference, but nowhere near as terrible.
>
> Let me go stick some tracing on.
Does this help some?
Turns out, this easywave thing has a very low newidle rate, but then
also a fairly low success rate. But since it doesn't do it that often,
the cost isn't that significant so we might as well always do it etc..
This adds a second term to the ratio computation that takes time into
account, For low rate newidle this term will dominate, while for higher
rate the success ratio is more important.
Chris, afaict this still DTRT for schbench, but if this works for Mario,
could you also re-run things at your end?
[ the 4 'second' thing is a bit random, but looking at the timings
between easywave and schbench this seems to be a reasonable middle
ground. Although I think 8 'seconds' -- 23 shift -- would also work.
That would give:
1024 - 8 s - 64 Hz
512 - 4 s - 128 Hz
256 - 2 s - 256 Hz
128 - 1 s - 512 Hz
64 - .5 s - 1024 Hz
32 - .25 s - 2048 Hz
]
---
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 45c0022b91ce..a1e1032426dc 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -95,6 +95,7 @@ struct sched_domain {
unsigned int newidle_call;
unsigned int newidle_success;
unsigned int newidle_ratio;
+ u64 newidle_stamp;
u64 max_newidle_lb_cost;
unsigned long last_decay_max_lb_cost;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eca642295c4b..ab9cf06c6a76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12224,8 +12224,31 @@ static inline void update_newidle_stats(struct sched_domain *sd, unsigned int su
sd->newidle_call++;
sd->newidle_success += success;
if (sd->newidle_call >= 1024) {
- sd->newidle_ratio = sd->newidle_success;
+ u64 now = sched_clock();
+ s64 delta = now - sd->newidle_stamp;
+ sd->newidle_stamp = now;
+ int ratio = 0;
+
+ if (delta < 0)
+ delta = 0;
+
+ if (sched_feat(NI_RATE)) {
+ /*
+ * ratio delta freq
+ *
+ * 1024 - 4 s - 128 Hz
+ * 512 - 2 s - 256 Hz
+ * 256 - 1 s - 512 Hz
+ * 128 - .5 s - 1024 Hz
+ * 64 - .25 s - 2048 Hz
+ */
+ ratio = delta >> 22;
+ }
+
+ ratio += sd->newidle_success;
+
+ sd->newidle_ratio = min(1024, ratio);
sd->newidle_call /= 2;
sd->newidle_success /= 2;
}
@@ -12932,7 +12959,7 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
if (sd->flags & SD_BALANCE_NEWIDLE) {
unsigned int weight = 1;
- if (sched_feat(NI_RANDOM)) {
+ if (sched_feat(NI_RANDOM) && sd->newidle_ratio < 1024) {
/*
* Throw a 1k sided dice; and only run
* newidle_balance according to the success
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 980d92bab8ab..7aba7523c6c1 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -126,3 +126,4 @@ SCHED_FEAT(LATENCY_WARN, false)
* Do newidle balancing proportional to its success rate using randomization.
*/
SCHED_FEAT(NI_RANDOM, true)
+SCHED_FEAT(NI_RATE, true)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index cf643a5ddedd..05741f18f334 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -4,6 +4,7 @@
*/
#include <linux/sched/isolation.h>
+#include <linux/sched/clock.h>
#include <linux/bsearch.h>
#include "sched.h"
@@ -1637,6 +1638,7 @@ sd_init(struct sched_domain_topology_level *tl,
struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
int sd_id, sd_weight, sd_flags = 0;
struct cpumask *sd_span;
+ u64 now = sched_clock();
sd_weight = cpumask_weight(tl->mask(tl, cpu));
@@ -1674,6 +1676,7 @@ sd_init(struct sched_domain_topology_level *tl,
.newidle_call = 512,
.newidle_success = 256,
.newidle_ratio = 512,
+ .newidle_stamp = now,
.max_newidle_lb_cost = 0,
.last_decay_max_lb_cost = jiffies,
Powered by blists - more mailing lists