Message-ID: <396c640b-81bb-4be4-860d-7ab3ff667795@amd.com>
Date: Wed, 28 Jan 2026 09:38:31 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>
CC: Mario Roy <marioeroy@...il.com>, Chris Mason <clm@...a.com>, "Joseph
Salisbury" <joseph.salisbury@...cle.com>, Adam Li
<adamli@...amperecomputing.com>, Hazem Mohamed Abuelfotoh
<abuehaze@...zon.com>, Josh Don <joshdon@...gle.com>, <mingo@...hat.com>,
<juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 4/4] sched/fair: Proportional newidle balance
On 1/23/2026 5:54 PM, K Prateek Nayak wrote:
> Hello Peter,
>
> On 1/23/2026 4:33 PM, Peter Zijlstra wrote:
>> On Fri, Jan 23, 2026 at 11:50:46AM +0100, Peter Zijlstra wrote:
>>> On Sun, Jan 18, 2026 at 03:46:22PM -0500, Mario Roy wrote:
>>>> The patch "Proportional newidle balance" introduced a regression
>>>> with Linux 6.12.65 and 6.18.5. There is noticeable regression with
>>>> easyWave testing. [1]
>>>>
>>>> The CPU is AMD Threadripper 9960X CPU (24/48). I followed the source
>>>> to install easyWave [2]. That is fetching the two tar.gz archives.
>>>
>>> What is the actual configuration of that chip? Is it like 3*8 or 4*6
>>> (CCX wise). A quick google couldn't find me the answer :/
>>
>> Obviously I found it right after sending this. It's a 4x6 config.
>> Meaning it needs newidle to balance between those 4 domains.
>>
>> Pratheek -- are you guys still considering that SIS_NODE thing? That
>> worked really well for workstation chips, but there were some issues on
>> Epyc or so.
>
> SIS_NODE was really turned out to be a trade-off between search
> time vs search opportunity, especially when the system was heavily
> overloaded.
>
> Let me rebase those old patches and give it a spin over the weekend
> on a couple of those large machines (128C/256T and 192C/384T per
> socket) to see the damage. I'll update here by Tuesday or post out
> a series if I see the situation having changed on the recent
> kernels - some benchmarks had a completely different bottleneck
> there when we looked closer last.
So these are the results on tip:sched/core merged onto tip:sched/urgent
with SIS_NODE and SIS_NODE + SIS_UTIL [1] on a 512-CPU machine
(2 sockets x 16 CCXs (LLCs) x 8C/16T Zen4c cores):
tl;dr

(*) Consistent regressions, even with the SIS_UTIL bailout on the higher
    domain. These benchmarks mainly measure tail latency or have a
    thundering herd behavior that SIS_UTIL with the default
    imbalance_pct isn't able to fully adjust to.

(#) Data has run-to-run variance but is still worse on average.

Note: Although "new-schbench-wakeup-latency" shows a regression, the
      baseline is only a few "us", so adding a couple more "us" appears
      as a ~20%-30% regression.
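To make the note above concrete, here is a minimal sketch of how a small
absolute change turns into a large percentage when the baseline is tiny.
The absolute latencies are not given in this mail, so the 11 us baseline
below is purely hypothetical; it is picked only because +1/+2/+3 us on
top of it happens to give the -9.09/-18.18/-27.27 figures that recur in
the wakeup-latency table.

```python
def pct_improvement_lower_is_better(baseline, value):
    """Percent improvement over baseline when lower values are better."""
    return (baseline - value) / baseline * 100.0


baseline_us = 11.0  # hypothetical p99 wakeup latency, not from the mail

for extra_us in (1.0, 2.0, 3.0):
    pct = pct_improvement_lower_is_better(baseline_us, baseline_us + extra_us)
    print(f"{baseline_us:.0f} us -> {baseline_us + extra_us:.0f} us: {pct:+.2f}%")
```

With a baseline in the hundreds of microseconds, the same +3 us would
barely register, which is why the small-worker-count rows look dramatic.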
I'm still fighting dependency hell to get some of the longer running
benchmarks running on this system but I expect a few pct regressions
like last time [2].
System:
- 2 x 128C/256T Zen4c system with 16 CCXs per socket
- Boost on
- C2 disabled
- Each socket is a NUMA node
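
As a quick sanity check of the topology numbers listed above, the
arithmetic can be spelled out as below. The figures come from this mail;
the variable names are only illustrative.

```python
# Topology figures as stated in the mail; names are illustrative only.
sockets = 2
ccx_per_socket = 16      # one LLC per CCX
cores_per_ccx = 8
threads_per_core = 2     # SMT, i.e. 8C/16T per CCX

cores_per_socket = ccx_per_socket * cores_per_ccx
cpus_per_socket = cores_per_socket * threads_per_core
total_cpus = sockets * cpus_per_socket
llc_domains = sockets * ccx_per_socket

print(f"{cores_per_socket}C/{cpus_per_socket}T per socket, "
      f"{total_cpus} CPUs, {llc_domains} LLC domains")
```

So a wakeup that stays within one LLC can look at 16 CPUs at most, while
a search across the whole NUMA node has up to 256 to go through.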
Kernels:
tip: tip:sched/core at commit 377521af0341 ("sched: remove
task_struct->faults_disabled_mapping") merged onto
tip:sched/urgent at commit 15257cc2f905 ("sched/fair: Revert
force wakeup preemption")
sis-node: tip + sis_node patch + cpumask_and() moved to after
SIS_UTIL bailout [3]
sis-node-w-sis-util: Tree from [1] based on tip:sched/core merged onto
tip:sched/urgent
Full results:
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) sis-node[pct imp](CV) sis-node-w-sis-util[pct imp](CV)
1-groups 1.00 [ -0.00](11.61) 0.76 [ 24.30]( 4.43) 0.76 [ 24.05]( 2.93)
2-groups 1.00 [ -0.00]( 9.73) 0.86 [ 14.22](17.59) 0.80 [ 19.85](15.31)
4-groups 1.00 [ -0.00]( 5.88) 0.78 [ 21.87](11.93) 0.78 [ 21.64](14.33)
8-groups 1.00 [ -0.00]( 2.93) 0.92 [ 8.44]( 3.99) 0.92 [ 7.79]( 4.04)
16-groups 1.00 [ -0.00]( 1.77) 0.90 [ 10.47]( 5.61) 0.94 [ 5.92]( 5.65)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) sis-node[pct imp](CV) sis-node-w-sis-util[pct imp](CV)
1 1.00 [ 0.00]( 0.20) 1.00 [ -0.07]( 0.16) 1.01 [ 0.53]( 0.23)
2 1.00 [ 0.00]( 0.35) 1.00 [ -0.03]( 0.58) 1.00 [ 0.12]( 0.20)
4 1.00 [ 0.00]( 0.09) 1.01 [ 0.60]( 0.60) 1.00 [ 0.16]( 0.15)
8 1.00 [ 0.00]( 0.63) 1.00 [ -0.35]( 0.53) 1.00 [ 0.26]( 0.19)
16 1.00 [ 0.00]( 0.97) 1.00 [ 0.33]( 0.30) 1.01 [ 1.16]( 0.50)
32 1.00 [ 0.00]( 0.98) 1.02 [ 1.54]( 0.91) 1.01 [ 1.10]( 0.26)
64 1.00 [ 0.00]( 3.45) 1.02 [ 1.88]( 0.48) 1.02 [ 1.78]( 1.29)
128 1.00 [ 0.00]( 2.49) 1.00 [ -0.01]( 1.63) 0.99 [ -0.68]( 1.88)
256 1.00 [ 0.00]( 0.57) 1.01 [ 0.73]( 0.45) 1.01 [ 0.92]( 0.35)
512 1.00 [ 0.00]( 3.92) 0.51 [-48.55]( 0.11) 0.80 [-19.59]( 6.31) (*)
1024 1.00 [ 0.00]( 0.10) 0.98 [ -2.11]( 0.09) 0.97 [ -3.29]( 0.28)
2048 1.00 [ 0.00]( 0.09) 0.98 [ -2.08]( 0.28) 0.99 [ -0.75]( 0.48)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) sis-node[pct imp](CV) sis-node-w-sis-util[pct imp](CV)
Copy 1.00 [ 0.00]( 0.31) 0.99 [ -0.70]( 0.57) 1.00 [ -0.09]( 1.44)
Scale 1.00 [ 0.00]( 0.38) 0.99 [ -1.00]( 0.49) 1.00 [ 0.32]( 1.41)
Add 1.00 [ 0.00]( 0.31) 0.99 [ -0.95]( 0.63) 1.00 [ 0.43]( 1.16)
Triad 1.00 [ 0.00]( 0.18) 0.99 [ -0.84]( 0.68) 1.00 [ 0.16]( 1.12)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) sis-node[pct imp](CV) sis-node-w-sis-util[pct imp](CV)
Copy 1.00 [ 0.00]( 1.46) 1.00 [ 0.39]( 1.57) 1.01 [ 0.82]( 0.52)
Scale 1.00 [ 0.00]( 1.45) 1.00 [ 0.49]( 1.37) 1.01 [ 1.20]( 0.55)
Add 1.00 [ 0.00]( 1.09) 1.00 [ 0.31]( 0.94) 1.01 [ 0.79]( 0.35)
Triad 1.00 [ 0.00]( 1.06) 1.00 [ 0.22]( 1.02) 1.01 [ 0.56]( 0.19)
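
For readers unfamiliar with the "Statistic" and "(CV)" fields used in
these tables, here is a small sketch of the three quantities. It assumes
that "(CV)" is the coefficient of variation (standard deviation over
mean, in percent) and that AMean/HMean are the arithmetic and harmonic
means; the sample values are hypothetical and not taken from any run.

```python
import statistics


def cv_percent(samples):
    """Coefficient of variation: sample stdev as a percentage of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100.0


runs = [9.0, 10.0, 11.0]   # hypothetical per-run results
rates = [1.0, 4.0, 4.0]    # hypothetical per-run bandwidths

print("AMean:", statistics.mean(runs))
print("CV   :", cv_percent(runs), "%")
print("HMean:", statistics.harmonic_mean(rates))
```

The harmonic mean is pulled towards the slowest run, which is the usual
reason to prefer it for rate-like metrics such as bandwidth.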
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) sis-node[pct imp](CV) sis-node-w-sis-util[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.27) 0.99 [ -0.82]( 0.26) 0.99 [ -0.78]( 0.16)
2-clients 1.00 [ 0.00]( 0.28) 0.99 [ -0.87]( 0.19) 1.00 [ -0.17]( 0.67)
4-clients 1.00 [ 0.00]( 0.38) 1.00 [ -0.47]( 0.33) 0.99 [ -0.53]( 0.31)
8-clients 1.00 [ 0.00]( 0.34) 0.99 [ -0.55]( 0.18) 1.00 [ -0.33]( 0.24)
16-clients 1.00 [ 0.00]( 0.30) 1.00 [ -0.39]( 0.23) 1.00 [ -0.19]( 0.26)
32-clients 1.00 [ 0.00]( 0.43) 1.00 [ -0.40]( 0.57) 1.00 [ -0.24]( 0.68)
64-clients 1.00 [ 0.00]( 0.82) 1.00 [ -0.12]( 0.45) 1.00 [ -0.14]( 0.70)
128-clients 1.00 [ 0.00]( 1.21) 1.00 [ 0.10]( 1.28) 1.00 [ 0.08]( 1.19)
256-clients 1.00 [ 0.00]( 1.38) 1.01 [ 0.65]( 0.89) 1.00 [ 0.34]( 0.89)
512-clients 1.00 [ 0.00]( 8.76) 0.47 [-52.76]( 1.64) 0.77 [-23.10](10.06) (*)
768-clients 1.00 [ 0.00](34.29) 0.83 [-16.89](30.45) 0.98 [ -2.16](36.19)
1024-clients 1.00 [ 0.00](47.96) 0.91 [ -9.29](36.02) 0.98 [ -1.93](46.36)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sis-node[pct imp](CV) sis-node-w-sis-util[pct imp](CV)
1 1.00 [ -0.00](14.20) 1.72 [-72.00](15.01) 0.88 [ 12.00]( 4.55)
2 1.00 [ -0.00]( 1.68) 1.09 [ -8.82]( 6.96) 0.97 [ 2.94]( 9.90)
4 1.00 [ -0.00]( 4.45) 1.18 [-17.65]( 5.29) 1.03 [ -2.94]( 3.24)
8 1.00 [ -0.00]( 2.44) 1.12 [-12.20]( 4.35) 1.02 [ -2.44]( 2.38)
16 1.00 [ -0.00]( 0.00) 1.04 [ -3.64]( 1.75) 0.98 [ 1.82]( 1.85)
32 1.00 [ -0.00]( 2.87) 1.03 [ -2.53]( 2.80) 0.99 [ 1.27]( 1.47)
64 1.00 [ -0.00]( 3.17) 1.02 [ -1.57]( 5.72) 0.98 [ 2.36]( 2.30)
128 1.00 [ -0.00]( 2.95) 1.01 [ -1.35]( 3.03) 1.00 [ -0.00]( 1.13)
256 1.00 [ -0.00]( 1.17) 0.99 [ 1.23]( 1.75) 0.99 [ 1.43]( 1.56)
512 1.00 [ -0.00]( 4.54) 1.14 [-13.60]( 2.41) 0.97 [ 2.50]( 0.42)
768 1.00 [ -0.00]( 2.24) 1.27 [-27.44]( 3.18) 1.12 [-11.54]( 5.64) (*)
1024 1.00 [ -0.00]( 0.28) 1.14 [-14.20]( 0.56) 1.13 [-13.00]( 1.01) (*)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sis-node[pct imp](CV) sis-node-w-sis-util[pct imp](CV)
1 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.15)
2 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.15)
4 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.29]( 0.15)
8 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.29]( 0.00)
16 1.00 [ 0.00]( 0.15) 1.00 [ -0.29]( 0.15) 1.00 [ 0.00]( 0.00)
32 1.00 [ 0.00]( 0.15) 1.00 [ -0.29]( 0.00) 1.00 [ 0.00]( 0.15)
64 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00) 1.00 [ 0.29]( 0.00)
128 1.00 [ 0.00]( 0.27) 1.00 [ 0.00](18.48) 0.65 [-34.50](24.12) (#)
256 1.00 [ 0.00]( 0.00) 0.99 [ -0.58]( 0.00) 0.99 [ -0.58]( 0.00)
512 1.00 [ 0.00]( 1.05) 1.00 [ 0.00]( 0.20) 1.00 [ 0.39]( 0.87)
768 1.00 [ 0.00]( 0.95) 0.98 [ -1.88]( 0.93) 0.99 [ -0.71]( 0.53)
1024 1.00 [ 0.00]( 0.49) 0.99 [ -0.81]( 0.57) 1.00 [ 0.00]( 0.74)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sis-node[pct imp](CV) sis-node-w-sis-util[pct imp](CV)
1 1.00 [ -0.00]( 6.74) 2.38 [-137.50](29.34) 1.75 [-75.00]( 9.53)
2 1.00 [ -0.00](12.06) 1.27 [-27.27]( 9.53) 1.36 [-36.36]( 6.59)
4 1.00 [ -0.00](11.71) 1.33 [-33.33]( 3.30) 1.33 [-33.33]( 3.16)
8 1.00 [ -0.00]( 0.00) 1.27 [-27.27](12.69) 1.09 [ -9.09]( 4.43)
16 1.00 [ -0.00]( 4.84) 1.09 [ -9.09]( 4.43) 1.18 [-18.18](10.79)
32 1.00 [ -0.00]( 0.00) 1.00 [ -0.00]( 0.00) 1.10 [-10.00]( 4.56)
64 1.00 [ -0.00](13.22) 1.00 [ -0.00]( 5.00) 1.00 [ -0.00]( 9.68)
128 1.00 [ -0.00]( 8.13) 1.00 [ -0.00]( 8.85) 1.18 [-18.18](13.76)
256 1.00 [ -0.00]( 2.97) 1.02 [ -1.94]( 3.80) 1.08 [ -7.77]( 7.13)
512 1.00 [ -0.00]( 1.25) 1.00 [ 0.37]( 0.68) 1.00 [ -0.37]( 1.81)
768 1.00 [ -0.00]( 0.00) 1.00 [ -0.00]( 0.00) 1.00 [ -0.00]( 0.00)
1024 1.00 [ -0.00]( 0.63) 1.00 [ -0.11]( 4.06) 1.00 [ -0.11]( 3.13)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) sis-node[pct imp](CV) sis-node-w-sis-util[pct imp](CV)
1 1.00 [ -0.00]( 0.14) 1.00 [ -0.26]( 0.14) 1.00 [ -0.00]( 0.14)
2 1.00 [ -0.00]( 0.14) 1.00 [ -0.26]( 0.00) 1.00 [ -0.00]( 0.14)
4 1.00 [ -0.00]( 0.00) 1.00 [ -0.00]( 0.00) 1.00 [ 0.26]( 0.14)
8 1.00 [ -0.00]( 0.00) 1.00 [ -0.00]( 0.00) 1.00 [ 0.26]( 0.14)
16 1.00 [ -0.00]( 0.00) 1.00 [ -0.00]( 0.00) 1.01 [ -0.53]( 1.18)
32 1.00 [ -0.00]( 0.54) 1.01 [ -1.05]( 0.59) 0.99 [ 0.53]( 0.27)
64 1.00 [ -0.00]( 0.00) 1.00 [ 0.26]( 1.08) 1.00 [ 0.26](31.75)
128 1.00 [ -0.00]( 0.61) 1.00 [ -0.00]( 4.19) 1.10 [-10.22]( 4.79) (#)
256 1.00 [ -0.00]( 0.43) 1.01 [ -1.39]( 0.74) 1.02 [ -1.63]( 0.66)
512 1.00 [ -0.00]( 3.32) 1.00 [ 0.23]( 1.62) 1.04 [ -3.72]( 3.79)
768 1.00 [ -0.00]( 0.88) 0.95 [ 4.52]( 0.63) 0.98 [ 1.94]( 0.54)
1024 1.00 [ -0.00]( 1.01) 0.98 [ 1.54]( 0.91) 1.00 [ 0.17]( 0.31)
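
In case someone wants to post-process the tables above, here is a rough
sketch of a parser for one result row. It relies only on what is visible
in the tables: each cell is "value [pct imp](CV)", and "pct imp" is
already signed so that a negative number is a regression regardless of
whether the test is lower-is-better or higher-is-better. The function
names and the 5% threshold are made up for illustration.

```python
import re

# One "value [pct imp](CV)" cell, e.g. "0.51 [-48.55]( 0.11)".
CELL = re.compile(r"(\d+\.\d+)\s*\[\s*(-?\d+\.\d+)\]\(\s*(\d+\.\d+)\)")
KERNELS = ("tip", "sis-node", "sis-node-w-sis-util")


def parse_row(line):
    """Return (label, {kernel: (normalized, pct_imp, cv)}) for one row."""
    label = line.split()[0]
    cells = [tuple(float(v) for v in m) for m in CELL.findall(line)]
    return label, dict(zip(KERNELS, cells))


def regressions(line, threshold=5.0):
    """List the kernels whose pct imp is worse than -threshold."""
    _, cells = parse_row(line)
    return [k for k, (_, pct, _) in cells.items() if pct < -threshold]


# tbench, 512 clients, copied from the table above.
row = "512 1.00 [ 0.00]( 3.92) 0.51 [-48.55]( 0.11) 0.80 [-19.59]( 6.31) (*)"
print(parse_row(row))
print(regressions(row))
```

Comparing the flagged "pct imp" against the CV of the same cell is a
cheap way to separate rows like (*) from noisy ones like (#).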
Let me go play around with imbalance_pct for SIS_UTIL at the PKG/NODE
domain to see if there is a sweet spot that keeps everything happy while
things are happier on average.
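
For context on why imbalance_pct is the knob here, below is a rough
Python model of how SIS_UTIL derives its scan depth from utilization. It
is written from my reading of update_idle_cpu_scan() in
kernel/sched/fair.c for the LLC domain and should be treated as an
approximation, not as the kernel code; how it would carry over to a
PKG/NODE domain is exactly the open question. The function name and the
117 default are assumptions for illustration.

```python
SCHED_CAPACITY_SCALE = 1024  # fixed-point unit used for CPU capacity


def nr_idle_scan(sum_util, llc_weight, imbalance_pct=117):
    """Approximate number of CPUs to scan for an idle CPU in one domain.

    sum_util:      summed utilization of all CPUs in the domain
    llc_weight:    number of CPUs in the domain
    imbalance_pct: domain imbalance_pct (117 is an assumed default)
    """
    x = sum_util // llc_weight                     # average util per CPU
    tmp = x * x * imbalance_pct * imbalance_pct
    tmp //= 10000 * SCHED_CAPACITY_SCALE
    tmp = min(tmp, SCHED_CAPACITY_SCALE)
    y = SCHED_CAPACITY_SCALE - tmp                 # remaining headroom
    return y * llc_weight // SCHED_CAPACITY_SCALE


# A 16-CPU domain at roughly 50% average utilization.
for pct in (110, 117, 125):
    print(pct, nr_idle_scan(16 * 512, 16, pct))
```

In this model a larger imbalance_pct cuts the scan depth sooner as
utilization rises, so tuning it trades search time against search
opportunity.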
I doubt Meta's workload will be happy with more aggressive SIS_UTIL
limits, since data from David's SHARED_RUNQ series [4] showed that this
specific workload requires aggressive search + aggressive newidle balance.
References:
[1] https://github.com/kudureranganath/linux/commits/kudure/sched/sis_node/
[2] https://lore.kernel.org/all/3de5c24f-6437-f21b-ed61-76b86a199e8c@amd.com/
[3] https://github.com/kudureranganath/linux/commit/7639cf7632853b91e6a5b449eee08d3399b10d31
[4] https://lore.kernel.org/lkml/20230809221218.163894-1-void@manifault.com/
--
Thanks and Regards,
Prateek