Message-ID: <4cde5b36-4ef3-4dc8-a540-99287d621c7f@amd.com>
Date: Tue, 24 Jun 2025 10:30:44 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>,
Abel Wu <wuyun.abel@...edance.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Len Brown <len.brown@...el.com>,
linux-kernel@...r.kernel.org, Chen Yu <yu.c.chen@...el.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
Hello Tim,
On 6/18/2025 11:57 PM, Tim Chen wrote:
> AMD Milan is also tested. There are 4 Nodes and 32 CPUs per node.
> Each node has 4 CCX(shared LLC) and each CCX has 8 CPUs. Hackbench
> with 1 group test scenario benefits from cache aware load balance
> too:
>
> hackbench (1 group and fd ranges in [1,6]):
> case load baseline(std%) compare%( std%)
> threads-pipe-1 1-groups 1.00 ( 1.22) +2.84 ( 0.51)
> threads-pipe-2 1-groups 1.00 ( 5.82) +42.82 ( 43.61)
> threads-pipe-3 1-groups 1.00 ( 3.49) +17.33 ( 18.68)
> threads-pipe-4 1-groups 1.00 ( 2.49) +12.49 ( 5.89)
> threads-pipe-5 1-groups 1.00 ( 1.46) +8.62 ( 4.43)
> threads-pipe-6 1-groups 1.00 ( 2.83) +12.73 ( 8.94)
> threads-sockets-1 1-groups 1.00 ( 1.31) +28.68 ( 2.25)
> threads-sockets-2 1-groups 1.00 ( 5.17) +34.84 ( 36.90)
> threads-sockets-3 1-groups 1.00 ( 1.57) +9.15 ( 5.52)
> threads-sockets-4 1-groups 1.00 ( 1.99) +16.51 ( 6.04)
> threads-sockets-5 1-groups 1.00 ( 2.39) +10.88 ( 2.17)
> threads-sockets-6 1-groups 1.00 ( 1.62) +7.22 ( 2.00)
>
> Besides a single instance of hackbench, four instances of hackbench are
> also tested on Milan. The test results show that different instances of
> hackbench are aggregated to dedicated LLCs, and performance improvement
> is observed.
>
> schbench mmtests(unstable)
> baseline nowake_lb
> Lat 50.0th-qrtle-1 9.00 ( 0.00%) 8.00 ( 11.11%)
> Lat 90.0th-qrtle-1 12.00 ( 0.00%) 10.00 ( 16.67%)
> Lat 99.0th-qrtle-1 16.00 ( 0.00%) 14.00 ( 12.50%)
> Lat 99.9th-qrtle-1 22.00 ( 0.00%) 21.00 ( 4.55%)
> Lat 20.0th-qrtle-1 759.00 ( 0.00%) 759.00 ( 0.00%)
> Lat 50.0th-qrtle-2 9.00 ( 0.00%) 7.00 ( 22.22%)
> Lat 90.0th-qrtle-2 12.00 ( 0.00%) 12.00 ( 0.00%)
> Lat 99.0th-qrtle-2 16.00 ( 0.00%) 15.00 ( 6.25%)
> Lat 99.9th-qrtle-2 22.00 ( 0.00%) 21.00 ( 4.55%)
> Lat 20.0th-qrtle-2 1534.00 ( 0.00%) 1510.00 ( 1.56%)
> Lat 50.0th-qrtle-4 8.00 ( 0.00%) 9.00 ( -12.50%)
> Lat 90.0th-qrtle-4 12.00 ( 0.00%) 12.00 ( 0.00%)
> Lat 99.0th-qrtle-4 15.00 ( 0.00%) 16.00 ( -6.67%)
> Lat 99.9th-qrtle-4 21.00 ( 0.00%) 23.00 ( -9.52%)
> Lat 20.0th-qrtle-4 3076.00 ( 0.00%) 2860.00 ( 7.02%)
> Lat 50.0th-qrtle-8 10.00 ( 0.00%) 9.00 ( 10.00%)
> Lat 90.0th-qrtle-8 12.00 ( 0.00%) 13.00 ( -8.33%)
> Lat 99.0th-qrtle-8 17.00 ( 0.00%) 17.00 ( 0.00%)
> Lat 99.9th-qrtle-8 22.00 ( 0.00%) 24.00 ( -9.09%)
> Lat 20.0th-qrtle-8 6232.00 ( 0.00%) 5896.00 ( 5.39%)
> Lat 50.0th-qrtle-16 9.00 ( 0.00%) 9.00 ( 0.00%)
> Lat 90.0th-qrtle-16 13.00 ( 0.00%) 13.00 ( 0.00%)
> Lat 99.0th-qrtle-16 17.00 ( 0.00%) 18.00 ( -5.88%)
> Lat 99.9th-qrtle-16 23.00 ( 0.00%) 26.00 ( -13.04%)
> Lat 20.0th-qrtle-16 10096.00 ( 0.00%) 10352.00 ( -2.54%)
> Lat 50.0th-qrtle-32 15.00 ( 0.00%) 15.00 ( 0.00%)
> Lat 90.0th-qrtle-32 25.00 ( 0.00%) 26.00 ( -4.00%)
> Lat 99.0th-qrtle-32 49.00 ( 0.00%) 50.00 ( -2.04%)
> Lat 99.9th-qrtle-32 945.00 ( 0.00%) 1005.00 ( -6.35%)
> Lat 20.0th-qrtle-32 11600.00 ( 0.00%) 11632.00 ( -0.28%)
>
> Netperf/Tbench have not been tested yet, as they are single-process
> benchmarks that are not the target of this cache-aware scheduling.
> Additionally, client and server components should be tested on
> different machines or bound to different nodes. Otherwise,
> cache-aware scheduling might harm their performance: placing client
> and server in the same LLC could yield higher throughput due to
> improved cache locality in the TCP/IP stack, whereas cache-aware
> scheduling aims to place them in dedicated LLCs.
I have similar observations from my testing.
tl;dr
o Benchmarks that prefer co-location and run in threaded mode see
a benefit, including hackbench at high utilization and schbench
at low utilization.
o schbench (both new and old, but particularly the old) regresses
quite a bit on the tail latency metric when #workers crosses the
LLC size.
o client-server benchmarks where client and servers are threads
from different processes (netserver-netperf, tbench_srv-tbench,
services of DeathStarBench) seem to noticeably regress due to
lack of co-location between the communicating client and server.
Not sure if WF_SYNC can be an indicator to temporarily ignore
the preferred LLC hint.
o stream regresses in some runs where the occupancy metrics trip
and assign a preferred LLC for all the stream threads, bringing
down performance in ~50% of the runs.
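To illustrate the WF_SYNC idea above: on a synchronous wakeup the waker is
about to block (typical of the communicating client/server pairs that
regress here), so co-locating the wakee with the waker may matter more than
honoring the preferred-LLC hint. The sketch below is purely illustrative
userspace code, not kernel code; the function and parameter names are made
up and only WF_SYNC mirrors the flag in kernel/sched/sched.h.

```c
#include <assert.h>

/* Illustrative copy of the kernel's synchronous-wakeup flag. */
#define WF_SYNC 0x10

/*
 * Hypothetical placement decision: on a WF_SYNC wakeup, follow the
 * waker's LLC so the communicating pair shares a cache; otherwise
 * honor the cache-aware preferred-LLC hint. Not kernel API.
 */
static int pick_llc(int wake_flags, int waker_llc, int preferred_llc)
{
	if (wake_flags & WF_SYNC)
		return waker_llc;	/* temporarily ignore the hint */
	return preferred_llc;		/* normal cache-aware placement */
}
```

Whether this helps in practice would depend on how often WF_SYNC wakeups
actually precede the waker blocking for these workloads.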
Full data from my testing is as follows:
o Machine details
- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)
o Kernel details
tip: tip:sched/core at commit 914873bc7df9 ("Merge tag
'x86-build-2025-05-25' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
llc-aware-lb-v3: tip + this series as is
o Benchmark results
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1-groups 1.00 [ -0.00](13.74) 1.03 [ -2.77](12.01)
2-groups 1.00 [ -0.00]( 9.58) 1.02 [ -1.78]( 6.12)
4-groups 1.00 [ -0.00]( 2.10) 1.01 [ -0.87]( 0.91)
8-groups 1.00 [ -0.00]( 1.51) 1.03 [ -3.31]( 2.06)
16-groups 1.00 [ -0.00]( 1.10) 0.95 [ 5.36]( 1.67)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1 1.00 [ 0.00]( 0.82) 0.96 [ -3.68]( 1.23)
2 1.00 [ 0.00]( 1.13) 0.98 [ -2.30]( 0.51)
4 1.00 [ 0.00]( 1.12) 0.96 [ -4.14]( 0.22)
8 1.00 [ 0.00]( 0.93) 0.96 [ -3.61]( 0.46)
16 1.00 [ 0.00]( 0.38) 0.95 [ -4.98]( 1.26)
32 1.00 [ 0.00]( 0.66) 0.93 [ -7.12]( 2.22)
64 1.00 [ 0.00]( 1.18) 0.95 [ -5.44]( 0.37)
128 1.00 [ 0.00]( 1.12) 0.93 [ -6.78]( 0.64)
256 1.00 [ 0.00]( 0.42) 0.94 [ -6.45]( 0.47)
512 1.00 [ 0.00]( 0.14) 0.93 [ -7.26]( 0.27)
1024 1.00 [ 0.00]( 0.26) 0.92 [ -7.57]( 0.31)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
Copy 1.00 [ 0.00]( 8.37) 0.39 [-61.05](44.88)
Scale 1.00 [ 0.00]( 2.85) 0.43 [-57.26](40.60)
Add 1.00 [ 0.00]( 3.39) 0.40 [-59.88](42.02)
Triad 1.00 [ 0.00]( 6.39) 0.41 [-58.93](42.98)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
Copy 1.00 [ 0.00]( 3.91) 0.36 [-63.95](51.04)
Scale 1.00 [ 0.00]( 4.34) 0.40 [-60.31](43.12)
Add 1.00 [ 0.00]( 4.14) 0.38 [-62.46](43.40)
Triad 1.00 [ 0.00]( 1.00) 0.36 [-64.38](43.12)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.41) 0.97 [ -3.26]( 1.30)
2-clients 1.00 [ 0.00]( 0.58) 0.96 [ -4.24]( 0.71)
4-clients 1.00 [ 0.00]( 0.35) 0.96 [ -4.19]( 0.67)
8-clients 1.00 [ 0.00]( 0.48) 0.95 [ -5.41]( 1.36)
16-clients 1.00 [ 0.00]( 0.66) 0.95 [ -5.31]( 0.93)
32-clients 1.00 [ 0.00]( 1.15) 0.94 [ -6.43]( 1.44)
64-clients 1.00 [ 0.00]( 1.38) 0.93 [ -7.14]( 1.63)
128-clients 1.00 [ 0.00]( 0.87) 0.89 [-10.62]( 0.78)
256-clients 1.00 [ 0.00]( 5.36) 0.92 [ -8.04]( 2.64)
512-clients 1.00 [ 0.00](54.39) 0.88 [-12.12](48.87)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1 1.00 [ -0.00]( 8.54) 0.54 [ 45.65](28.79)
2 1.00 [ -0.00]( 1.15) 0.56 [ 44.00]( 2.09)
4 1.00 [ -0.00](13.46) 0.67 [ 33.33](35.68)
8 1.00 [ -0.00]( 7.14) 0.63 [ 36.84]( 4.28)
16 1.00 [ -0.00]( 3.49) 1.05 [ -5.08]( 9.13)
32 1.00 [ -0.00]( 1.06) 32.04 [-3104.26](81.31)
64 1.00 [ -0.00]( 5.48) 24.51 [-2351.16](81.18)
128 1.00 [ -0.00](10.45) 14.56 [-1356.07]( 5.35)
256 1.00 [ -0.00](31.14) 0.95 [ 4.80](20.88)
512 1.00 [ -0.00]( 1.52) 1.00 [ -0.25]( 1.26)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1 1.00 [ 0.00]( 1.07) 0.97 [ -3.24]( 0.98)
2 1.00 [ 0.00]( 0.00) 0.99 [ -1.17]( 0.15)
4 1.00 [ 0.00]( 0.00) 0.96 [ -3.50]( 0.56)
8 1.00 [ 0.00]( 0.15) 0.98 [ -1.76]( 0.31)
16 1.00 [ 0.00]( 0.00) 0.94 [ -6.13]( 1.93)
32 1.00 [ 0.00]( 3.41) 0.97 [ -3.18]( 2.10)
64 1.00 [ 0.00]( 1.05) 0.82 [-18.14](18.41)
128 1.00 [ 0.00]( 0.00) 0.98 [ -2.27]( 0.20)
256 1.00 [ 0.00]( 0.72) 1.01 [ 1.23]( 0.31)
512 1.00 [ 0.00]( 0.57) 1.00 [ 0.00]( 0.12)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1 1.00 [ -0.00]( 9.11) 0.88 [ 12.50](11.92)
2 1.00 [ -0.00]( 0.00) 0.86 [ 14.29](11.92)
4 1.00 [ -0.00]( 3.78) 0.93 [ 7.14]( 4.08)
8 1.00 [ -0.00]( 0.00) 0.83 [ 16.67]( 5.34)
16 1.00 [ -0.00]( 7.56) 0.85 [ 15.38]( 0.00)
32 1.00 [ -0.00](15.11) 0.80 [ 20.00]( 4.19)
64 1.00 [ -0.00]( 9.63) 1.05 [ -5.00](24.47)
128 1.00 [ -0.00]( 4.86) 1.57 [-56.78](68.52)
256 1.00 [ -0.00]( 2.34) 1.00 [ -0.00]( 0.57)
512 1.00 [ -0.00]( 0.40) 1.00 [ -0.00]( 0.34)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1 1.00 [ -0.00]( 2.73) 1.06 [ -5.71]( 0.25)
2 1.00 [ -0.00]( 0.87) 1.08 [ -8.37]( 0.78)
4 1.00 [ -0.00]( 1.21) 1.09 [ -9.15]( 0.79)
8 1.00 [ -0.00]( 0.27) 1.06 [ -6.31]( 0.51)
16 1.00 [ -0.00]( 4.04) 1.85 [-84.55]( 5.11)
32 1.00 [ -0.00]( 7.35) 1.52 [-52.16]( 0.83)
64 1.00 [ -0.00]( 3.54) 1.06 [ -5.77]( 2.62)
128 1.00 [ -0.00]( 0.37) 1.09 [ -9.18](28.47)
256 1.00 [ -0.00]( 9.57) 0.99 [ 0.60]( 0.48)
512 1.00 [ -0.00]( 1.82) 1.03 [ -2.80]( 1.16)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra -0.99%
ycsb-mongodb -0.96%
deathstarbench-1x -2.09%
deathstarbench-2x -0.26%
deathstarbench-3x -3.34%
deathstarbench-6x -3.03%
hammerdb+mysql 16VU -2.15%
hammerdb+mysql 64VU -3.77%
>
> This patch set is applied on v6.15 kernel.
>
> There are some further work needed for future versions in this
> patch set. We will need to align NUMA balancing with LLC aggregations
> such that LLC aggregation will align with the preferred NUMA node.
>
> Comments and tests are much appreciated.
I'll rerun the test once with the SCHED_FEAT() disabled just to make
sure I'm not regressing because of some other factors. For the major
regressions, I'll get the "perf sched stats" data to see if anything
stands out.
I'm also planning on getting the data from a Zen5c system with a larger
LLC to see if there is any difference in the trend (I'll start with the
microbenchmarks since setting up the larger ones will take some time).
Sorry for the lack of engagement on previous versions, but I plan on
taking a better look at the series this time around. If you need any
specific data from my setup, please do let me know.
--
Thanks and Regards,
Prateek