Message-ID: <1dd0ea0b-4515-4507-9b50-75de87fee377@intel.com>
Date: Tue, 24 Jun 2025 20:16:02 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>, Abel Wu
<wuyun.abel@...edance.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Len Brown <len.brown@...el.com>,
<linux-kernel@...r.kernel.org>, Tim Chen <tim.c.chen@...ux.intel.com>, "Peter
Zijlstra" <peterz@...radead.org>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Ingo Molnar <mingo@...hat.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
On 6/24/2025 1:00 PM, K Prateek Nayak wrote:
> Hello Tim,
>
> On 6/18/2025 11:57 PM, Tim Chen wrote:
>> AMD Milan is also tested. There are 4 Nodes and 32 CPUs per node.
>> Each node has 4 CCX(shared LLC) and each CCX has 8 CPUs. Hackbench
>> with 1 group test scenario benefits from cache aware load balance
>> too:
>>
>> hackbench (1 group and fd ranges in [1,6]):
>> case load baseline(std%) compare%( std%)
>> threads-pipe-1 1-groups 1.00 ( 1.22) +2.84 ( 0.51)
>> threads-pipe-2 1-groups 1.00 ( 5.82) +42.82 ( 43.61)
>> threads-pipe-3 1-groups 1.00 ( 3.49) +17.33 ( 18.68)
>> threads-pipe-4 1-groups 1.00 ( 2.49) +12.49 ( 5.89)
>> threads-pipe-5 1-groups 1.00 ( 1.46) +8.62 ( 4.43)
>> threads-pipe-6 1-groups 1.00 ( 2.83) +12.73 ( 8.94)
>> threads-sockets-1 1-groups 1.00 ( 1.31) +28.68 ( 2.25)
>> threads-sockets-2 1-groups 1.00 ( 5.17) +34.84 ( 36.90)
>> threads-sockets-3 1-groups 1.00 ( 1.57) +9.15 ( 5.52)
>> threads-sockets-4 1-groups 1.00 ( 1.99) +16.51 ( 6.04)
>> threads-sockets-5 1-groups 1.00 ( 2.39) +10.88 ( 2.17)
>> threads-sockets-6 1-groups 1.00 ( 1.62) +7.22 ( 2.00)
>>
>> Besides a single instance of hackbench, four instances of hackbench are
>> also tested on Milan. The test results show that different instances of
>> hackbench are aggregated to dedicated LLCs, and performance improvement
>> is observed.
>>
>> schbench mmtests(unstable)
>> baseline nowake_lb
>> Lat 50.0th-qrtle-1 9.00 ( 0.00%) 8.00 ( 11.11%)
>> Lat 90.0th-qrtle-1 12.00 ( 0.00%) 10.00 ( 16.67%)
>> Lat 99.0th-qrtle-1 16.00 ( 0.00%) 14.00 ( 12.50%)
>> Lat 99.9th-qrtle-1 22.00 ( 0.00%) 21.00 ( 4.55%)
>> Lat 20.0th-qrtle-1 759.00 ( 0.00%) 759.00 ( 0.00%)
>> Lat 50.0th-qrtle-2 9.00 ( 0.00%) 7.00 ( 22.22%)
>> Lat 90.0th-qrtle-2 12.00 ( 0.00%) 12.00 ( 0.00%)
>> Lat 99.0th-qrtle-2 16.00 ( 0.00%) 15.00 ( 6.25%)
>> Lat 99.9th-qrtle-2 22.00 ( 0.00%) 21.00 ( 4.55%)
>> Lat 20.0th-qrtle-2 1534.00 ( 0.00%) 1510.00 ( 1.56%)
>> Lat 50.0th-qrtle-4 8.00 ( 0.00%) 9.00 ( -12.50%)
>> Lat 90.0th-qrtle-4 12.00 ( 0.00%) 12.00 ( 0.00%)
>> Lat 99.0th-qrtle-4 15.00 ( 0.00%) 16.00 ( -6.67%)
>> Lat 99.9th-qrtle-4 21.00 ( 0.00%) 23.00 ( -9.52%)
>> Lat 20.0th-qrtle-4 3076.00 ( 0.00%) 2860.00 ( 7.02%)
>> Lat 50.0th-qrtle-8 10.00 ( 0.00%) 9.00 ( 10.00%)
>> Lat 90.0th-qrtle-8 12.00 ( 0.00%) 13.00 ( -8.33%)
>> Lat 99.0th-qrtle-8 17.00 ( 0.00%) 17.00 ( 0.00%)
>> Lat 99.9th-qrtle-8 22.00 ( 0.00%) 24.00 ( -9.09%)
>> Lat 20.0th-qrtle-8 6232.00 ( 0.00%) 5896.00 ( 5.39%)
>> Lat 50.0th-qrtle-16 9.00 ( 0.00%) 9.00 ( 0.00%)
>> Lat 90.0th-qrtle-16 13.00 ( 0.00%) 13.00 ( 0.00%)
>> Lat 99.0th-qrtle-16 17.00 ( 0.00%) 18.00 ( -5.88%)
>> Lat 99.9th-qrtle-16 23.00 ( 0.00%) 26.00 ( -13.04%)
>> Lat 20.0th-qrtle-16 10096.00 ( 0.00%) 10352.00 ( -2.54%)
>> Lat 50.0th-qrtle-32 15.00 ( 0.00%) 15.00 ( 0.00%)
>> Lat 90.0th-qrtle-32 25.00 ( 0.00%) 26.00 ( -4.00%)
>> Lat 99.0th-qrtle-32 49.00 ( 0.00%) 50.00 ( -2.04%)
>> Lat 99.9th-qrtle-32 945.00 ( 0.00%) 1005.00 ( -6.35%)
>> Lat 20.0th-qrtle-32 11600.00 ( 0.00%) 11632.00 ( -0.28%)
>>
>> Netperf/Tbench have not been tested yet, as they are single-process
>> benchmarks that are not the target of this cache-aware scheduling.
>> Additionally, client and server components should be tested on
>> different machines or bound to different nodes. Otherwise,
>> cache-aware scheduling might harm their performance: placing client
>> and server in the same LLC could yield higher throughput due to
>> improved cache locality in the TCP/IP stack, whereas cache-aware
>> scheduling aims to place them in dedicated LLCs.
>
> I have similar observation from my testing.
>
Prateek, thanks for your test.
> tl;dr
>
> o Benchmark that prefer co-location and run in threaded mode see
> a benefit including hackbench at high utilization and schbench
> at low utilization.
>
Previously, we tested hackbench with one group using different
numbers of fd pairs. The number of fd pairs (1-6) was lower than
the number of CPUs (8) within one CCX. If I understand correctly,
the default number of fd pairs in hackbench is 20. We might need
to handle cases where the number of threads (nr_thread) exceeds
the number of CPUs per LLC, perhaps by skipping task aggregation
in such scenarios.
> o schbench (both new and old but particularly the old) regresses
> quite a bit on the tail latency metric when #workers crosses the
> LLC size.
>
As mentioned above, reconsidering nr_thread vs. nr_cpus_per_llc
could mitigate the issue. Besides, introducing a rate limit
for cache-aware aggregation might help.
> o client-server benchmarks where client and servers are threads
> from different processes (netserver-netperf, tbench_srv-tbench,
> services of DeathStarBench) seem to noticeably regress due to
> lack of co-location between the communicating client and server.
>
> Not sure if WF_SYNC can be an indicator to temporarily ignore
> the preferred LLC hint.
WF_SYNC is used in the wakeup path, while the current v3 version
does task aggregation in the load-balance path. We'll look into this
C/S scenario.
>
> o stream regresses in some runs where the occupancy metrics trip
> and assign a preferred LLC for all the stream threads bringing
> down performance in !50% of the runs.
>
May I know if you tested stream with mmtests in OMP mode, and
what stream-10 and stream-100 mean? Stream is an example where
every thread has its own private memory buffer and no interaction
with the others. For this benchmark, spreading the threads across
different nodes yields higher memory bandwidth, because stream
allocates buffers at least 4X the size of the L3 cache. We lack a
metric that can indicate when threads share a lot of data (e.g.,
both thread 1 and thread 2 read from the same buffer). In such
cases we should aggregate the threads; otherwise, we should not
(as in the stream case). On the other hand, stream-omp seems like
an unrealistic scenario: if threads do not share buffers, why
create them in the same process?
> Full data from my testing is as follows:
>
> o Machine details
>
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
>
> o Kernel details
>
> tip: tip:sched/core at commit 914873bc7df9 ("Merge tag
> 'x86-build-2025-05-25' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
>
> llc-aware-lb-v3: tip + this series as is
>
> o Benchmark results
>
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
> 1-groups 1.00 [ -0.00](13.74) 1.03 [ -2.77](12.01)
> 2-groups 1.00 [ -0.00]( 9.58) 1.02 [ -1.78]( 6.12)
> 4-groups 1.00 [ -0.00]( 2.10) 1.01 [ -0.87]( 0.91)
> 8-groups 1.00 [ -0.00]( 1.51) 1.03 [ -3.31]( 2.06)
> 16-groups 1.00 [ -0.00]( 1.10) 0.95 [ 5.36]( 1.67)
>
>
> ==================================================================
> Test : tbench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
> 1 1.00 [ 0.00]( 0.82) 0.96 [ -3.68]( 1.23)
> 2 1.00 [ 0.00]( 1.13) 0.98 [ -2.30]( 0.51)
> 4 1.00 [ 0.00]( 1.12) 0.96 [ -4.14]( 0.22)
> 8 1.00 [ 0.00]( 0.93) 0.96 [ -3.61]( 0.46)
> 16 1.00 [ 0.00]( 0.38) 0.95 [ -4.98]( 1.26)
> 32 1.00 [ 0.00]( 0.66) 0.93 [ -7.12]( 2.22)
> 64 1.00 [ 0.00]( 1.18) 0.95 [ -5.44]( 0.37)
> 128 1.00 [ 0.00]( 1.12) 0.93 [ -6.78]( 0.64)
> 256 1.00 [ 0.00]( 0.42) 0.94 [ -6.45]( 0.47)
> 512 1.00 [ 0.00]( 0.14) 0.93 [ -7.26]( 0.27)
> 1024 1.00 [ 0.00]( 0.26) 0.92 [ -7.57]( 0.31)
>
>
> ==================================================================
> Test : stream-10
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
> Copy 1.00 [ 0.00]( 8.37) 0.39 [-61.05](44.88)
> Scale 1.00 [ 0.00]( 2.85) 0.43 [-57.26](40.60)
> Add 1.00 [ 0.00]( 3.39) 0.40 [-59.88](42.02)
> Triad 1.00 [ 0.00]( 6.39) 0.41 [-58.93](42.98)
>
>
> ==================================================================
> Test : stream-100
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
> Copy 1.00 [ 0.00]( 3.91) 0.36 [-63.95](51.04)
> Scale 1.00 [ 0.00]( 4.34) 0.40 [-60.31](43.12)
> Add 1.00 [ 0.00]( 4.14) 0.38 [-62.46](43.40)
> Triad 1.00 [ 0.00]( 1.00) 0.36 [-64.38](43.12)
>
>
> ==================================================================
> Test : netperf
> Units : Normalized Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
> 1-clients 1.00 [ 0.00]( 0.41) 0.97 [ -3.26]( 1.30)
> 2-clients 1.00 [ 0.00]( 0.58) 0.96 [ -4.24]( 0.71)
> 4-clients 1.00 [ 0.00]( 0.35) 0.96 [ -4.19]( 0.67)
> 8-clients 1.00 [ 0.00]( 0.48) 0.95 [ -5.41]( 1.36)
> 16-clients 1.00 [ 0.00]( 0.66) 0.95 [ -5.31]( 0.93)
> 32-clients 1.00 [ 0.00]( 1.15) 0.94 [ -6.43]( 1.44)
> 64-clients 1.00 [ 0.00]( 1.38) 0.93 [ -7.14]( 1.63)
> 128-clients 1.00 [ 0.00]( 0.87) 0.89 [-10.62]( 0.78)
> 256-clients 1.00 [ 0.00]( 5.36) 0.92 [ -8.04]( 2.64)
> 512-clients 1.00 [ 0.00](54.39) 0.88 [-12.12](48.87)
>
>
> ==================================================================
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
> 1 1.00 [ -0.00]( 8.54) 0.54 [ 45.65](28.79)
> 2 1.00 [ -0.00]( 1.15) 0.56 [ 44.00]( 2.09)
> 4 1.00 [ -0.00](13.46) 0.67 [ 33.33](35.68)
> 8 1.00 [ -0.00]( 7.14) 0.63 [ 36.84]( 4.28)
> 16 1.00 [ -0.00]( 3.49) 1.05 [ -5.08]( 9.13)
> 32 1.00 [ -0.00]( 1.06) 32.04 [-3104.26](81.31)
> 64 1.00 [ -0.00]( 5.48) 24.51 [-2351.16](81.18)
> 128 1.00 [ -0.00](10.45) 14.56 [-1356.07]( 5.35)
> 256 1.00 [ -0.00](31.14) 0.95 [ 4.80](20.88)
> 512 1.00 [ -0.00]( 1.52) 1.00 [ -0.25]( 1.26)
>
>
> ==================================================================
> Test : new-schbench-requests-per-second
> Units : Normalized Requests per second
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
> 1 1.00 [ 0.00]( 1.07) 0.97 [ -3.24]( 0.98)
> 2 1.00 [ 0.00]( 0.00) 0.99 [ -1.17]( 0.15)
> 4 1.00 [ 0.00]( 0.00) 0.96 [ -3.50]( 0.56)
> 8 1.00 [ 0.00]( 0.15) 0.98 [ -1.76]( 0.31)
> 16 1.00 [ 0.00]( 0.00) 0.94 [ -6.13]( 1.93)
> 32 1.00 [ 0.00]( 3.41) 0.97 [ -3.18]( 2.10)
> 64 1.00 [ 0.00]( 1.05) 0.82 [-18.14](18.41)
> 128 1.00 [ 0.00]( 0.00) 0.98 [ -2.27]( 0.20)
> 256 1.00 [ 0.00]( 0.72) 1.01 [ 1.23]( 0.31)
> 512 1.00 [ 0.00]( 0.57) 1.00 [ 0.00]( 0.12)
>
>
> ==================================================================
> Test : new-schbench-wakeup-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
> 1 1.00 [ -0.00]( 9.11) 0.88 [ 12.50](11.92)
> 2 1.00 [ -0.00]( 0.00) 0.86 [ 14.29](11.92)
> 4 1.00 [ -0.00]( 3.78) 0.93 [ 7.14]( 4.08)
> 8 1.00 [ -0.00]( 0.00) 0.83 [ 16.67]( 5.34)
> 16 1.00 [ -0.00]( 7.56) 0.85 [ 15.38]( 0.00)
> 32 1.00 [ -0.00](15.11) 0.80 [ 20.00]( 4.19)
> 64 1.00 [ -0.00]( 9.63) 1.05 [ -5.00](24.47)
> 128 1.00 [ -0.00]( 4.86) 1.57 [-56.78](68.52)
> 256 1.00 [ -0.00]( 2.34) 1.00 [ -0.00]( 0.57)
> 512 1.00 [ -0.00]( 0.40) 1.00 [ -0.00]( 0.34)
>
>
> ==================================================================
> Test : new-schbench-request-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
> 1 1.00 [ -0.00]( 2.73) 1.06 [ -5.71]( 0.25)
> 2 1.00 [ -0.00]( 0.87) 1.08 [ -8.37]( 0.78)
> 4 1.00 [ -0.00]( 1.21) 1.09 [ -9.15]( 0.79)
> 8 1.00 [ -0.00]( 0.27) 1.06 [ -6.31]( 0.51)
> 16 1.00 [ -0.00]( 4.04) 1.85 [-84.55]( 5.11)
> 32 1.00 [ -0.00]( 7.35) 1.52 [-52.16]( 0.83)
> 64 1.00 [ -0.00]( 3.54) 1.06 [ -5.77]( 2.62)
> 128 1.00 [ -0.00]( 0.37) 1.09 [ -9.18](28.47)
> 256 1.00 [ -0.00]( 9.57) 0.99 [ 0.60]( 0.48)
> 512 1.00 [ -0.00]( 1.82) 1.03 [ -2.80]( 1.16)
>
>
> ==================================================================
> Test : Various longer running benchmarks
> Units : %diff in throughput reported
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> Benchmarks: %diff
> ycsb-cassandra -0.99%
> ycsb-mongodb -0.96%
> deathstarbench-1x -2.09%
> deathstarbench-2x -0.26%
> deathstarbench-3x -3.34%
> deathstarbench-6x -3.03%
> hammerdb+mysql 16VU -2.15%
> hammerdb+mysql 64VU -3.77%
>
>>
>> This patch set is applied on v6.15 kernel.
>> There are some further work needed for future versions in this
>> patch set. We will need to align NUMA balancing with LLC aggregations
>> such that LLC aggregation will align with the preferred NUMA node.
>>
>> Comments and tests are much appreciated.
>
> I'll rerun the test once with the SCHED_FEAT() disabled just to make
> sure I'm not regressing because of some other factors. For the major
> regressions, I'll get the "perf sched stats" data to see if anything
> stands out.
It seems that tasks migrating and bouncing between their preferred
and non-preferred LLCs is one symptom that caused the regression.
thanks,
Chenyu
>
> I'm also planning on getting the data from a Zen5c system with larger
> LLC to see if there is any difference in the trend (I'll start with the
> microbenchmarks since setting up the larger ones will take some time)
>
> Sorry for the lack of engagement on previous versions but I plan on
> taking a better look at the series this time around. If you need any
> specific data from my setup, please do let me know.
>