Message-ID: <4cde5b36-4ef3-4dc8-a540-99287d621c7f@amd.com>
Date: Tue, 24 Jun 2025 10:30:44 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>,
Abel Wu <wuyun.abel@...edance.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Len Brown <len.brown@...el.com>,
linux-kernel@...r.kernel.org, Chen Yu <yu.c.chen@...el.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
Hello Tim,
On 6/18/2025 11:57 PM, Tim Chen wrote:
> AMD Milan is also tested. There are 4 Nodes and 32 CPUs per node.
> Each node has 4 CCX(shared LLC) and each CCX has 8 CPUs. Hackbench
> with 1 group test scenario benefits from cache aware load balance
> too:
>
> hackbench (1 group and fd ranges in [1,6]):
> case load baseline(std%) compare%( std%)
> threads-pipe-1 1-groups 1.00 ( 1.22) +2.84 ( 0.51)
> threads-pipe-2 1-groups 1.00 ( 5.82) +42.82 ( 43.61)
> threads-pipe-3 1-groups 1.00 ( 3.49) +17.33 ( 18.68)
> threads-pipe-4 1-groups 1.00 ( 2.49) +12.49 ( 5.89)
> threads-pipe-5 1-groups 1.00 ( 1.46) +8.62 ( 4.43)
> threads-pipe-6 1-groups 1.00 ( 2.83) +12.73 ( 8.94)
> threads-sockets-1 1-groups 1.00 ( 1.31) +28.68 ( 2.25)
> threads-sockets-2 1-groups 1.00 ( 5.17) +34.84 ( 36.90)
> threads-sockets-3 1-groups 1.00 ( 1.57) +9.15 ( 5.52)
> threads-sockets-4 1-groups 1.00 ( 1.99) +16.51 ( 6.04)
> threads-sockets-5 1-groups 1.00 ( 2.39) +10.88 ( 2.17)
> threads-sockets-6 1-groups 1.00 ( 1.62) +7.22 ( 2.00)
>
> Besides a single instance of hackbench, four instances of hackbench are
> also tested on Milan. The test results show that different instances of
> hackbench are aggregated to dedicated LLCs, and performance improvement
> is observed.
>
> schbench mmtests(unstable)
> baseline nowake_lb
> Lat 50.0th-qrtle-1 9.00 ( 0.00%) 8.00 ( 11.11%)
> Lat 90.0th-qrtle-1 12.00 ( 0.00%) 10.00 ( 16.67%)
> Lat 99.0th-qrtle-1 16.00 ( 0.00%) 14.00 ( 12.50%)
> Lat 99.9th-qrtle-1 22.00 ( 0.00%) 21.00 ( 4.55%)
> Lat 20.0th-qrtle-1 759.00 ( 0.00%) 759.00 ( 0.00%)
> Lat 50.0th-qrtle-2 9.00 ( 0.00%) 7.00 ( 22.22%)
> Lat 90.0th-qrtle-2 12.00 ( 0.00%) 12.00 ( 0.00%)
> Lat 99.0th-qrtle-2 16.00 ( 0.00%) 15.00 ( 6.25%)
> Lat 99.9th-qrtle-2 22.00 ( 0.00%) 21.00 ( 4.55%)
> Lat 20.0th-qrtle-2 1534.00 ( 0.00%) 1510.00 ( 1.56%)
> Lat 50.0th-qrtle-4 8.00 ( 0.00%) 9.00 ( -12.50%)
> Lat 90.0th-qrtle-4 12.00 ( 0.00%) 12.00 ( 0.00%)
> Lat 99.0th-qrtle-4 15.00 ( 0.00%) 16.00 ( -6.67%)
> Lat 99.9th-qrtle-4 21.00 ( 0.00%) 23.00 ( -9.52%)
> Lat 20.0th-qrtle-4 3076.00 ( 0.00%) 2860.00 ( 7.02%)
> Lat 50.0th-qrtle-8 10.00 ( 0.00%) 9.00 ( 10.00%)
> Lat 90.0th-qrtle-8 12.00 ( 0.00%) 13.00 ( -8.33%)
> Lat 99.0th-qrtle-8 17.00 ( 0.00%) 17.00 ( 0.00%)
> Lat 99.9th-qrtle-8 22.00 ( 0.00%) 24.00 ( -9.09%)
> Lat 20.0th-qrtle-8 6232.00 ( 0.00%) 5896.00 ( 5.39%)
> Lat 50.0th-qrtle-16 9.00 ( 0.00%) 9.00 ( 0.00%)
> Lat 90.0th-qrtle-16 13.00 ( 0.00%) 13.00 ( 0.00%)
> Lat 99.0th-qrtle-16 17.00 ( 0.00%) 18.00 ( -5.88%)
> Lat 99.9th-qrtle-16 23.00 ( 0.00%) 26.00 ( -13.04%)
> Lat 20.0th-qrtle-16 10096.00 ( 0.00%) 10352.00 ( -2.54%)
> Lat 50.0th-qrtle-32 15.00 ( 0.00%) 15.00 ( 0.00%)
> Lat 90.0th-qrtle-32 25.00 ( 0.00%) 26.00 ( -4.00%)
> Lat 99.0th-qrtle-32 49.00 ( 0.00%) 50.00 ( -2.04%)
> Lat 99.9th-qrtle-32 945.00 ( 0.00%) 1005.00 ( -6.35%)
> Lat 20.0th-qrtle-32 11600.00 ( 0.00%) 11632.00 ( -0.28%)
>
> Netperf/Tbench have not been tested yet, as they are single-process
> benchmarks that are not the target of this cache-aware scheduling.
> Additionally, client and server components should be tested on
> different machines or bound to different nodes. Otherwise,
> cache-aware scheduling might harm their performance: placing client
> and server in the same LLC could yield higher throughput due to
> improved cache locality in the TCP/IP stack, whereas cache-aware
> scheduling aims to place them in dedicated LLCs.
I have similar observations from my testing.
tl;dr
o Benchmarks that prefer co-location and run in threaded mode see
a benefit, including hackbench at high utilization and schbench
at low utilization.
o schbench (both new and old, but particularly the old) regresses
quite a bit on the tail latency metric when #workers crosses the
LLC size.
o client-server benchmarks where client and servers are threads
from different processes (netserver-netperf, tbench_srv-tbench,
services of DeathStarBench) seem to noticeably regress due to
lack of co-location between the communicating client and server.
Not sure if WF_SYNC can be an indicator to temporarily ignore
the preferred LLC hint.
o stream regresses in some runs where the occupancy metrics trip
and assign a preferred LLC for all the stream threads, bringing
down performance in ~50% of the runs.
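To illustrate the WF_SYNC idea above: on a synchronous wakeup the waker is
about to block (typical of the communicating client/server pairs that
regress here), so co-locating the wakee with the waker may matter more than
honoring the preferred-LLC hint. The sketch below is purely illustrative
userspace code, not kernel code; the function and parameter names are made
up and only WF_SYNC mirrors the flag in kernel/sched/sched.h.

```c
#include <assert.h>

/* Illustrative copy of the kernel's synchronous-wakeup flag. */
#define WF_SYNC 0x10

/*
 * Hypothetical placement decision: on a WF_SYNC wakeup, follow the
 * waker's LLC so the communicating pair shares a cache; otherwise
 * honor the cache-aware preferred-LLC hint. Not kernel API.
 */
static int pick_llc(int wake_flags, int waker_llc, int preferred_llc)
{
	if (wake_flags & WF_SYNC)
		return waker_llc;	/* temporarily ignore the hint */
	return preferred_llc;		/* normal cache-aware placement */
}
```

Whether this helps in practice would depend on how often WF_SYNC wakeups
actually precede the waker blocking for these workloads.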
Full data from my testing is as follows:
o Machine details
- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)
o Kernel details
tip: tip:sched/core at commit 914873bc7df9 ("Merge tag
'x86-build-2025-05-25' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
llc-aware-lb-v3: tip + this series as is
o Benchmark results
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1-groups 1.00 [ -0.00](13.74) 1.03 [ -2.77](12.01)
2-groups 1.00 [ -0.00]( 9.58) 1.02 [ -1.78]( 6.12)
4-groups 1.00 [ -0.00]( 2.10) 1.01 [ -0.87]( 0.91)
8-groups 1.00 [ -0.00]( 1.51) 1.03 [ -3.31]( 2.06)
16-groups 1.00 [ -0.00]( 1.10) 0.95 [ 5.36]( 1.67)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1 1.00 [ 0.00]( 0.82) 0.96 [ -3.68]( 1.23)
2 1.00 [ 0.00]( 1.13) 0.98 [ -2.30]( 0.51)
4 1.00 [ 0.00]( 1.12) 0.96 [ -4.14]( 0.22)
8 1.00 [ 0.00]( 0.93) 0.96 [ -3.61]( 0.46)
16 1.00 [ 0.00]( 0.38) 0.95 [ -4.98]( 1.26)
32 1.00 [ 0.00]( 0.66) 0.93 [ -7.12]( 2.22)
64 1.00 [ 0.00]( 1.18) 0.95 [ -5.44]( 0.37)
128 1.00 [ 0.00]( 1.12) 0.93 [ -6.78]( 0.64)
256 1.00 [ 0.00]( 0.42) 0.94 [ -6.45]( 0.47)
512 1.00 [ 0.00]( 0.14) 0.93 [ -7.26]( 0.27)
1024 1.00 [ 0.00]( 0.26) 0.92 [ -7.57]( 0.31)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
Copy 1.00 [ 0.00]( 8.37) 0.39 [-61.05](44.88)
Scale 1.00 [ 0.00]( 2.85) 0.43 [-57.26](40.60)
Add 1.00 [ 0.00]( 3.39) 0.40 [-59.88](42.02)
Triad 1.00 [ 0.00]( 6.39) 0.41 [-58.93](42.98)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
Copy 1.00 [ 0.00]( 3.91) 0.36 [-63.95](51.04)
Scale 1.00 [ 0.00]( 4.34) 0.40 [-60.31](43.12)
Add 1.00 [ 0.00]( 4.14) 0.38 [-62.46](43.40)
Triad 1.00 [ 0.00]( 1.00) 0.36 [-64.38](43.12)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.41) 0.97 [ -3.26]( 1.30)
2-clients 1.00 [ 0.00]( 0.58) 0.96 [ -4.24]( 0.71)
4-clients 1.00 [ 0.00]( 0.35) 0.96 [ -4.19]( 0.67)
8-clients 1.00 [ 0.00]( 0.48) 0.95 [ -5.41]( 1.36)
16-clients 1.00 [ 0.00]( 0.66) 0.95 [ -5.31]( 0.93)
32-clients 1.00 [ 0.00]( 1.15) 0.94 [ -6.43]( 1.44)
64-clients 1.00 [ 0.00]( 1.38) 0.93 [ -7.14]( 1.63)
128-clients 1.00 [ 0.00]( 0.87) 0.89 [-10.62]( 0.78)
256-clients 1.00 [ 0.00]( 5.36) 0.92 [ -8.04]( 2.64)
512-clients 1.00 [ 0.00](54.39) 0.88 [-12.12](48.87)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1 1.00 [ -0.00]( 8.54) 0.54 [ 45.65](28.79)
2 1.00 [ -0.00]( 1.15) 0.56 [ 44.00]( 2.09)
4 1.00 [ -0.00](13.46) 0.67 [ 33.33](35.68)
8 1.00 [ -0.00]( 7.14) 0.63 [ 36.84]( 4.28)
16 1.00 [ -0.00]( 3.49) 1.05 [ -5.08]( 9.13)
32 1.00 [ -0.00]( 1.06) 32.04 [-3104.26](81.31)
64 1.00 [ -0.00]( 5.48) 24.51 [-2351.16](81.18)
128 1.00 [ -0.00](10.45) 14.56 [-1356.07]( 5.35)
256 1.00 [ -0.00](31.14) 0.95 [ 4.80](20.88)
512 1.00 [ -0.00]( 1.52) 1.00 [ -0.25]( 1.26)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1 1.00 [ 0.00]( 1.07) 0.97 [ -3.24]( 0.98)
2 1.00 [ 0.00]( 0.00) 0.99 [ -1.17]( 0.15)
4 1.00 [ 0.00]( 0.00) 0.96 [ -3.50]( 0.56)
8 1.00 [ 0.00]( 0.15) 0.98 [ -1.76]( 0.31)
16 1.00 [ 0.00]( 0.00) 0.94 [ -6.13]( 1.93)
32 1.00 [ 0.00]( 3.41) 0.97 [ -3.18]( 2.10)
64 1.00 [ 0.00]( 1.05) 0.82 [-18.14](18.41)
128 1.00 [ 0.00]( 0.00) 0.98 [ -2.27]( 0.20)
256 1.00 [ 0.00]( 0.72) 1.01 [ 1.23]( 0.31)
512 1.00 [ 0.00]( 0.57) 1.00 [ 0.00]( 0.12)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1 1.00 [ -0.00]( 9.11) 0.88 [ 12.50](11.92)
2 1.00 [ -0.00]( 0.00) 0.86 [ 14.29](11.92)
4 1.00 [ -0.00]( 3.78) 0.93 [ 7.14]( 4.08)
8 1.00 [ -0.00]( 0.00) 0.83 [ 16.67]( 5.34)
16 1.00 [ -0.00]( 7.56) 0.85 [ 15.38]( 0.00)
32 1.00 [ -0.00](15.11) 0.80 [ 20.00]( 4.19)
64 1.00 [ -0.00]( 9.63) 1.05 [ -5.00](24.47)
128 1.00 [ -0.00]( 4.86) 1.57 [-56.78](68.52)
256 1.00 [ -0.00]( 2.34) 1.00 [ -0.00]( 0.57)
512 1.00 [ -0.00]( 0.40) 1.00 [ -0.00]( 0.34)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) llc-aware-lb-v3[pct imp](CV)
1 1.00 [ -0.00]( 2.73) 1.06 [ -5.71]( 0.25)
2 1.00 [ -0.00]( 0.87) 1.08 [ -8.37]( 0.78)
4 1.00 [ -0.00]( 1.21) 1.09 [ -9.15]( 0.79)
8 1.00 [ -0.00]( 0.27) 1.06 [ -6.31]( 0.51)
16 1.00 [ -0.00]( 4.04) 1.85 [-84.55]( 5.11)
32 1.00 [ -0.00]( 7.35) 1.52 [-52.16]( 0.83)
64 1.00 [ -0.00]( 3.54) 1.06 [ -5.77]( 2.62)
128 1.00 [ -0.00]( 0.37) 1.09 [ -9.18](28.47)
256 1.00 [ -0.00]( 9.57) 0.99 [ 0.60]( 0.48)
512 1.00 [ -0.00]( 1.82) 1.03 [ -2.80]( 1.16)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra -0.99%
ycsb-mongodb -0.96%
deathstarbench-1x -2.09%
deathstarbench-2x -0.26%
deathstarbench-3x -3.34%
deathstarbench-6x -3.03%
hammerdb+mysql 16VU -2.15%
hammerdb+mysql 64VU -3.77%
>
> This patch set is applied on v6.15 kernel.
>
> There are some further work needed for future versions in this
> patch set. We will need to align NUMA balancing with LLC aggregations
> such that LLC aggregation will align with the preferred NUMA node.
>
> Comments and tests are much appreciated.
I'll rerun the test once with the SCHED_FEAT() disabled just to make
sure I'm not regressing because of some other factors. For the major
regressions, I'll get the "perf sched stats" data to see if anything
stands out.
I'm also planning on getting the data from a Zen5c system with a larger
LLC to see if there is any difference in the trend (I'll start with the
microbenchmarks since setting up the larger ones will take some time).
Sorry for the lack of engagement on previous versions, but I plan on
taking a better look at the series this time around. If you need any
specific data from my setup, please do let me know.
--
Thanks and Regards,
Prateek