Message-ID: <17d0e2be-6130-496f-9a80-49a67a749d01@amd.com>
Date: Wed, 26 Mar 2025 11:48:11 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>, "Chen, Yu C" <yu.c.chen@...el.com>
CC: <juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
	<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
	<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
	<tim.c.chen@...ux.intel.com>, <tglx@...utronix.de>, <len.brown@...el.com>,
	<gautham.shenoy@....com>, <mingo@...nel.org>, <yu.chen.surf@...mail.com>
Subject: Re: [RFC][PATCH] sched: Cache aware load-balancing

Hello Peter, Chenyu,

On 3/26/2025 12:14 AM, Peter Zijlstra wrote:
> On Tue, Mar 25, 2025 at 11:19:52PM +0800, Chen, Yu C wrote:
>>
>> Hi Peter,
>>
>> Thanks for sending this out,
>>
>> On 3/25/2025 8:09 PM, Peter Zijlstra wrote:
>>> Hi all,
>>>
>>> One of the many things on the eternal todo list has been finishing the
>>> below hackery.
>>>
>>> It is an attempt at modelling cache affinity -- and while the patch
>>> really only targets LLC, it could very well be extended to also apply to
>>> clusters (L2). Specifically any case of multiple cache domains inside a
>>> node.
>>>
>>> Anyway, I wrote this about a year ago, and I mentioned this at the
>>> recent OSPM conf where Gautham and Prateek expressed interest in playing
>>> with this code.
>>>
>>> So here goes, very rough and largely unproven code ahead :-)
>>>
>>> It applies to current tip/master, but I know it will fail the __percpu
>>> validation that sits in -next, although that shouldn't be terribly hard
>>> to fix up.
>>>
>>> As is, it only computes a CPU inside the LLC that has the highest recent
>>> runtime, this CPU is then used in the wake-up path to steer towards this
>>> LLC and in task_hot() to limit migrations away from it.
>>>
>>> More elaborate things could be done, notably there is an XXX in there
>>> somewhere about finding the best LLC inside a NODE (interaction with
>>> NUMA_BALANCING).
>>>
>>
>> Besides the control provided by CONFIG_SCHED_CACHE, could we also
>> introduce sched_feat(SCHED_CACHE) to manage this feature, facilitating
>> dynamic adjustments? Similarly, we could introduce other sched_feats for
>> load balancing and NUMA balancing for fine-grained control.
> 
> We can do all sorts, but the very first thing is determining if this is
> worth it at all. Because if we can't make this work at all, all those
> things are a waste of time.
> 
> This patch is not meant to be merged, it is meant for testing and
> development. We need to first make it actually improve workloads. If it
> then turns out it regresses workloads (likely, things always do), then
> we can look at how to best do that.
> 

Thank you for sharing the patch, and thanks to Chenyu for the initial
review pointing out issues that need fixing. I'll try to take a good look
at it this week and see if I can improve some of the trivial benchmarks
that currently regress with the RFC as is.

In its current form, I think this suffers from the same problem as
SIS_NODE, where wakeups are redirected to the same set of CPUs and a good
deal of additional work is done without any benefit.

I'll leave the results from my initial testing on the 3rd Generation
EPYC platform below and will evaluate what is making the benchmarks
unhappy. I'll return with more data when some of these benchmarks
are not as unhappy as they are now.

Thank you both for the RFC and the initial feedback. Following are
the initial results for the RFC as is:

   ==================================================================
   Test          : hackbench
   Units         : Normalized time in seconds
   Interpretation: Lower is better
   Statistic     : AMean
   ==================================================================
   Case:           tip[pct imp](CV)      sched_cache[pct imp](CV)
    1-groups     1.00 [ -0.00](10.12)     1.01 [ -0.89]( 2.84)
    2-groups     1.00 [ -0.00]( 6.92)     1.83 [-83.15]( 1.61)
    4-groups     1.00 [ -0.00]( 3.14)     3.00 [-200.21]( 3.13)
    8-groups     1.00 [ -0.00]( 1.35)     3.44 [-243.75]( 2.20)
   16-groups     1.00 [ -0.00]( 1.32)     2.59 [-158.98]( 4.29)


   ==================================================================
   Test          : tbench
   Units         : Normalized throughput
   Interpretation: Higher is better
   Statistic     : AMean
   ==================================================================
   Clients:    tip[pct imp](CV)     sched_cache[pct imp](CV)
       1     1.00 [  0.00]( 0.43)     0.96 [ -3.54]( 0.56)
       2     1.00 [  0.00]( 0.58)     0.99 [ -1.32]( 1.40)
       4     1.00 [  0.00]( 0.54)     0.98 [ -2.34]( 0.78)
       8     1.00 [  0.00]( 0.49)     0.96 [ -3.91]( 0.54)
      16     1.00 [  0.00]( 1.06)     0.97 [ -3.22]( 1.82)
      32     1.00 [  0.00]( 1.27)     0.95 [ -4.74]( 2.05)
      64     1.00 [  0.00]( 1.54)     0.93 [ -6.65]( 0.63)
     128     1.00 [  0.00]( 0.38)     0.93 [ -6.91]( 1.18)
     256     1.00 [  0.00]( 1.85)     0.99 [ -0.50]( 1.34)
     512     1.00 [  0.00]( 0.31)     0.98 [ -2.47]( 0.14)
    1024     1.00 [  0.00]( 0.19)     0.97 [ -3.06]( 0.39)


   ==================================================================
   Test          : stream-10
   Units         : Normalized Bandwidth, MB/s
   Interpretation: Higher is better
   Statistic     : HMean
   ==================================================================
   Test:       tip[pct imp](CV)     sched_cache[pct imp](CV)
    Copy     1.00 [  0.00](11.31)     0.34 [-65.89](72.77)
   Scale     1.00 [  0.00]( 6.62)     0.32 [-68.09](72.49)
     Add     1.00 [  0.00]( 7.06)     0.34 [-65.56](70.56)
   Triad     1.00 [  0.00]( 8.91)     0.34 [-66.47](72.70)


   ==================================================================
   Test          : stream-100
   Units         : Normalized Bandwidth, MB/s
   Interpretation: Higher is better
   Statistic     : HMean
   ==================================================================
   Test:       tip[pct imp](CV)     sched_cache[pct imp](CV)
    Copy     1.00 [  0.00]( 2.01)     0.83 [-16.96](24.55)
   Scale     1.00 [  0.00]( 1.49)     0.79 [-21.40](24.10)
     Add     1.00 [  0.00]( 2.67)     0.79 [-21.33](25.39)
   Triad     1.00 [  0.00]( 2.19)     0.81 [-19.19](25.55)


   ==================================================================
   Test          : netperf
   Units         : Normalized Throughput
   Interpretation: Higher is better
   Statistic     : AMean
   ==================================================================
   Clients:         tip[pct imp](CV)     sched_cache[pct imp](CV)
    1-clients     1.00 [  0.00]( 1.43)     0.98 [ -2.22]( 0.26)
    2-clients     1.00 [  0.00]( 1.02)     0.97 [ -2.55]( 0.89)
    4-clients     1.00 [  0.00]( 0.83)     0.98 [ -2.27]( 0.46)
    8-clients     1.00 [  0.00]( 0.73)     0.98 [ -2.45]( 0.80)
   16-clients     1.00 [  0.00]( 0.97)     0.97 [ -2.90]( 0.88)
   32-clients     1.00 [  0.00]( 0.88)     0.95 [ -5.29]( 1.69)
   64-clients     1.00 [  0.00]( 1.49)     0.91 [ -8.70]( 1.95)
   128-clients    1.00 [  0.00]( 1.05)     0.92 [ -8.39]( 4.25)
   256-clients    1.00 [  0.00]( 3.85)     0.92 [ -8.33]( 2.45)
   512-clients    1.00 [  0.00](59.63)     0.92 [ -7.83](51.19)


   ==================================================================
   Test          : schbench
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)       sched_cache[pct imp](CV)
     1     1.00 [ -0.00]( 6.67)      0.38 [ 62.22]    ( 5.88)
     2     1.00 [ -0.00](10.18)      0.43 [ 56.52]    ( 2.94)
     4     1.00 [ -0.00]( 4.49)      0.60 [ 40.43]    ( 5.52)
     8     1.00 [ -0.00]( 6.68)    113.96 [-11296.23] (12.91)
    16     1.00 [ -0.00]( 1.87)    359.34 [-35834.43] (20.02)
    32     1.00 [ -0.00]( 4.01)    217.67 [-21667.03] ( 5.48)
    64     1.00 [ -0.00]( 3.21)     97.43 [-9643.02]  ( 4.61)
   128     1.00 [ -0.00](44.13)     41.36 [-4036.10]  ( 6.92)
   256     1.00 [ -0.00](14.46)      2.69 [-169.31]   ( 1.86)
   512     1.00 [ -0.00]( 1.95)      1.89 [-89.22]    ( 2.24)


   ==================================================================
   Test          : new-schbench-requests-per-second
   Units         : Normalized Requests per second
   Interpretation: Higher is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)      sched_cache[pct imp](CV)
     1     1.00 [  0.00]( 0.46)     0.96 [ -4.14]( 0.00)
     2     1.00 [  0.00]( 0.15)     0.95 [ -5.27]( 2.29)
     4     1.00 [  0.00]( 0.15)     0.88 [-12.01]( 0.46)
     8     1.00 [  0.00]( 0.15)     0.55 [-45.47]( 1.23)
    16     1.00 [  0.00]( 0.00)     0.54 [-45.62]( 0.50)
    32     1.00 [  0.00]( 3.40)     0.63 [-37.48]( 6.37)
    64     1.00 [  0.00]( 7.09)     0.67 [-32.73]( 0.59)
   128     1.00 [  0.00]( 0.00)     0.99 [ -0.76]( 0.34)
   256     1.00 [  0.00]( 1.12)     1.06 [  6.32]( 1.55)
   512     1.00 [  0.00]( 0.22)     1.06 [  6.08]( 0.92)


   ==================================================================
   Test          : new-schbench-wakeup-latency
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)       sched_cache[pct imp](CV)
     1     1.00 [ -0.00](19.72)     0.85  [ 15.38]    ( 8.13)
     2     1.00 [ -0.00](15.96)     1.09  [ -9.09]    (18.20)
     4     1.00 [ -0.00]( 3.87)     1.00  [ -0.00]    ( 0.00)
     8     1.00 [ -0.00]( 8.15)    118.17 [-11716.67] ( 0.58)
    16     1.00 [ -0.00]( 3.87)    146.62 [-14561.54] ( 4.64)
    32     1.00 [ -0.00](12.99)    141.60 [-14060.00] ( 5.64)
    64     1.00 [ -0.00]( 6.20)    78.62  [-7762.50]  ( 1.79)
   128     1.00 [ -0.00]( 0.96)    11.36  [-1036.08]  ( 3.41)
   256     1.00 [ -0.00]( 2.76)     1.11  [-11.22]    ( 3.28)
   512     1.00 [ -0.00]( 0.20)     1.21  [-20.81]    ( 0.91)


   ==================================================================
   Test          : new-schbench-request-latency
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)      sched_cache[pct imp](CV)
     1     1.00 [ -0.00]( 1.07)     1.11 [-10.66]  ( 2.76)
     2     1.00 [ -0.00]( 0.14)     1.20 [-20.40]  ( 1.73)
     4     1.00 [ -0.00]( 1.39)     2.04 [-104.20] ( 0.96)
     8     1.00 [ -0.00]( 0.36)     3.94 [-294.20] ( 2.85)
    16     1.00 [ -0.00]( 1.18)     4.56 [-356.16] ( 1.19)
    32     1.00 [ -0.00]( 8.42)     3.02 [-201.67] ( 8.93)
    64     1.00 [ -0.00]( 4.85)     1.51 [-51.38]  ( 0.80)
   128     1.00 [ -0.00]( 0.28)     1.83 [-82.77]  ( 1.21)
   256     1.00 [ -0.00](10.52)     1.43 [-43.11]  (10.67)
   512     1.00 [ -0.00]( 0.69)     1.25 [-24.96]  ( 6.24)


   ==================================================================
   Test          : Various longer running benchmarks
   Units         : %diff in throughput reported
   Interpretation: Higher is better
   Statistic     : Median
   ==================================================================
   Benchmarks:                 %diff
   ycsb-cassandra             -10.70%
   ycsb-mongodb               -13.66%

   deathstarbench-1x           13.87%
   deathstarbench-2x            1.70%
   deathstarbench-3x           -8.44%
   deathstarbench-6x           -3.12%

   hammerdb+mysql 16VU        -33.50%
   hammerdb+mysql 64VU        -33.22%

---

I'm planning on taking hackbench and schbench as the two extreme cases
for throughput and tail latency, and later looking at Stream from a "high
bandwidth, don't consolidate" standpoint. I hope that once those cases
aren't as deep in the red, the larger benchmarks will be happier too.

-- 
Thanks and Regards,
Prateek

