Message-ID:
<PUZPR04MB492296C8301DDA9654D7970CE37DA@PUZPR04MB4922.apcprd04.prod.outlook.com>
Date: Thu, 19 Jun 2025 14:08:42 +0800
From: Jianyong Wu <jianyong.wu@...look.com>
To: K Prateek Nayak <kprateek.nayak@....com>,
Jianyong Wu <wujianyong@...on.cn>, mingo@...hat.com, peterz@...radead.org,
juri.lelli@...hat.com, vincent.guittot@...aro.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
Hi Prateek,
Thank you for taking the time to test this patch.
This patch aims to reduce meaningless task migrations, such as those
seen in iperf tests; it was not written with performance as its primary
goal. In my iperf tests, I did not observe a significant performance
improvement, although the number of task migrations decreased
substantially. Even when I bound the iperf tasks to the same LLC, the
performance metrics did not improve noticeably. So this change is
unlikely to speed up iperf itself, which suggests that task migration
has minimal effect on iperf.
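
For anyone who wants to reproduce the observation, something along
these lines should work (illustrative commands only; the CPU list
assumes CPUs 0-7 share an LLC, so adjust it for your machine):

  # count task migrations system-wide while the test runs
  perf stat -e sched:sched_migrate_task -a -- sleep 30

  # bind both iperf ends to a single LLC, e.g. CPUs 0-7
  taskset -c 0-7 iperf3 -s &
  taskset -c 0-7 iperf3 -c 127.0.0.1 -t 30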

IMO, we should allow at least two tasks per LLC so that a communicating
pair of tasks can stay co-located. In theory this could yield better
performance, even though I haven't found a workload that demonstrates
it yet. See the sketch below for the shape of the check I have in mind.
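
To sketch the idea (illustrative only, not the patch as posted; the
name and threshold below are made up, mirroring the shape of the
existing adjust_numa_imbalance()/NUMA_IMBALANCE_MIN logic in
kernel/sched/fair.c):

  /* Tolerate a small imbalance between LLCs under a NUMA node so
   * that a communicating pair of tasks is not pulled apart. */
  #define LLC_IMBALANCE_MIN	2

  static long adjust_llc_imbalance(long imbalance)
  {
  	/* Up to two tasks per LLC: treat the groups as balanced. */
  	if (imbalance <= LLC_IMBALANCE_MIN)
  		return 0;
  	return imbalance;
  }

The snippet only shows the threshold idea; the actual hook point in
the load-balancing path is what the posted patch implements.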

If this change turns out to hurt performance, is there any suggestion
for mitigating the iperf migration issue? Or should we just leave it
as is?
Any suggestions would be greatly appreciated.
Thanks
Jianyong
On 6/18/2025 2:37 PM, K Prateek Nayak wrote:
> Hello Jianyong,
>
> On 6/16/2025 7:52 AM, Jianyong Wu wrote:
>> Would you mind letting me know if you've had a chance to try it out,
>> or if there's any update on the progress?
>
> Here are my results from a dual socket 3rd Generation EPYC
> system.
>
> tl;dr I don't see any improvement, and there are a few regressions
> too, but a few of those data points also have a lot of variance.
>
> o Machine details
>
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
>
> o Kernel details
>
> tip: tip:sched/core at commit 914873bc7df9 ("Merge tag
> 'x86-build-2025-05-25' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
>
> allow_imb: tip + this series as is
>
> o Benchmark results
>
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1-groups 1.00 [ -0.00](13.74) 1.03 [ -3.20]( 9.18)
> 2-groups 1.00 [ -0.00]( 9.58) 1.06 [ -6.46]( 7.63)
> 4-groups 1.00 [ -0.00]( 2.10) 1.01 [ -1.30]( 1.90)
> 8-groups 1.00 [ -0.00]( 1.51) 0.99 [ 1.42]( 0.91)
> 16-groups 1.00 [ -0.00]( 1.10) 0.99 [ 1.09]( 1.13)
>
>
> ==================================================================
> Test : tbench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1 1.00 [ 0.00]( 0.82) 1.01 [ 1.11]( 0.27)
> 2 1.00 [ 0.00]( 1.13) 1.00 [ -0.05]( 0.62)
> 4 1.00 [ 0.00]( 1.12) 1.02 [ 2.36]( 0.19)
> 8 1.00 [ 0.00]( 0.93) 1.01 [ 1.02]( 0.86)
> 16 1.00 [ 0.00]( 0.38) 1.01 [ 0.71]( 1.71)
> 32 1.00 [ 0.00]( 0.66) 1.01 [ 1.31]( 1.88)
> 64 1.00 [ 0.00]( 1.18) 0.98 [ -1.60]( 2.90)
> 128 1.00 [ 0.00]( 1.12) 1.02 [ 1.60]( 0.42)
> 256 1.00 [ 0.00]( 0.42) 1.00 [ 0.40]( 0.80)
> 512 1.00 [ 0.00]( 0.14) 1.01 [ 0.97]( 0.25)
> 1024 1.00 [ 0.00]( 0.26) 1.01 [ 1.29]( 0.19)
>
>
> ==================================================================
> Test : stream-10
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) allow_imb[pct imp](CV)
> Copy 1.00 [ 0.00]( 8.37) 1.01 [ 1.00]( 5.71)
> Scale 1.00 [ 0.00]( 2.85) 0.98 [ -1.94]( 5.23)
> Add 1.00 [ 0.00]( 3.39) 0.99 [ -1.39]( 4.77)
> Triad 1.00 [ 0.00]( 6.39) 1.05 [ 5.15]( 5.62)
>
>
> ==================================================================
> Test : stream-100
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) allow_imb[pct imp](CV)
> Copy 1.00 [ 0.00]( 3.91) 1.01 [ 1.28]( 2.01)
> Scale 1.00 [ 0.00]( 4.34) 0.99 [ -0.65]( 3.74)
> Add 1.00 [ 0.00]( 4.14) 1.01 [ 0.54]( 1.63)
> Triad 1.00 [ 0.00]( 1.00) 0.98 [ -2.28]( 4.89)
>
>
> ==================================================================
> Test : netperf
> Units : Normalized Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1-clients 1.00 [ 0.00]( 0.41) 1.01 [ 1.17]( 0.39)
> 2-clients 1.00 [ 0.00]( 0.58) 1.01 [ 1.00]( 0.40)
> 4-clients 1.00 [ 0.00]( 0.35) 1.01 [ 0.73]( 0.50)
> 8-clients 1.00 [ 0.00]( 0.48) 1.00 [ 0.42]( 0.67)
> 16-clients 1.00 [ 0.00]( 0.66) 1.01 [ 0.84]( 0.57)
> 32-clients 1.00 [ 0.00]( 1.15) 1.01 [ 0.82]( 0.96)
> 64-clients 1.00 [ 0.00]( 1.38) 1.00 [ -0.24]( 3.09)
> 128-clients 1.00 [ 0.00]( 0.87) 1.00 [ -0.16]( 1.02)
> 256-clients 1.00 [ 0.00]( 5.36) 1.01 [ 0.66]( 4.55)
> 512-clients 1.00 [ 0.00](54.39) 0.98 [ -1.59](57.35)
>
>
> ==================================================================
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1 1.00 [ -0.00]( 8.54) 1.04 [ -4.35]( 3.69)
> 2 1.00 [ -0.00]( 1.15) 0.96 [ 4.00]( 0.00)
> 4 1.00 [ -0.00](13.46) 1.02 [ -2.08]( 2.04)
> 8 1.00 [ -0.00]( 7.14) 0.82 [ 17.54]( 9.30)
> 16 1.00 [ -0.00]( 3.49) 1.05 [ -5.08]( 7.83)
> 32 1.00 [ -0.00]( 1.06) 1.01 [ -1.06]( 5.88)
> 64 1.00 [ -0.00]( 5.48) 1.05 [ -4.65]( 2.71)
> 128 1.00 [ -0.00](10.45) 1.09 [ -9.11](14.18)
> 256 1.00 [ -0.00](31.14) 1.05 [ -5.15]( 9.79)
> 512 1.00 [ -0.00]( 1.52) 0.96 [ 4.30]( 0.26)
>
>
> ==================================================================
> Test : new-schbench-requests-per-second
> Units : Normalized Requests per second
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1 1.00 [ 0.00]( 1.07) 1.00 [ 0.29]( 0.61)
> 2 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.26)
> 4 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.00)
> 8 1.00 [ 0.00]( 0.15) 1.00 [ 0.29]( 0.15)
> 16 1.00 [ 0.00]( 0.00) 1.00 [ 0.00]( 0.00)
> 32 1.00 [ 0.00]( 3.41) 0.97 [ -2.86]( 2.91)
> 64 1.00 [ 0.00]( 1.05) 0.97 [ -3.17]( 7.39)
> 128 1.00 [ 0.00]( 0.00) 1.00 [ -0.38]( 0.39)
> 256 1.00 [ 0.00]( 0.72) 1.01 [ 0.61]( 0.96)
> 512 1.00 [ 0.00]( 0.57) 1.01 [ 0.72]( 0.21)
>
>
> ==================================================================
> Test : new-schbench-wakeup-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1 1.00 [ -0.00]( 9.11) 0.69 [ 31.25]( 8.13)
> 2 1.00 [ -0.00]( 0.00) 0.93 [ 7.14]( 8.37)
> 4 1.00 [ -0.00]( 3.78) 1.07 [ -7.14](14.79)
> 8 1.00 [ -0.00]( 0.00) 1.08 [ -8.33]( 7.56)
> 16 1.00 [ -0.00]( 7.56) 1.08 [ -7.69](34.36)
> 32 1.00 [ -0.00](15.11) 1.00 [ -0.00](12.99)
> 64 1.00 [ -0.00]( 9.63) 0.80 [ 20.00](11.17)
> 128 1.00 [ -0.00]( 4.86) 0.98 [ 2.01](13.01)
> 256 1.00 [ -0.00]( 2.34) 1.01 [ -1.00]( 3.51)
> 512 1.00 [ -0.00]( 0.40) 1.00 [ 0.38]( 0.20)
>
>
> ==================================================================
> Test : new-schbench-request-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) allow_imb[pct imp](CV)
> 1 1.00 [ -0.00]( 2.73) 0.98 [ 2.08]( 3.51)
> 2 1.00 [ -0.00]( 0.87) 0.99 [ 0.54]( 3.29)
> 4 1.00 [ -0.00]( 1.21) 1.06 [ -5.92]( 0.82)
> 8 1.00 [ -0.00]( 0.27) 1.03 [ -3.15]( 1.86)
> 16 1.00 [ -0.00]( 4.04) 1.00 [ -0.27]( 2.27)
> 32 1.00 [ -0.00]( 7.35) 1.30 [-30.45](20.57)
> 64 1.00 [ -0.00]( 3.54) 1.01 [ -0.67]( 0.82)
> 128 1.00 [ -0.00]( 0.37) 1.00 [ 0.21]( 0.18)
> 256 1.00 [ -0.00]( 9.57) 0.99 [ 1.43]( 7.69)
> 512 1.00 [ -0.00]( 1.82) 1.02 [ -2.10]( 0.89)
>
>
> ==================================================================
> Test : Various longer running benchmarks
> Units : %diff in throughput reported
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> Benchmarks: %diff
> ycsb-cassandra 0.07%
> ycsb-mongodb -0.66%
>
> deathstarbench-1x 0.36%
> deathstarbench-2x 2.39%
> deathstarbench-3x -0.09%
> deathstarbench-6x 1.53%
>
> hammerdb+mysql 16VU -0.27%
> hammerdb+mysql 64VU -0.32%
>
> ---
>
> I cannot make a hard case for this optimization. You can perhaps
> share your iperf numbers if you are seeing significant
> improvements there.
>