Message-ID: <2225c999-8d06-40a9-9d55-76d2cfabacb8@intel.com>
Date: Tue, 29 Apr 2025 20:57:11 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>, Abel Wu
<wuyun.abel@...edance.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, <linux-kernel@...r.kernel.org>, "Peter
Zijlstra" <peterz@...radead.org>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Ingo Molnar <mingo@...hat.com>, Len Brown
<len.brown@...el.com>
Subject: Re: [RFC PATCH 0/5] sched: Introduce Cache aware scheduling
Hi Prateek,
On 4/29/2025 11:47 AM, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 4/21/2025 8:53 AM, Chen Yu wrote:
>> This is a respin of the cache-aware scheduling proposed by Peter[1].
>> In this patch set, some known issues in [1] were addressed, and the
>> performance regression was investigated and mitigated.
>>
>> Cache-aware scheduling aims to aggregate tasks that potentially share
>> resources into the same cache domain. This approach enhances cache
>> locality, thereby optimizing system performance by reducing cache
>> misses and improving data access efficiency.
>>
>> In the current implementation, threads within the same process are
>> considered entities that potentially share resources. Cache-aware
>> scheduling monitors the CPU occupancy of each cache domain for every
>> process. Based on this monitoring, it endeavors to migrate threads
>> within a given process to its cache-hot domains, with the goal of
>> maximizing cache locality.
>>
>> Patch 1 constitutes the fundamental cache-aware scheduling. It is
>> the same patch as [1].
>> Patch 2 comprises a series of fixes for Patch 1, including compile
>> warnings and functional fixes.
>> Patch 3 fixes the performance degradation that arises from excessive
>> task migrations within the preferred LLC domain.
>> Patch 4 further alleviates performance regressions when the preferred
>> LLC becomes saturated.
>> Patch 5 introduces ftrace events, which are used to track task
>> migrations triggered by wakeup and the load balancer. This addition
>> facilitates performance regression analysis.
>>
>> The patch set is applied on top of v6.14 sched/core,
>> commit 4ba7518327c6 ("sched/debug: Print the local group's
>> asym_prefer_cpu")
>>
>
> Thank you for working on this! I have been a bit preoccupied but I
> promise to look into the regressions I've reported below sometime
> this week and report back soon on what seems to make them unhappy.
>
Thanks for your time on this testing.
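
For readers following along, the mechanism under test can be sketched
roughly as below. This is a heavily simplified illustration with
made-up names (mm->llc_stats, mm->preferred_llc, mm_account_llc()),
not the actual Patch 1 code:

	/*
	 * Sketch: each mm keeps a decayed runtime sum per LLC; the LLC
	 * holding the largest share of the process's recent runtime
	 * becomes its preferred (cache-hot) LLC, and task placement
	 * then tries to pull the process's threads towards it.
	 */
	struct mm_llc_stats {
		u64	occupancy;	/* decayed runtime in this LLC */
	};

	static void mm_account_llc(struct mm_struct *mm, int cpu, u64 delta)
	{
		int llc = per_cpu(sd_llc_id, cpu);	/* LLC id of this CPU */
		struct mm_llc_stats *st = &mm->llc_stats[llc];

		/* decay the old occupancy, then add the fresh runtime */
		st->occupancy = (st->occupancy >> 1) + delta;

		if (st->occupancy >
		    mm->llc_stats[READ_ONCE(mm->preferred_llc)].occupancy)
			WRITE_ONCE(mm->preferred_llc, llc);
	}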
> tl;dr
>
> o Most regressions aren't as severe as v1 thanks to all the work
> from you and Abel.
>
> o I too see schbench regress in fully loaded cases but the old
> schbench tail latencies improve when #threads < #CPUs in LLC
>
> o There is a consistent regression in tbench - what I presume is
> happening there is that all threads of "tbench_srv" share an mm
> and all the tbench clients share an mm, but for best performance,
> the wakeups between client and server must be local (same core /
> same LLC); either the cost of the additional search builds up, or
> the clients get co-located as one set of entities and the servers
> get co-located as another set of entities, leading to mostly
> remote wakeups.
This is a good point. If A and B are both multi-threaded processes,
and A interacts with B frequently, we should not only consider
aggregating the threads within A and within B, but also consider
placing A and B together. I'm not sure if WF_SYNC is carried along
and takes effect during the tbench socket wakeup process. I'll also
try tbench/netperf testing.
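
Roughly, the kind of check I have in mind is below. This is purely a
sketch, not code from this series: cache_aware_target() and
mm_preferred_llc() are made-up names, and only the WF_SYNC handling
is shown:

	/*
	 * On a WF_SYNC wakeup the waker is about to sleep, so its LLC
	 * is likely cache-hot for the wakee even when waker and wakee
	 * belong to different processes (e.g. tbench client/server).
	 * Let that hint override the per-mm preferred LLC.
	 */
	static int cache_aware_target(struct task_struct *p, int prev_cpu,
				      int wake_flags)
	{
		if (wake_flags & WF_SYNC)
			return smp_processor_id();	/* stay near the waker */

		/* otherwise aggregate within p's own process */
		return mm_preferred_llc(p->mm, prev_cpu);
	}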
>
> Not too sure if netperf has a similar architecture to tbench, but
> that too sees a regression.
>
> o Longer running benchmarks see a regression. Still not sure if
> this is because of the additional search or something else.
>
> I'll leave the full results below:
>
> o Machine details
>
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
>
> o Benchmark results
>
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1-groups 1.00 [ -0.00]( 9.02) 1.03 [ -3.38](11.44)
> 2-groups 1.00 [ -0.00]( 6.86) 0.98 [ 2.20]( 6.61)
> 4-groups 1.00 [ -0.00]( 2.73) 1.00 [ 0.42]( 4.00)
> 8-groups 1.00 [ -0.00]( 1.21) 1.04 [ -4.00]( 5.59)
> 16-groups 1.00 [ -0.00]( 0.97) 1.01 [ -0.52]( 2.12)
>
>
> ==================================================================
> Test : tbench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1 1.00 [ 0.00]( 0.67) 0.96 [ -3.95]( 0.55)
> 2 1.00 [ 0.00]( 0.85) 0.98 [ -1.69]( 0.65)
> 4 1.00 [ 0.00]( 0.52) 0.96 [ -3.68]( 0.09)
> 8 1.00 [ 0.00]( 0.92) 0.96 [ -4.06]( 0.43)
> 16 1.00 [ 0.00]( 1.01) 0.95 [ -5.19]( 1.65)
> 32 1.00 [ 0.00]( 1.35) 0.95 [ -4.79]( 0.29)
> 64 1.00 [ 0.00]( 1.22) 0.94 [ -6.49]( 1.46)
> 128 1.00 [ 0.00]( 2.39) 0.92 [ -7.61]( 1.41)
> 256 1.00 [ 0.00]( 1.83) 0.92 [ -8.24]( 0.35)
> 512 1.00 [ 0.00]( 0.17) 0.93 [ -7.08]( 0.22)
> 1024 1.00 [ 0.00]( 0.31) 0.91 [ -8.57]( 0.29)
>
>
> ==================================================================
> Test : stream-10
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> Copy 1.00 [ 0.00]( 8.24) 1.03 [ 2.66]( 6.15)
> Scale 1.00 [ 0.00]( 5.62) 0.99 [ -1.43]( 6.32)
> Add 1.00 [ 0.00]( 6.18) 0.97 [ -3.12]( 5.70)
> Triad 1.00 [ 0.00]( 5.29) 1.01 [ 1.31]( 3.82)
>
>
> ==================================================================
> Test : stream-100
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> Copy 1.00 [ 0.00]( 2.92) 0.99 [ -1.47]( 5.02)
> Scale 1.00 [ 0.00]( 4.80) 0.98 [ -2.08]( 5.53)
> Add 1.00 [ 0.00]( 4.35) 0.98 [ -1.85]( 4.26)
> Triad 1.00 [ 0.00]( 2.30) 0.99 [ -0.84]( 1.83)
>
>
> ==================================================================
> Test : netperf
> Units : Normalized Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1-clients 1.00 [ 0.00]( 0.17) 0.97 [ -2.55]( 0.50)
> 2-clients 1.00 [ 0.00]( 0.77) 0.97 [ -2.52]( 0.20)
> 4-clients 1.00 [ 0.00]( 0.93) 0.97 [ -3.30]( 0.54)
> 8-clients 1.00 [ 0.00]( 0.87) 0.96 [ -3.98]( 1.19)
> 16-clients 1.00 [ 0.00]( 1.15) 0.96 [ -4.16]( 1.06)
> 32-clients 1.00 [ 0.00]( 1.00) 0.95 [ -5.47]( 0.96)
> 64-clients 1.00 [ 0.00]( 1.37) 0.94 [ -5.75]( 1.64)
> 128-clients 1.00 [ 0.00]( 0.99) 0.92 [ -8.50]( 1.49)
> 256-clients 1.00 [ 0.00]( 3.23) 0.90 [-10.22]( 2.86)
> 512-clients 1.00 [ 0.00](58.43) 0.90 [-10.28](47.59)
>
>
> ==================================================================
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1 1.00 [ -0.00]( 5.59) 0.55 [ 45.00](11.17)
> 2 1.00 [ -0.00](14.29) 0.52 [ 47.62]( 7.53)
> 4 1.00 [ -0.00]( 1.24) 0.57 [ 42.55]( 5.73)
> 8 1.00 [ -0.00](11.16) 1.06 [ -6.12]( 2.92)
> 16 1.00 [ -0.00]( 6.81) 1.12 [-12.28](11.09)
> 32 1.00 [ -0.00]( 6.99) 1.05 [ -5.26](12.48)
> 64 1.00 [ -0.00]( 6.00) 0.96 [ 4.21](18.31)
> 128 1.00 [ -0.00]( 3.26) 1.63 [-62.84](36.71)
> 256 1.00 [ -0.00](19.29) 0.97 [ 3.25]( 4.94)
> 512 1.00 [ -0.00]( 1.48) 1.05 [ -4.71]( 5.11)
>
>
> ==================================================================
> Test : new-schbench-requests-per-second
> Units : Normalized Requests per second
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1 1.00 [ 0.00]( 0.00) 0.95 [ -4.99]( 0.48)
> 2 1.00 [ 0.00]( 0.26) 0.96 [ -3.82]( 0.55)
> 4 1.00 [ 0.00]( 0.15) 0.95 [ -4.96]( 0.27)
> 8 1.00 [ 0.00]( 0.15) 0.99 [ -0.58]( 0.00)
> 16 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.15)
> 32 1.00 [ 0.00]( 4.88) 1.04 [ 4.27]( 2.42)
> 64 1.00 [ 0.00]( 5.57) 0.87 [-13.10](11.51)
> 128 1.00 [ 0.00]( 0.34) 0.97 [ -3.13]( 0.58)
> 256 1.00 [ 0.00]( 1.95) 1.02 [ 1.83]( 0.15)
> 512 1.00 [ 0.00]( 0.44) 1.00 [ 0.48]( 0.12)
>
>
> ==================================================================
> Test : new-schbench-wakeup-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1 1.00 [ -0.00]( 4.19) 1.00 [ -0.00](14.91)
> 2 1.00 [ -0.00]( 3.78) 0.93 [ 7.14]( 0.00)
> 4 1.00 [ -0.00]( 8.91) 0.80 [ 20.00]( 4.43)
> 8 1.00 [ -0.00]( 7.45) 1.00 [ -0.00]( 7.45)
> 16 1.00 [ -0.00]( 4.08) 1.00 [ -0.00](10.79)
> 32 1.00 [ -0.00](16.90) 0.93 [ 6.67](10.00)
> 64 1.00 [ -0.00]( 9.11) 1.12 [-12.50]( 0.00)
> 128 1.00 [ -0.00]( 7.05) 2.43 [-142.86](24.47)
OK, this was what I saw too. I'm looking into this.
> 256 1.00 [ -0.00]( 4.32) 1.02 [ -2.34]( 1.20)
> 512 1.00 [ -0.00]( 0.35) 1.01 [ -0.77]( 0.40)
>
>
> ==================================================================
> Test : new-schbench-request-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1 1.00 [ -0.00]( 0.78) 1.16 [-15.70]( 2.14)
> 2 1.00 [ -0.00]( 0.81) 1.13 [-13.11]( 0.62)
> 4 1.00 [ -0.00]( 0.24) 1.26 [-26.11](16.43)
> 8 1.00 [ -0.00]( 1.30) 1.03 [ -3.46]( 0.81)
> 16 1.00 [ -0.00]( 1.11) 1.02 [ -2.12]( 1.85)
> 32 1.00 [ -0.00]( 5.94) 0.96 [ 4.05]( 4.48)
> 64 1.00 [ -0.00]( 6.27) 1.06 [ -6.01]( 6.67)
> 128 1.00 [ -0.00]( 0.21) 1.12 [-12.31]( 2.61)
> 256 1.00 [ -0.00](13.73) 1.06 [ -6.30]( 3.37)
> 512 1.00 [ -0.00]( 0.95) 1.05 [ -4.85]( 0.61)
>
>
> ==================================================================
> Test : Various longer running benchmarks
> Units : %diff in throughput reported
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> Benchmarks: %diff
> ycsb-cassandra -1.21%
> ycsb-mongodb -0.69%
>
> deathstarbench-1x -7.40%
> deathstarbench-2x -3.80%
> deathstarbench-3x -3.99%
> deathstarbench-6x -3.02%
>
> hammerdb+mysql 16VU -2.59%
> hammerdb+mysql 64VU -1.05%
>
For long-duration tasks, the penalty of remote cache access is severe.
This might indicate an issue similar to the tbench/netperf one you
mentioned: different processes are aggregated to different LLCs, but
these processes interact with each other, and WF_SYNC did not take
effect.
>
> Also, could you fold the below diff into your Patch2:
>
Sure, let me apply it and do the test.
thanks,
Chenyu
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eb5a2572b4f8..6c51dd2b7b32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7694,8 +7694,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
> struct sched_domain_shared *sd_share;
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> -
> if (sched_feat(SIS_UTIL)) {
> sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> if (sd_share) {
> @@ -7707,6 +7705,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> }
> }
>
> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
> if (static_branch_unlikely(&sched_cluster_active)) {
> struct sched_group *sg = sd->groups;
>
> ---
>
> If the SIS_UTIL cutoff hits, the result of the cpumask_and() is of no
> use. To save some additional cycles, especially in cases where we
> target the LLC frequently and the search bails out because the LLC is
> busy, this overhead can easily be avoided. Since select_idle_cpu()
> can now be called twice per wakeup, this overhead can be visible in
> benchmarks like hackbench.
>
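
One more note on the above for anyone reading along: with the diff
applied, the flow at the top of select_idle_cpu() becomes (condensed
from fair.c):

	if (sched_feat(SIS_UTIL)) {
		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sd_share) {
			/* because !--nr is the condition to stop scan */
			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
			/* overloaded LLC is unlikely to have idle cpu/core */
			if (nr == 1)
				return -1;
		}
	}

	/* only build the candidate mask once we know we will scan */
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

so the early return -1 on a busy LLC no longer pays for a
cpumask_and() whose result is never used.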