Message-ID: <7c5fcd32-1f0f-4148-ab0e-0a25ea11c10f@amd.com>
Date: Tue, 29 Apr 2025 09:17:43 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Chen Yu <yu.c.chen@...el.com>, Peter Zijlstra <peterz@...radead.org>,
"Ingo Molnar" <mingo@...hat.com>, "Gautham R . Shenoy"
<gautham.shenoy@....com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>, Abel Wu
<wuyun.abel@...edance.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 0/5] sched: Introduce Cache aware scheduling
Hello Chenyu,
On 4/21/2025 8:53 AM, Chen Yu wrote:
> This is a respin of the cache-aware scheduling proposed by Peter[1].
> In this patch set, some known issues in [1] were addressed, and the performance
> regression was investigated and mitigated.
>
> Cache-aware scheduling aims to aggregate tasks with potential shared resources
> into the same cache domain. This approach enhances cache locality, thereby optimizing
> system performance by reducing cache misses and improving data access efficiency.
>
> In the current implementation, threads within the same process are considered
> entities that potentially share resources. Cache-aware scheduling monitors the CPU
> occupancy of each cache domain for every process. Based on this monitoring, it endeavors
> to migrate threads within a given process to its cache-hot domains, with the goal of
> maximizing cache locality.
>
> Patch 1 constitutes the fundamental cache-aware scheduling. It is the same patch as [1].
> Patch 2 comprises a series of fixes for Patch 1, including compile warning and functional
> fixes.
> Patch 3 fixes performance degradation that arises from excessive task migrations within the
> preferred LLC domain.
> Patch 4 further alleviates performance regressions when the preferred LLC becomes saturated.
> Patch 5 introduces ftrace events, which are used to track task migrations triggered by wakeup
> and the load balancer. This addition facilitates performance regression analysis.
>
> The patch set is applied on top of v6.14 sched/core,
> commit 4ba7518327c6 ("sched/debug: Print the local group's asym_prefer_cpu")
>
Thank you for working on this! I have been a bit preoccupied but I
promise to look into the regressions I've reported below sometime
this week and report back soon on what seems to make them unhappy.
tl;dr
o Most regressions aren't as severe as v1 thanks to all the work
from you and Abel.
o I too see schbench regress in fully loaded cases, but the old
schbench tail latencies improve when #threads < #CPUs in the LLC.
o There is a consistent regression in tbench - what I presume is
happening there is that all threads of "tbench_srv" share an mm
and all the tbench clients share an mm, but for best performance
the wakeups between client and server must be local (same core /
same LLC). Either the cost of the additional search builds up, or
the clients get co-located as one set of entities and the servers
get co-located as another set, leading to mostly remote wakeups.
Not too sure if netperf has a similar architecture to tbench, but
it too sees a regression.
o Longer running benchmarks see a regression. Still not sure if
this is because of the additional search or something else.
I'll leave the full results below:
o Machine details
- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)
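A quick note on the format of the tables that follow: every value is
normalized to the tip baseline, the number in square brackets is the
percentage improvement over tip (negative means a regression), and the
number in parentheses is the coefficient of variation across runs. As a
rough illustration of that convention (my own sketch, not the actual
harness that produced these numbers):

/* Illustration only: how the "norm [pct imp](CV)" columns are derived */
#include <math.h>
#include <stdio.h>

static double mean(const double *s, int n)
{
	double sum = 0.0;
	int i;

	for (i = 0; i < n; i++)
		sum += s[i];
	return sum / n;
}

/* CV (%): run-to-run stddev relative to the mean (population stddev here) */
static double cv_pct(const double *s, int n)
{
	double m = mean(s, n), var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		var += (s[i] - m) * (s[i] - m);
	return 100.0 * sqrt(var / n) / m;
}

int main(void)
{
	/* made-up runtimes in seconds (lower is better), three runs each */
	double tip[]  = { 1.02, 0.98, 1.00 };
	double test[] = { 1.05, 1.03, 1.01 };
	double tip_m  = mean(tip, 3), test_m = mean(test, 3);

	/* normalized value, %imp (sign flips for higher-is-better metrics), CV */
	printf("%.2f [%6.2f](%5.2f)\n", test_m / tip_m,
	       100.0 * (tip_m - test_m) / tip_m, cv_pct(test, 3));
	return 0;
}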
o Benchmark results
==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
1-groups 1.00 [ -0.00]( 9.02) 1.03 [ -3.38](11.44)
2-groups 1.00 [ -0.00]( 6.86) 0.98 [ 2.20]( 6.61)
4-groups 1.00 [ -0.00]( 2.73) 1.00 [ 0.42]( 4.00)
8-groups 1.00 [ -0.00]( 1.21) 1.04 [ -4.00]( 5.59)
16-groups 1.00 [ -0.00]( 0.97) 1.01 [ -0.52]( 2.12)
==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
1 1.00 [ 0.00]( 0.67) 0.96 [ -3.95]( 0.55)
2 1.00 [ 0.00]( 0.85) 0.98 [ -1.69]( 0.65)
4 1.00 [ 0.00]( 0.52) 0.96 [ -3.68]( 0.09)
8 1.00 [ 0.00]( 0.92) 0.96 [ -4.06]( 0.43)
16 1.00 [ 0.00]( 1.01) 0.95 [ -5.19]( 1.65)
32 1.00 [ 0.00]( 1.35) 0.95 [ -4.79]( 0.29)
64 1.00 [ 0.00]( 1.22) 0.94 [ -6.49]( 1.46)
128 1.00 [ 0.00]( 2.39) 0.92 [ -7.61]( 1.41)
256 1.00 [ 0.00]( 1.83) 0.92 [ -8.24]( 0.35)
512 1.00 [ 0.00]( 0.17) 0.93 [ -7.08]( 0.22)
1024 1.00 [ 0.00]( 0.31) 0.91 [ -8.57]( 0.29)
==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
Copy 1.00 [ 0.00]( 8.24) 1.03 [ 2.66]( 6.15)
Scale 1.00 [ 0.00]( 5.62) 0.99 [ -1.43]( 6.32)
Add 1.00 [ 0.00]( 6.18) 0.97 [ -3.12]( 5.70)
Triad 1.00 [ 0.00]( 5.29) 1.01 [ 1.31]( 3.82)
==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
Copy 1.00 [ 0.00]( 2.92) 0.99 [ -1.47]( 5.02)
Scale 1.00 [ 0.00]( 4.80) 0.98 [ -2.08]( 5.53)
Add 1.00 [ 0.00]( 4.35) 0.98 [ -1.85]( 4.26)
Triad 1.00 [ 0.00]( 2.30) 0.99 [ -0.84]( 1.83)
==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
1-clients 1.00 [ 0.00]( 0.17) 0.97 [ -2.55]( 0.50)
2-clients 1.00 [ 0.00]( 0.77) 0.97 [ -2.52]( 0.20)
4-clients 1.00 [ 0.00]( 0.93) 0.97 [ -3.30]( 0.54)
8-clients 1.00 [ 0.00]( 0.87) 0.96 [ -3.98]( 1.19)
16-clients 1.00 [ 0.00]( 1.15) 0.96 [ -4.16]( 1.06)
32-clients 1.00 [ 0.00]( 1.00) 0.95 [ -5.47]( 0.96)
64-clients 1.00 [ 0.00]( 1.37) 0.94 [ -5.75]( 1.64)
128-clients 1.00 [ 0.00]( 0.99) 0.92 [ -8.50]( 1.49)
256-clients 1.00 [ 0.00]( 3.23) 0.90 [-10.22]( 2.86)
512-clients 1.00 [ 0.00](58.43) 0.90 [-10.28](47.59)
==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
1 1.00 [ -0.00]( 5.59) 0.55 [ 45.00](11.17)
2 1.00 [ -0.00](14.29) 0.52 [ 47.62]( 7.53)
4 1.00 [ -0.00]( 1.24) 0.57 [ 42.55]( 5.73)
8 1.00 [ -0.00](11.16) 1.06 [ -6.12]( 2.92)
16 1.00 [ -0.00]( 6.81) 1.12 [-12.28](11.09)
32 1.00 [ -0.00]( 6.99) 1.05 [ -5.26](12.48)
64 1.00 [ -0.00]( 6.00) 0.96 [ 4.21](18.31)
128 1.00 [ -0.00]( 3.26) 1.63 [-62.84](36.71)
256 1.00 [ -0.00](19.29) 0.97 [ 3.25]( 4.94)
512 1.00 [ -0.00]( 1.48) 1.05 [ -4.71]( 5.11)
==================================================================
Test : new-schbench-requests-per-second
Units : Normalized Requests per second
Interpretation: Higher is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
1 1.00 [ 0.00]( 0.00) 0.95 [ -4.99]( 0.48)
2 1.00 [ 0.00]( 0.26) 0.96 [ -3.82]( 0.55)
4 1.00 [ 0.00]( 0.15) 0.95 [ -4.96]( 0.27)
8 1.00 [ 0.00]( 0.15) 0.99 [ -0.58]( 0.00)
16 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.15)
32 1.00 [ 0.00]( 4.88) 1.04 [ 4.27]( 2.42)
64 1.00 [ 0.00]( 5.57) 0.87 [-13.10](11.51)
128 1.00 [ 0.00]( 0.34) 0.97 [ -3.13]( 0.58)
256 1.00 [ 0.00]( 1.95) 1.02 [ 1.83]( 0.15)
512 1.00 [ 0.00]( 0.44) 1.00 [ 0.48]( 0.12)
==================================================================
Test : new-schbench-wakeup-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
1 1.00 [ -0.00]( 4.19) 1.00 [ -0.00](14.91)
2 1.00 [ -0.00]( 3.78) 0.93 [ 7.14]( 0.00)
4 1.00 [ -0.00]( 8.91) 0.80 [ 20.00]( 4.43)
8 1.00 [ -0.00]( 7.45) 1.00 [ -0.00]( 7.45)
16 1.00 [ -0.00]( 4.08) 1.00 [ -0.00](10.79)
32 1.00 [ -0.00](16.90) 0.93 [ 6.67](10.00)
64 1.00 [ -0.00]( 9.11) 1.12 [-12.50]( 0.00)
128 1.00 [ -0.00]( 7.05) 2.43 [-142.86](24.47)
256 1.00 [ -0.00]( 4.32) 1.02 [ -2.34]( 1.20)
512 1.00 [ -0.00]( 0.35) 1.01 [ -0.77]( 0.40)
==================================================================
Test : new-schbench-request-latency
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
1 1.00 [ -0.00]( 0.78) 1.16 [-15.70]( 2.14)
2 1.00 [ -0.00]( 0.81) 1.13 [-13.11]( 0.62)
4 1.00 [ -0.00]( 0.24) 1.26 [-26.11](16.43)
8 1.00 [ -0.00]( 1.30) 1.03 [ -3.46]( 0.81)
16 1.00 [ -0.00]( 1.11) 1.02 [ -2.12]( 1.85)
32 1.00 [ -0.00]( 5.94) 0.96 [ 4.05]( 4.48)
64 1.00 [ -0.00]( 6.27) 1.06 [ -6.01]( 6.67)
128 1.00 [ -0.00]( 0.21) 1.12 [-12.31]( 2.61)
256 1.00 [ -0.00](13.73) 1.06 [ -6.30]( 3.37)
512 1.00 [ -0.00]( 0.95) 1.05 [ -4.85]( 0.61)
==================================================================
Test : Various longer running benchmarks
Units : %diff in throughput reported
Interpretation: Higher is better
Statistic : Median
==================================================================
Benchmarks: %diff
ycsb-cassandra -1.21%
ycsb-mongodb -0.69%
deathstarbench-1x -7.40%
deathstarbench-2x -3.80%
deathstarbench-3x -3.99%
deathstarbench-6x -3.02%
hammerdb+mysql 16VU -2.59%
hammerdb+mysql 64VU -1.05%
Also, could you fold the below diff into your Patch 2:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb5a2572b4f8..6c51dd2b7b32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7694,8 +7694,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	int i, cpu, idle_cpu = -1, nr = INT_MAX;
 	struct sched_domain_shared *sd_share;
 
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
 	if (sched_feat(SIS_UTIL)) {
 		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
 		if (sd_share) {
@@ -7707,6 +7705,8 @@
 		}
 	}
 
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
 	if (static_branch_unlikely(&sched_cluster_active)) {
 		struct sched_group *sg = sd->groups;
 
---
If the SIS_UTIL cutoff hits, the result of the cpumask_and() is of no
use. To save some additional cycles, especially in cases where we target
the LLC frequently and the search bails out because the LLC is busy,
this overhead can be easily avoided. Since select_idle_cpu() can now be
called twice per wakeup, this overhead can be visible in benchmarks like
hackbench.
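For reference, with the hunk above applied, the relevant flow looks
roughly like this (a trimmed down sketch of select_idle_cpu(), not the
exact upstream code; the scan itself is elided):

static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd,
			   bool has_idle_core, int target)
{
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
	int i, cpu, idle_cpu = -1, nr = INT_MAX;
	struct sched_domain_shared *sd_share;

	if (sched_feat(SIS_UTIL)) {
		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sd_share) {
			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
			/* Busy LLC: bail out before "cpus" is ever touched */
			if (nr == 1)
				return -1;
		}
	}

	/* Build the candidate mask only when a scan will actually happen */
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	/* ... idle-core / idle-CPU scan over "cpus" ... */

	return idle_cpu;
}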
--
Thanks and Regards,
Prateek
> schbench was tested on EMR and Zen3 Milan. An improvement in tail latency was observed when
> the LLC was underloaded; however, some regressions were still evident when the LLC was
> saturated. Additionally, the load balancer should be adjusted to further address these
> regressions.
>
> [1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
>
>
> Chen Yu (4):
> sched: Several fixes for cache aware scheduling
> sched: Avoid task migration within its preferred LLC
> sched: Inhibit cache aware scheduling if the preferred LLC is over
> aggregated
> sched: Add ftrace to track task migration and load balance within and
> across LLC
>
> Peter Zijlstra (1):
> sched: Cache aware load-balancing
>