Message-ID: <2225c999-8d06-40a9-9d55-76d2cfabacb8@intel.com>
Date: Tue, 29 Apr 2025 20:57:11 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>, Abel Wu
<wuyun.abel@...edance.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, <linux-kernel@...r.kernel.org>, "Peter
Zijlstra" <peterz@...radead.org>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Ingo Molnar <mingo@...hat.com>, Len Brown
<len.brown@...el.com>
Subject: Re: [RFC PATCH 0/5] sched: Introduce Cache aware scheduling
Hi Prateek,
On 4/29/2025 11:47 AM, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 4/21/2025 8:53 AM, Chen Yu wrote:
>> This is a respin of the cache-aware scheduling proposed by Peter[1].
>> In this patch set, some known issues in [1] were addressed, and the
>> performance regression was investigated and mitigated.
>>
>> Cache-aware scheduling aims to aggregate tasks that potentially share
>> resources into the same cache domain. This approach enhances cache
>> locality, thereby optimizing system performance by reducing cache
>> misses and improving data access efficiency.
>>
>> In the current implementation, threads within the same process are
>> considered entities that potentially share resources. Cache-aware
>> scheduling monitors the CPU occupancy of each cache domain for every
>> process. Based on this monitoring, it endeavors to migrate threads
>> within a given process to its cache-hot domains, with the goal of
>> maximizing cache locality.
>>
>> Patch 1 constitutes the fundamental cache-aware scheduling. It is
>> the same patch as [1].
>> Patch 2 comprises a series of fixes for Patch 1, including compile
>> warnings and functional fixes.
>> Patch 3 fixes the performance degradation that arises from excessive
>> task migrations within the preferred LLC domain.
>> Patch 4 further alleviates performance regressions when the preferred
>> LLC becomes saturated.
>> Patch 5 introduces ftrace events, which are used to track task
>> migrations triggered by wakeup and the load balancer. This addition
>> facilitates performance regression analysis.
>>
>> The patch set is applied on top of v6.14 sched/core,
>> commit 4ba7518327c6 ("sched/debug: Print the local group's
>> asym_prefer_cpu")
>>
>
> Thank you for working on this! I have been a bit preoccupied but I
> promise to look into the regressions I've reported below sometime
> this week and report back soon on what seems to make them unhappy.
>
Thanks for your time on this testing.
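
For readers following along, the mechanism under test can be sketched
roughly as below. This is a heavily simplified illustration with
made-up names (mm->llc_stats, mm->preferred_llc, mm_account_llc()),
not the actual Patch 1 code:

	/*
	 * Sketch: each mm keeps a decayed runtime sum per LLC; the LLC
	 * holding the largest share of the process's recent runtime
	 * becomes its preferred (cache-hot) LLC, and task placement
	 * then tries to pull the process's threads towards it.
	 */
	struct mm_llc_stats {
		u64	occupancy;	/* decayed runtime in this LLC */
	};

	static void mm_account_llc(struct mm_struct *mm, int cpu, u64 delta)
	{
		int llc = per_cpu(sd_llc_id, cpu);	/* LLC id of this CPU */
		struct mm_llc_stats *st = &mm->llc_stats[llc];

		/* decay the old occupancy, then add the fresh runtime */
		st->occupancy = (st->occupancy >> 1) + delta;

		if (st->occupancy >
		    mm->llc_stats[READ_ONCE(mm->preferred_llc)].occupancy)
			WRITE_ONCE(mm->preferred_llc, llc);
	}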
> tl;dr
>
> o Most regressions aren't as severe as v1 thanks to all the work
> from you and Abel.
>
> o I too see schbench regress in fully loaded cases but the old
> schbench tail latencies improve when #threads < #CPUs in LLC
>
> o There is a consistent regression in tbench - what I presume is
> happening there is that all threads of "tbench_srv" share an mm
> and all the tbench clients share an mm, but for best performance,
> the wakeups between client and server must be local (same core /
> same LLC); either the cost of the additional search builds up, or
> the clients get co-located as one set of entities and the servers
> get co-located as another set of entities, leading to mostly
> remote wakeups.
This is a good point. If A and B are both multi-threaded processes,
and A interacts with B frequently, we should not only consider
aggregating the threads within A and within B, but also consider
placing A and B together. I'm not sure if WF_SYNC is carried along
and takes effect during the tbench socket wakeup process. I'll also
try tbench/netperf testing.
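
Roughly, the kind of check I have in mind is below. This is purely a
sketch, not code from this series: cache_aware_target() and
mm_preferred_llc() are made-up names, and only the WF_SYNC handling
is shown:

	/*
	 * On a WF_SYNC wakeup the waker is about to sleep, so its LLC
	 * is likely cache-hot for the wakee even when waker and wakee
	 * belong to different processes (e.g. tbench client/server).
	 * Let that hint override the per-mm preferred LLC.
	 */
	static int cache_aware_target(struct task_struct *p, int prev_cpu,
				      int wake_flags)
	{
		if (wake_flags & WF_SYNC)
			return smp_processor_id();	/* stay near the waker */

		/* otherwise aggregate within p's own process */
		return mm_preferred_llc(p->mm, prev_cpu);
	}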
>
> Not too sure if netperf has a similar architecture to tbench, but
> that too sees a regression.
>
> o Longer running benchmarks see a regression. Still not sure if
> this is because of the additional search or something else.
>
> I'll leave the full results below:
>
> o Machine details
>
> - 3rd Generation EPYC System
> - 2 sockets each with 64C/128T
> - NPS1 (Each socket is a NUMA node)
> - C2 Disabled (POLL and C1(MWAIT) remained enabled)
>
> o Benchmark results
>
> ==================================================================
> Test : hackbench
> Units : Normalized time in seconds
> Interpretation: Lower is better
> Statistic : AMean
> ==================================================================
> Case: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1-groups 1.00 [ -0.00]( 9.02) 1.03 [ -3.38](11.44)
> 2-groups 1.00 [ -0.00]( 6.86) 0.98 [ 2.20]( 6.61)
> 4-groups 1.00 [ -0.00]( 2.73) 1.00 [ 0.42]( 4.00)
> 8-groups 1.00 [ -0.00]( 1.21) 1.04 [ -4.00]( 5.59)
> 16-groups 1.00 [ -0.00]( 0.97) 1.01 [ -0.52]( 2.12)
>
>
> ==================================================================
> Test : tbench
> Units : Normalized throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1 1.00 [ 0.00]( 0.67) 0.96 [ -3.95]( 0.55)
> 2 1.00 [ 0.00]( 0.85) 0.98 [ -1.69]( 0.65)
> 4 1.00 [ 0.00]( 0.52) 0.96 [ -3.68]( 0.09)
> 8 1.00 [ 0.00]( 0.92) 0.96 [ -4.06]( 0.43)
> 16 1.00 [ 0.00]( 1.01) 0.95 [ -5.19]( 1.65)
> 32 1.00 [ 0.00]( 1.35) 0.95 [ -4.79]( 0.29)
> 64 1.00 [ 0.00]( 1.22) 0.94 [ -6.49]( 1.46)
> 128 1.00 [ 0.00]( 2.39) 0.92 [ -7.61]( 1.41)
> 256 1.00 [ 0.00]( 1.83) 0.92 [ -8.24]( 0.35)
> 512 1.00 [ 0.00]( 0.17) 0.93 [ -7.08]( 0.22)
> 1024 1.00 [ 0.00]( 0.31) 0.91 [ -8.57]( 0.29)
>
>
> ==================================================================
> Test : stream-10
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> Copy 1.00 [ 0.00]( 8.24) 1.03 [ 2.66]( 6.15)
> Scale 1.00 [ 0.00]( 5.62) 0.99 [ -1.43]( 6.32)
> Add 1.00 [ 0.00]( 6.18) 0.97 [ -3.12]( 5.70)
> Triad 1.00 [ 0.00]( 5.29) 1.01 [ 1.31]( 3.82)
>
>
> ==================================================================
> Test : stream-100
> Units : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic : HMean
> ==================================================================
> Test: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> Copy 1.00 [ 0.00]( 2.92) 0.99 [ -1.47]( 5.02)
> Scale 1.00 [ 0.00]( 4.80) 0.98 [ -2.08]( 5.53)
> Add 1.00 [ 0.00]( 4.35) 0.98 [ -1.85]( 4.26)
> Triad 1.00 [ 0.00]( 2.30) 0.99 [ -0.84]( 1.83)
>
>
> ==================================================================
> Test : netperf
> Units : Normalized Throughput
> Interpretation: Higher is better
> Statistic : AMean
> ==================================================================
> Clients: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1-clients 1.00 [ 0.00]( 0.17) 0.97 [ -2.55]( 0.50)
> 2-clients 1.00 [ 0.00]( 0.77) 0.97 [ -2.52]( 0.20)
> 4-clients 1.00 [ 0.00]( 0.93) 0.97 [ -3.30]( 0.54)
> 8-clients 1.00 [ 0.00]( 0.87) 0.96 [ -3.98]( 1.19)
> 16-clients 1.00 [ 0.00]( 1.15) 0.96 [ -4.16]( 1.06)
> 32-clients 1.00 [ 0.00]( 1.00) 0.95 [ -5.47]( 0.96)
> 64-clients 1.00 [ 0.00]( 1.37) 0.94 [ -5.75]( 1.64)
> 128-clients 1.00 [ 0.00]( 0.99) 0.92 [ -8.50]( 1.49)
> 256-clients 1.00 [ 0.00]( 3.23) 0.90 [-10.22]( 2.86)
> 512-clients 1.00 [ 0.00](58.43) 0.90 [-10.28](47.59)
>
>
> ==================================================================
> Test : schbench
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1 1.00 [ -0.00]( 5.59) 0.55 [ 45.00](11.17)
> 2 1.00 [ -0.00](14.29) 0.52 [ 47.62]( 7.53)
> 4 1.00 [ -0.00]( 1.24) 0.57 [ 42.55]( 5.73)
> 8 1.00 [ -0.00](11.16) 1.06 [ -6.12]( 2.92)
> 16 1.00 [ -0.00]( 6.81) 1.12 [-12.28](11.09)
> 32 1.00 [ -0.00]( 6.99) 1.05 [ -5.26](12.48)
> 64 1.00 [ -0.00]( 6.00) 0.96 [ 4.21](18.31)
> 128 1.00 [ -0.00]( 3.26) 1.63 [-62.84](36.71)
> 256 1.00 [ -0.00](19.29) 0.97 [ 3.25]( 4.94)
> 512 1.00 [ -0.00]( 1.48) 1.05 [ -4.71]( 5.11)
>
>
> ==================================================================
> Test : new-schbench-requests-per-second
> Units : Normalized Requests per second
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1 1.00 [ 0.00]( 0.00) 0.95 [ -4.99]( 0.48)
> 2 1.00 [ 0.00]( 0.26) 0.96 [ -3.82]( 0.55)
> 4 1.00 [ 0.00]( 0.15) 0.95 [ -4.96]( 0.27)
> 8 1.00 [ 0.00]( 0.15) 0.99 [ -0.58]( 0.00)
> 16 1.00 [ 0.00]( 0.00) 1.00 [ -0.29]( 0.15)
> 32 1.00 [ 0.00]( 4.88) 1.04 [ 4.27]( 2.42)
> 64 1.00 [ 0.00]( 5.57) 0.87 [-13.10](11.51)
> 128 1.00 [ 0.00]( 0.34) 0.97 [ -3.13]( 0.58)
> 256 1.00 [ 0.00]( 1.95) 1.02 [ 1.83]( 0.15)
> 512 1.00 [ 0.00]( 0.44) 1.00 [ 0.48]( 0.12)
>
>
> ==================================================================
> Test : new-schbench-wakeup-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1 1.00 [ -0.00]( 4.19) 1.00 [ -0.00](14.91)
> 2 1.00 [ -0.00]( 3.78) 0.93 [ 7.14]( 0.00)
> 4 1.00 [ -0.00]( 8.91) 0.80 [ 20.00]( 4.43)
> 8 1.00 [ -0.00]( 7.45) 1.00 [ -0.00]( 7.45)
> 16 1.00 [ -0.00]( 4.08) 1.00 [ -0.00](10.79)
> 32 1.00 [ -0.00](16.90) 0.93 [ 6.67](10.00)
> 64 1.00 [ -0.00]( 9.11) 1.12 [-12.50]( 0.00)
> 128 1.00 [ -0.00]( 7.05) 2.43 [-142.86](24.47)
OK, this was what I saw too. I'm looking into this.
> 256 1.00 [ -0.00]( 4.32) 1.02 [ -2.34]( 1.20)
> 512 1.00 [ -0.00]( 0.35) 1.01 [ -0.77]( 0.40)
>
>
> ==================================================================
> Test : new-schbench-request-latency
> Units : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic : Median
> ==================================================================
> #workers: tip[pct imp](CV) cache_aware_lb[pct imp](CV)
> 1 1.00 [ -0.00]( 0.78) 1.16 [-15.70]( 2.14)
> 2 1.00 [ -0.00]( 0.81) 1.13 [-13.11]( 0.62)
> 4 1.00 [ -0.00]( 0.24) 1.26 [-26.11](16.43)
> 8 1.00 [ -0.00]( 1.30) 1.03 [ -3.46]( 0.81)
> 16 1.00 [ -0.00]( 1.11) 1.02 [ -2.12]( 1.85)
> 32 1.00 [ -0.00]( 5.94) 0.96 [ 4.05]( 4.48)
> 64 1.00 [ -0.00]( 6.27) 1.06 [ -6.01]( 6.67)
> 128 1.00 [ -0.00]( 0.21) 1.12 [-12.31]( 2.61)
> 256 1.00 [ -0.00](13.73) 1.06 [ -6.30]( 3.37)
> 512 1.00 [ -0.00]( 0.95) 1.05 [ -4.85]( 0.61)
>
>
> ==================================================================
> Test : Various longer running benchmarks
> Units : %diff in throughput reported
> Interpretation: Higher is better
> Statistic : Median
> ==================================================================
> Benchmarks: %diff
> ycsb-cassandra -1.21%
> ycsb-mongodb -0.69%
>
> deathstarbench-1x -7.40%
> deathstarbench-2x -3.80%
> deathstarbench-3x -3.99%
> deathstarbench-6x -3.02%
>
> hammerdb+mysql 16VU -2.59%
> hammerdb+mysql 64VU -1.05%
>
For long-duration tasks, the penalty of remote cache access is severe.
This might indicate an issue similar to the tbench/netperf one you
mentioned: different processes are aggregated to different LLCs, but
these processes interact with each other, and WF_SYNC did not take
effect.
>
> Also, could you fold the below diff into your Patch2:
>
Sure, let me apply it and do the test.
thanks,
Chenyu
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eb5a2572b4f8..6c51dd2b7b32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7694,8 +7694,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> int i, cpu, idle_cpu = -1, nr = INT_MAX;
> struct sched_domain_shared *sd_share;
>
> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> -
> if (sched_feat(SIS_UTIL)) {
> sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> if (sd_share) {
> @@ -7707,6 +7705,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> }
> }
>
> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +
> if (static_branch_unlikely(&sched_cluster_active)) {
> struct sched_group *sg = sd->groups;
>
> ---
>
> If the SIS_UTIL cutoff hits, the result of the cpumask_and() is of no
> use. To save some additional cycles, especially in cases where we
> target the LLC frequently and the search bails out because the LLC is
> busy, this overhead can easily be avoided. Since select_idle_cpu()
> can now be called twice per wakeup, this overhead can be visible in
> benchmarks like hackbench.
>
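
One more note on the above for anyone reading along: with the diff
applied, the flow at the top of select_idle_cpu() becomes (condensed
from fair.c):

	if (sched_feat(SIS_UTIL)) {
		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sd_share) {
			/* because !--nr is the condition to stop scan */
			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
			/* overloaded LLC is unlikely to have idle cpu/core */
			if (nr == 1)
				return -1;
		}
	}

	/* only build the candidate mask once we know we will scan */
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

so the early return -1 on a busy LLC no longer pays for a
cpumask_and() whose result is never used.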