Message-ID: <7c5fcd32-1f0f-4148-ab0e-0a25ea11c10f@amd.com>
Date: Tue, 29 Apr 2025 09:17:43 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Chen Yu <yu.c.chen@...el.com>, Peter Zijlstra <peterz@...radead.org>,
	"Ingo Molnar" <mingo@...hat.com>, "Gautham R . Shenoy"
	<gautham.shenoy@....com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
	<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
	<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
	<vschneid@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>, Abel Wu
	<wuyun.abel@...edance.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
	Hillf Danton <hdanton@...a.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC PATCH 0/5] sched: Introduce Cache aware scheduling

Hello Chenyu,

On 4/21/2025 8:53 AM, Chen Yu wrote:
> This is a respin of the cache-aware scheduling proposed by Peter[1].
> In this patch set, some known issues in [1] were addressed, and the performance
> regression was investigated and mitigated.
> 
> Cache-aware scheduling aims to aggregate tasks with potential shared resources
> into the same cache domain. This approach enhances cache locality, thereby optimizing
> system performance by reducing cache misses and improving data access efficiency.
> 
> In the current implementation, threads within the same process are considered
> entities that potentially share resources. Cache-aware scheduling monitors the CPU
> occupancy of each cache domain for every process. Based on this monitoring, it endeavors
> to migrate the threads of a given process to its cache-hot domains, with the goal of
> maximizing cache locality.
> 
> Patch 1 constitutes the fundamental cache-aware scheduling. It is the same as the patch in [1].
> Patch 2 comprises a series of fixes for Patch 1, covering both compile warnings and functional
> issues.
> Patch 3 fixes the performance degradation that arises from excessive task migrations within the
> preferred LLC domain.
> Patch 4 further alleviates performance regressions when the preferred LLC becomes saturated.
> Patch 5 introduces ftrace events, which are used to track task migrations triggered by the wakeup
> path and the load balancer. This addition facilitates performance regression analysis.
> 
> The patch set is applied on top of v6.14 sched/core,
> commit 4ba7518327c6 ("sched/debug: Print the local group's asym_prefer_cpu")
> 

Thank you for working on this! I have been a bit preoccupied but I
promise to look into the regressions I've reported below sometime
this week and report back soon on what seems to make them unhappy.

tl;dr

o Most regressions aren't as severe as in v1 thanks to all the work
   from you and Abel.

o I too see schbench regress in fully loaded cases, but the old
   schbench tail latencies improve when #threads < #CPUs in the LLC.

o There is a consistent regression in tbench. What I presume is
   happening there is that all threads of "tbench_srv" share one mm
   and all the tbench clients share another. For best performance,
   the wakeups between client and server must be local (same core /
   same LLC), but either the cost of the additional search builds up,
   or the clients get co-located as one set of entities and the
   servers get co-located as another set of entities, leading to
   mostly remote wakeups.

   Not too sure if netperf has a similar architecture to tbench, but
   that too sees a regression.

o Longer running benchmarks see a regression. Still not sure if
   this is because of the additional search or something else.
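
For reference while reading the numbers below, my rough mental model of
the mechanism described in the cover letter is the sketch that follows.
Every identifier in it is made up for illustration; it is not the code
from the series:

/*
 * Illustrative sketch only: NR_LLC_DOMAINS, mm->llc_stats,
 * llc_is_saturated() and idle_cpu_in_llc() are all hypothetical.
 */
struct mm_llc_stats {
	u64	occupancy[NR_LLC_DOMAINS];	/* recent CPU time of this process's threads, per LLC */
	int	preferred_llc;			/* cache-hot LLC: the one with the highest recent occupancy */
};

/* Periodically recompute the cache-hot (preferred) LLC for a process. */
static void mm_update_preferred_llc(struct mm_llc_stats *stats)
{
	int llc, best = 0;

	for (llc = 1; llc < NR_LLC_DOMAINS; llc++) {
		if (stats->occupancy[llc] > stats->occupancy[best])
			best = llc;
	}
	stats->preferred_llc = best;
}

/*
 * At wakeup, try to place the task within its process's preferred LLC
 * as long as that LLC is not already saturated; otherwise fall back to
 * the regular wakeup path.
 */
static int cache_aware_select_cpu(struct task_struct *p, int prev_cpu)
{
	struct mm_llc_stats *stats = p->mm ? p->mm->llc_stats : NULL;
	int cpu;

	if (!stats || llc_is_saturated(stats->preferred_llc))
		return prev_cpu;

	cpu = idle_cpu_in_llc(stats->preferred_llc, p);
	return cpu >= 0 ? cpu : prev_cpu;
}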

I'll leave the full results below:

o Machine details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- C2 Disabled (POLL and C1(MWAIT) remained enabled)

o Benchmark results

   ==================================================================
   Test          : hackbench
   Units         : Normalized time in seconds
   Interpretation: Lower is better
   Statistic     : AMean
   ==================================================================
   Case:           tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    1-groups     1.00 [ -0.00]( 9.02)     1.03 [ -3.38](11.44)
    2-groups     1.00 [ -0.00]( 6.86)     0.98 [  2.20]( 6.61)
    4-groups     1.00 [ -0.00]( 2.73)     1.00 [  0.42]( 4.00)
    8-groups     1.00 [ -0.00]( 1.21)     1.04 [ -4.00]( 5.59)
   16-groups     1.00 [ -0.00]( 0.97)     1.01 [ -0.52]( 2.12)


   ==================================================================
   Test          : tbench
   Units         : Normalized throughput
   Interpretation: Higher is better
   Statistic     : AMean
   ==================================================================
   Clients:    tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
       1     1.00 [  0.00]( 0.67)     0.96 [ -3.95]( 0.55)
       2     1.00 [  0.00]( 0.85)     0.98 [ -1.69]( 0.65)
       4     1.00 [  0.00]( 0.52)     0.96 [ -3.68]( 0.09)
       8     1.00 [  0.00]( 0.92)     0.96 [ -4.06]( 0.43)
      16     1.00 [  0.00]( 1.01)     0.95 [ -5.19]( 1.65)
      32     1.00 [  0.00]( 1.35)     0.95 [ -4.79]( 0.29)
      64     1.00 [  0.00]( 1.22)     0.94 [ -6.49]( 1.46)
     128     1.00 [  0.00]( 2.39)     0.92 [ -7.61]( 1.41)
     256     1.00 [  0.00]( 1.83)     0.92 [ -8.24]( 0.35)
     512     1.00 [  0.00]( 0.17)     0.93 [ -7.08]( 0.22)
    1024     1.00 [  0.00]( 0.31)     0.91 [ -8.57]( 0.29)


   ==================================================================
   Test          : stream-10
   Units         : Normalized Bandwidth, MB/s
   Interpretation: Higher is better
   Statistic     : HMean
   ==================================================================
   Test:       tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    Copy     1.00 [  0.00]( 8.24)     1.03 [  2.66]( 6.15)
   Scale     1.00 [  0.00]( 5.62)     0.99 [ -1.43]( 6.32)
     Add     1.00 [  0.00]( 6.18)     0.97 [ -3.12]( 5.70)
   Triad     1.00 [  0.00]( 5.29)     1.01 [  1.31]( 3.82)


   ==================================================================
   Test          : stream-100
   Units         : Normalized Bandwidth, MB/s
   Interpretation: Higher is better
   Statistic     : HMean
   ==================================================================
   Test:       tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    Copy     1.00 [  0.00]( 2.92)     0.99 [ -1.47]( 5.02)
   Scale     1.00 [  0.00]( 4.80)     0.98 [ -2.08]( 5.53)
     Add     1.00 [  0.00]( 4.35)     0.98 [ -1.85]( 4.26)
   Triad     1.00 [  0.00]( 2.30)     0.99 [ -0.84]( 1.83)


   ==================================================================
   Test          : netperf
   Units         : Normalized Throughput
   Interpretation: Higher is better
   Statistic     : AMean
   ==================================================================
   Clients:         tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
    1-clients     1.00 [  0.00]( 0.17)     0.97 [ -2.55]( 0.50)
    2-clients     1.00 [  0.00]( 0.77)     0.97 [ -2.52]( 0.20)
    4-clients     1.00 [  0.00]( 0.93)     0.97 [ -3.30]( 0.54)
    8-clients     1.00 [  0.00]( 0.87)     0.96 [ -3.98]( 1.19)
   16-clients     1.00 [  0.00]( 1.15)     0.96 [ -4.16]( 1.06)
   32-clients     1.00 [  0.00]( 1.00)     0.95 [ -5.47]( 0.96)
   64-clients     1.00 [  0.00]( 1.37)     0.94 [ -5.75]( 1.64)
   128-clients    1.00 [  0.00]( 0.99)     0.92 [ -8.50]( 1.49)
   256-clients    1.00 [  0.00]( 3.23)     0.90 [-10.22]( 2.86)
   512-clients    1.00 [  0.00](58.43)     0.90 [-10.28](47.59)


   ==================================================================
   Test          : schbench
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [ -0.00]( 5.59)     0.55 [ 45.00](11.17)
     2     1.00 [ -0.00](14.29)     0.52 [ 47.62]( 7.53)
     4     1.00 [ -0.00]( 1.24)     0.57 [ 42.55]( 5.73)
     8     1.00 [ -0.00](11.16)     1.06 [ -6.12]( 2.92)
    16     1.00 [ -0.00]( 6.81)     1.12 [-12.28](11.09)
    32     1.00 [ -0.00]( 6.99)     1.05 [ -5.26](12.48)
    64     1.00 [ -0.00]( 6.00)     0.96 [  4.21](18.31)
   128     1.00 [ -0.00]( 3.26)     1.63 [-62.84](36.71)
   256     1.00 [ -0.00](19.29)     0.97 [  3.25]( 4.94)
   512     1.00 [ -0.00]( 1.48)     1.05 [ -4.71]( 5.11)


   ==================================================================
   Test          : new-schbench-requests-per-second
   Units         : Normalized Requests per second
   Interpretation: Higher is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [  0.00]( 0.00)     0.95 [ -4.99]( 0.48)
     2     1.00 [  0.00]( 0.26)     0.96 [ -3.82]( 0.55)
     4     1.00 [  0.00]( 0.15)     0.95 [ -4.96]( 0.27)
     8     1.00 [  0.00]( 0.15)     0.99 [ -0.58]( 0.00)
    16     1.00 [  0.00]( 0.00)     1.00 [ -0.29]( 0.15)
    32     1.00 [  0.00]( 4.88)     1.04 [  4.27]( 2.42)
    64     1.00 [  0.00]( 5.57)     0.87 [-13.10](11.51)
   128     1.00 [  0.00]( 0.34)     0.97 [ -3.13]( 0.58)
   256     1.00 [  0.00]( 1.95)     1.02 [  1.83]( 0.15)
   512     1.00 [  0.00]( 0.44)     1.00 [  0.48]( 0.12)


   ==================================================================
   Test          : new-schbench-wakeup-latency
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [ -0.00]( 4.19)     1.00 [ -0.00](14.91)
     2     1.00 [ -0.00]( 3.78)     0.93 [  7.14]( 0.00)
     4     1.00 [ -0.00]( 8.91)     0.80 [ 20.00]( 4.43)
     8     1.00 [ -0.00]( 7.45)     1.00 [ -0.00]( 7.45)
    16     1.00 [ -0.00]( 4.08)     1.00 [ -0.00](10.79)
    32     1.00 [ -0.00](16.90)     0.93 [  6.67](10.00)
    64     1.00 [ -0.00]( 9.11)     1.12 [-12.50]( 0.00)
   128     1.00 [ -0.00]( 7.05)     2.43 [-142.86](24.47)
   256     1.00 [ -0.00]( 4.32)     1.02 [ -2.34]( 1.20)
   512     1.00 [ -0.00]( 0.35)     1.01 [ -0.77]( 0.40)


   ==================================================================
   Test          : new-schbench-request-latency
   Units         : Normalized 99th percentile latency in us
   Interpretation: Lower is better
   Statistic     : Median
   ==================================================================
   #workers: tip[pct imp](CV)    cache_aware_lb[pct imp](CV)
     1     1.00 [ -0.00]( 0.78)     1.16 [-15.70]( 2.14)
     2     1.00 [ -0.00]( 0.81)     1.13 [-13.11]( 0.62)
     4     1.00 [ -0.00]( 0.24)     1.26 [-26.11](16.43)
     8     1.00 [ -0.00]( 1.30)     1.03 [ -3.46]( 0.81)
    16     1.00 [ -0.00]( 1.11)     1.02 [ -2.12]( 1.85)
    32     1.00 [ -0.00]( 5.94)     0.96 [  4.05]( 4.48)
    64     1.00 [ -0.00]( 6.27)     1.06 [ -6.01]( 6.67)
   128     1.00 [ -0.00]( 0.21)     1.12 [-12.31]( 2.61)
   256     1.00 [ -0.00](13.73)     1.06 [ -6.30]( 3.37)
   512     1.00 [ -0.00]( 0.95)     1.05 [ -4.85]( 0.61)


   ==================================================================
   Test          : Various longer running benchmarks
   Units         : %diff in throughput reported
   Interpretation: Higher is better
   Statistic     : Median
   ==================================================================
   Benchmarks:                 %diff
   ycsb-cassandra              -1.21%
   ycsb-mongodb                -0.69%

   deathstarbench-1x           -7.40%
   deathstarbench-2x           -3.80%
   deathstarbench-3x           -3.99%
   deathstarbench-6x           -3.02%

   hammerdb+mysql 16VU         -2.59%
   hammerdb+mysql 64VU         -1.05%


Also, could you fold the below diff into your Patch 2:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb5a2572b4f8..6c51dd2b7b32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7694,8 +7694,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
  	int i, cpu, idle_cpu = -1, nr = INT_MAX;
  	struct sched_domain_shared *sd_share;
  
-	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
-
  	if (sched_feat(SIS_UTIL)) {
  		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
  		if (sd_share) {
@@ -7707,6 +7705,8 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
  		}
  	}
  
+	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
+
  	if (static_branch_unlikely(&sched_cluster_active)) {
  		struct sched_group *sg = sd->groups;
  
---

If the SIS_UTIL cutoff hits, the result of the cpumask_and() is of no
use, so moving it after the check saves some cycles, especially in cases
where we target the LLC frequently and the search bails out because the
LLC is busy. Since select_idle_cpu() can now be called twice per wakeup,
this overhead can be visible in benchmarks like hackbench.
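
To make the intent concrete, with the diff folded in, the prologue of
select_idle_cpu() would look roughly like the sketch below (trimmed from
the v6.14-era code around the hunk; not a complete listing):

static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
{
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
	int i, cpu, idle_cpu = -1, nr = INT_MAX;
	struct sched_domain_shared *sd_share;

	if (sched_feat(SIS_UTIL)) {
		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sd_share) {
			/* because !--nr is the condition to stop scan */
			nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
			/* overloaded LLC is unlikely to have idle cpu/core */
			if (nr == 1)
				return -1;	/* bail out before building the candidate mask */
		}
	}

	/* Build the candidate mask only when a scan will actually run. */
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	if (static_branch_unlikely(&sched_cluster_active)) {
		struct sched_group *sg = sd->groups;
		/* ... rest of the scan is unchanged ... */
	}
	/* ... */
}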

-- 
Thanks and Regards,
Prateek

> schbench was tested on EMR and Zen3 Milan. An improvement in tail latency was observed when
> the LLC was underloaded; however, some regressions were still evident when the LLC was
> saturated. Additionally, load balancing should be adjusted to further address these
> regressions.
> 
> [1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> 
> 
> Chen Yu (4):
>    sched: Several fixes for cache aware scheduling
>    sched: Avoid task migration within its preferred LLC
>    sched: Inhibit cache aware scheduling if the preferred LLC is over
>      aggregated
>    sched: Add ftrace to track task migration and load balance within and
>      across LLC
> 
> Peter Zijlstra (1):
>    sched: Cache aware load-balancing
> 
