Message-ID: <d6a19a7d-e435-4a27-9725-b2fb802a52fd@intel.com>
Date: Tue, 2 Sep 2025 14:14:14 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Tingyin Duan <tingyin.duan@...il.com>
CC: <aubrey.li@...el.com>, <bsegall@...gle.com>, <cyy@...self.name>,
<dietmar.eggemann@....com>, <gautham.shenoy@....com>, <hdanton@...a.com>,
<jianyong.wu@...look.com>, <juri.lelli@...hat.com>, <kprateek.nayak@....com>,
<len.brown@...el.com>, <libo.chen@...cle.com>,
<linux-kernel@...r.kernel.org>, <mgorman@...e.de>, <mingo@...hat.com>,
<peterz@...radead.org>, <rostedt@...dmis.org>, <sshegde@...ux.ibm.com>,
<tim.c.chen@...ux.intel.com>, <vernhao@...cent.com>,
<vincent.guittot@...aro.org>, <vineethr@...ux.ibm.com>,
<vschneid@...hat.com>, <yu.chen.surf@...il.com>, <zhao1.liu@...el.com>
Subject: Re: [RFC PATCH v4 25/28] sched: Skip cache aware scheduling if the
process has many active threads
On 9/2/2025 1:16 PM, Tingyin Duan wrote:
> Several different test cases with mysql and sysbench show that this patch
> causes about a 10% performance regression on my computer with 256 cores.
> Perf-top shows exceed_llc_nr is high. Could you help address this problem?
Thanks for bringing this up for public discussion. As we synced offline, the
performance regression is likely caused by cache contention introduced by the
[25/28] patch:
The 1st issue: multiple threads within the same process read
mm_struct->nr_running_avg while task_cache_work() modifies
mm_struct->mm_sched_cpu from time to time. Since mm_sched_cpu and
nr_running_avg sit in the same cacheline, each update turns the cacheline
into the "Modified" state, and the subsequent reads trigger costly "HITM"
events. We should move nr_running_avg and mm_sched_cpu to different
cachelines.
The 2nd issue: if nr_running_avg remains consistently above the threshold
in your test case, exceed_llc_nr() always returns true. This can cause
frequent writes of -1 to mm->mm_sched_cpu even when it is already -1,
which creates further cache contention for threads on other CPUs trying
to read mm->mm_sched_cpu. We should update the mm_struct's mm_sched_cpu
field only when the value has actually changed.
That is to say, the following patch should fix the regression:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2cca039d6e4f..3c1c50134647 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1032,7 +1032,11 @@ struct mm_struct {
 	raw_spinlock_t mm_sched_lock;
 	unsigned long mm_sched_epoch;
 	int mm_sched_cpu;
-	u64 nr_running_avg;
+	/*
+	 * mm_sched_cpu and nr_running_avg are put into separate
+	 * cachelines to avoid cache contention.
+	 */
+	u64 nr_running_avg ____cacheline_aligned_in_smp;
 #endif
#ifdef CONFIG_MMU
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 026013c826d9..4ef28db57a37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1428,7 +1428,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	    get_nr_threads(p) <= 1 ||
 	    exceed_llc_nr(mm, cpu_of(rq)) ||
 	    exceed_llc_capacity(mm, cpu_of(rq))) {
-		mm->mm_sched_cpu = -1;
+		if (mm->mm_sched_cpu != -1)
+			mm->mm_sched_cpu = -1;
 		pcpu_sched->occ = 0;
 	}
--
2.25.1
With the above change, the regression I mentioned in the cover letter when
running multiple instances of hackbench on AMD Milan has disappeared, and
an improvement in max latency for sysbench+MariaDB is observed on Milan:
transactions per sec.: -0.72%
min latency: -0.00%
avg latency: -0.00%
max latency: +78.90%
95th percentile: -0.00%
events avg: -0.72%
events stddev: +50.72%
Thanks,
Chenyu