Message-ID: <d6a19a7d-e435-4a27-9725-b2fb802a52fd@intel.com>
Date: Tue, 2 Sep 2025 14:14:14 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Tingyin Duan <tingyin.duan@...il.com>
CC: <aubrey.li@...el.com>, <bsegall@...gle.com>, <cyy@...self.name>,
<dietmar.eggemann@....com>, <gautham.shenoy@....com>, <hdanton@...a.com>,
<jianyong.wu@...look.com>, <juri.lelli@...hat.com>, <kprateek.nayak@....com>,
<len.brown@...el.com>, <libo.chen@...cle.com>,
<linux-kernel@...r.kernel.org>, <mgorman@...e.de>, <mingo@...hat.com>,
<peterz@...radead.org>, <rostedt@...dmis.org>, <sshegde@...ux.ibm.com>,
<tim.c.chen@...ux.intel.com>, <vernhao@...cent.com>,
<vincent.guittot@...aro.org>, <vineethr@...ux.ibm.com>,
<vschneid@...hat.com>, <yu.chen.surf@...il.com>, <zhao1.liu@...el.com>
Subject: Re: [RFC PATCH v4 25/28] sched: Skip cache aware scheduling if the
process has many active threads
On 9/2/2025 1:16 PM, Tingyin Duan wrote:
> Several different test cases with mysql and sysbench show that this patch
> causes about a 10% performance regression on my computer with 256 cores.
> Perf-top shows exceed_llc_nr is high. Could you help address this problem?
Thanks for bringing this up for public discussion. As we synced offline, the
performance regression is likely caused by cache contention introduced by the
[25/28] patch:
The 1st issue: multiple threads within the same process read
mm_struct->nr_running_avg while task_cache_work() modifies
mm_struct->mm_sched_cpu from time to time. Since mm_sched_cpu and
nr_running_avg sit in the same cacheline, each update turns the cacheline
into the "Modified" state, and the subsequent reads trigger costly "HITM"
events. We should move nr_running_avg and mm_sched_cpu to different
cachelines.
The 2nd issue: if nr_running_avg remains consistently above the threshold
in your test case, exceed_llc_nr() always returns true. This can cause
frequent writes of -1 to mm->mm_sched_cpu even when it is already -1,
which creates further cache contention for threads on other CPUs trying
to read mm->mm_sched_cpu. We should update the mm_struct's mm_sched_cpu
field only when the value has actually changed.
That is to say, the following patch should fix the regression:
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2cca039d6e4f..3c1c50134647 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1032,7 +1032,11 @@ struct mm_struct {
 	raw_spinlock_t mm_sched_lock;
 	unsigned long mm_sched_epoch;
 	int mm_sched_cpu;
-	u64 nr_running_avg;
+	/*
+	 * mm_sched_cpu and nr_running_avg are put into separate
+	 * cachelines to avoid cache contention.
+	 */
+	u64 nr_running_avg ____cacheline_aligned_in_smp;
 #endif
#ifdef CONFIG_MMU
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 026013c826d9..4ef28db57a37 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1428,7 +1428,8 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
 	    get_nr_threads(p) <= 1 ||
 	    exceed_llc_nr(mm, cpu_of(rq)) ||
 	    exceed_llc_capacity(mm, cpu_of(rq))) {
-		mm->mm_sched_cpu = -1;
+		if (mm->mm_sched_cpu != -1)
+			mm->mm_sched_cpu = -1;
 		pcpu_sched->occ = 0;
 	}
--
2.25.1
With the above change, the regression I mentioned in the cover letter when
running multiple instances of hackbench on AMD Milan has disappeared, and
an improvement in max latency for sysbench+MariaDB is observed on Milan:
transactions per sec.: -0.72%
min latency: -0.00%
avg latency: -0.00%
max latency: +78.90%
95th percentile: -0.00%
events avg: -0.72%
events stddev: +50.72%
Thanks,
Chenyu