Message-ID: <60eb87ce-1a1b-4eff-b179-25736e3c8e60@intel.com>
Date: Tue, 16 Dec 2025 15:40:00 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>, Tim Chen
<tim.c.chen@...ux.intel.com>
CC: Ingo Molnar <mingo@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Vincent Guittot
<vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>, "Dietmar
Eggemann" <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, "Valentin
Schneider" <vschneid@...hat.com>, Madadi Vineeth Reddy
<vineethr@...ux.ibm.com>, Hillf Danton <hdanton@...a.com>, Shrikanth Hegde
<sshegde@...ux.ibm.com>, Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen
<cyy@...self.name>, Tingyin Duan <tingyin.duan@...il.com>, Vern Hao
<vernhao@...cent.com>, Vern Hao <haoxing990@...il.com>, Len Brown
<len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Adam Li
<adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>, Tim Chen
<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 17/23] sched/cache: Record the number of active threads
per process for cache-aware scheduling
On 12/11/2025 12:51 AM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@...el.com>
>>
>> A performance regression was observed by Prateek when running hackbench
>> with many threads per process (high fd count). To avoid this, processes
>> with a large number of active threads are excluded from cache-aware
>> scheduling.
>>
>> With sched_cache enabled, record the number of active threads in each
>> process during the periodic task_cache_work(). While iterating over
>> CPUs, if the currently running task belongs to the same process as the
>> task that launched task_cache_work(), increment the active thread count.
>>
>> This number will be used by a subsequent patch to inhibit cache-aware
>> load balancing.
>>
>> Suggested-by: K Prateek Nayak <kprateek.nayak@....com>
>> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
>> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
>> ---
>>
>> Notes:
>> v1->v2: No change.
>>
>> include/linux/mm_types.h | 1 +
>> kernel/sched/fair.c | 11 +++++++++--
>> 2 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 1ea16ef90566..04743983de4d 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1043,6 +1043,7 @@ struct mm_struct {
>> raw_spinlock_t mm_sched_lock;
>> unsigned long mm_sched_epoch;
>> int mm_sched_cpu;
>> + u64 nr_running_avg ____cacheline_aligned_in_smp;
>
> This is unlikely to do what you hope it does, it will place this
> variable on a new cacheline, but will not ensure this variable is the
> only one in that line. Notably pgtables_bytes (the next field in this
> structure) will share the line.
>
> It might all be less dodgy if you stick these here fields in their own
> structure, a little like mm_mm_cid or so.
>
Got it, will do.
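To make the layout point concrete, here is a minimal userspace sketch, not the kernel change itself. The names struct mm_flat, struct mm_sched_stats, struct mm_grouped and the 64-byte CACHELINE are assumptions made for the example, and pgtables_bytes is only a stand-in field. Aligning a member fixes where it starts; aligning a wrapper type also pads its size, so the following field is pushed to the next line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64	/* assumed line size; the kernel uses SMP_CACHE_BYTES */

/* Layout as in the patch: only the start of nr_running_avg is aligned. */
struct mm_flat {
	int mm_sched_cpu;
	uint64_t nr_running_avg __attribute__((aligned(CACHELINE)));
	long pgtables_bytes;	/* stand-in for the next mm_struct field */
};

/* Hypothetical grouping: aligning the type pads its size to a full line. */
struct mm_sched_stats {
	uint64_t nr_running_avg;
} __attribute__((aligned(CACHELINE)));

struct mm_grouped {
	int mm_sched_cpu;
	struct mm_sched_stats stats;
	long pgtables_bytes;
};

/* Index of the cacheline that a member offset falls into. */
static inline size_t line_of(size_t offset)
{
	return offset / CACHELINE;
}

/* Nonzero if nr_running_avg and the next field land on the same line. */
static inline int flat_shares_line(void)
{
	return line_of(offsetof(struct mm_flat, nr_running_avg)) ==
	       line_of(offsetof(struct mm_flat, pgtables_bytes));
}

/* Same check for the grouped layout. */
static inline int grouped_shares_line(void)
{
	return line_of(offsetof(struct mm_grouped, stats)) ==
	       line_of(offsetof(struct mm_grouped, pgtables_bytes));
}
```

With these definitions, flat_shares_line() should report sharing and grouped_shares_line() should not, since sizeof(struct mm_sched_stats) is rounded up to CACHELINE.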
>> #endif
>>
>> #ifdef CONFIG_MMU
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 580a967efdac..2f38ad82688f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1421,11 +1421,11 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
>>
>> static void __no_profile task_cache_work(struct callback_head *work)
>> {
>> - struct task_struct *p = current;
>> + struct task_struct *p = current, *cur;
>> struct mm_struct *mm = p->mm;
>> unsigned long m_a_occ = 0;
>> unsigned long curr_m_a_occ = 0;
>> - int cpu, m_a_cpu = -1;
>> + int cpu, m_a_cpu = -1, nr_running = 0;
>> cpumask_var_t cpus;
>>
>> WARN_ON_ONCE(work != &p->cache_work);
>> @@ -1458,6 +1458,12 @@ static void __no_profile task_cache_work(struct callback_head *work)
>> m_occ = occ;
>> m_cpu = i;
>> }
>
> guard(rcu)();
>
OK.
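For readers less familiar with guard(), a rough userspace sketch of the scope-based pattern follows. It assumes GCC or Clang for the cleanup attribute; toy_read_lock(), toy_read_unlock(), toy_guard() and count_matches() are hypothetical stand-ins for illustration, not kernel APIs. The unlock runs whenever the guard variable goes out of scope, including on an early continue.

```c
#include <assert.h>

/* Toy stand-ins for rcu_read_lock()/rcu_read_unlock(): track nesting only. */
static int toy_depth;

static inline void toy_read_lock(void)   { toy_depth++; }
static inline void toy_read_unlock(void) { toy_depth--; }

/* Cleanup handler: called automatically when the guard variable leaves scope. */
static inline void toy_guard_exit(int *unused)
{
	(void)unused;
	toy_read_unlock();
}

/* Hypothetical macro mimicking the shape of guard(rcu)(). */
#define toy_guard() \
	int toy_guard_var __attribute__((cleanup(toy_guard_exit), unused)) = \
		(toy_read_lock(), 0)

/* Count entries equal to 'want'; the lock is dropped after every iteration. */
static inline int count_matches(const int *owner, int n, int want)
{
	int nr = 0;

	for (int i = 0; i < n; i++) {
		toy_guard();
		assert(toy_depth == 1);	/* held inside the loop body */
		if (owner[i] != want)
			continue;	/* early exit still unlocks */
		nr++;
	}
	return nr;
}
```

Note that with a guard placed in the loop body, the protected section extends to the end of each iteration rather than ending at an explicit unlock call.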
>> + rcu_read_lock();
>> + cur = rcu_dereference(cpu_rq(i)->curr);
>> + if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
>> + cur->mm == mm)
>> + nr_running++;
>> + rcu_read_unlock();
>> }
>>
>> /*
>> @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>> mm->mm_sched_cpu = m_a_cpu;
>> }
>>
>> + update_avg(&mm->nr_running_avg, nr_running);
>> free_cpumask_var(cpus);
>> }
>
> It's a wee bit weird to introduce nr_running_avg without its user. Makes
> it hard to see what's what.
OK, will introduce nr_running_avg together with its user.
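As background on what the stored value looks like, below is a small userspace sketch of the averaging step. It assumes update_avg() has the 1/8-weight form found in kernel/sched/sched.h; settle() is a hypothetical helper added only to drive the example.

```c
#include <stdint.h>

/* Assumed shape of the scheduler helper: move 1/8 of the way to the sample. */
static inline void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff / 8;	/* integer division truncates toward zero */
}

/* Hypothetical driver: feed the same sample 'rounds' times, return the result. */
static inline uint64_t settle(uint64_t avg, uint64_t sample, int rounds)
{
	while (rounds--)
		update_avg(&avg, sample);
	return avg;
}
```

Because the division truncates, a difference smaller than 8 contributes nothing in this form, so for small thread counts the average stops moving before it reaches the sample.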
thanks,
Chenyu