Message-ID: <60eb87ce-1a1b-4eff-b179-25736e3c8e60@intel.com>
Date: Tue, 16 Dec 2025 15:40:00 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>, Tim Chen
	<tim.c.chen@...ux.intel.com>
CC: Ingo Molnar <mingo@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>,
	"Gautham R . Shenoy" <gautham.shenoy@....com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Juri Lelli <juri.lelli@...hat.com>, "Dietmar
 Eggemann" <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, "Valentin
 Schneider" <vschneid@...hat.com>, Madadi Vineeth Reddy
	<vineethr@...ux.ibm.com>, Hillf Danton <hdanton@...a.com>, Shrikanth Hegde
	<sshegde@...ux.ibm.com>, Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen
	<cyy@...self.name>, Tingyin Duan <tingyin.duan@...il.com>, Vern Hao
	<vernhao@...cent.com>, Vern Hao <haoxing990@...il.com>, Len Brown
	<len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
	<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Adam Li
	<adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>, Tim Chen
	<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2 17/23] sched/cache: Record the number of active threads
 per process for cache-aware scheduling

On 12/11/2025 12:51 AM, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
>> From: Chen Yu <yu.c.chen@...el.com>
>>
>> A performance regression was observed by Prateek when running hackbench
>> with many threads per process (high fd count). To avoid this, processes
>> with a large number of active threads are excluded from cache-aware
>> scheduling.
>>
>> With sched_cache enabled, record the number of active threads in each
>> process during the periodic task_cache_work(). While iterating over
>> CPUs, if the currently running task belongs to the same process as the
>> task that launched task_cache_work(), increment the active thread count.
>>
>> This number will be used by a subsequent patch to inhibit cache-aware
>> load balancing.
>>
>> Suggested-by: K Prateek Nayak <kprateek.nayak@....com>
>> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
>> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
>> ---
>>
>> Notes:
>>      v1->v2: No change.
>>
>>   include/linux/mm_types.h |  1 +
>>   kernel/sched/fair.c      | 11 +++++++++--
>>   2 files changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 1ea16ef90566..04743983de4d 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1043,6 +1043,7 @@ struct mm_struct {
>>   		raw_spinlock_t mm_sched_lock;
>>   		unsigned long mm_sched_epoch;
>>   		int mm_sched_cpu;
>> +		u64 nr_running_avg ____cacheline_aligned_in_smp;
> 
> This is unlikely to do what you hope it does; it will place this
> variable on a new cacheline, but will not ensure this variable is the
> only one in that line. Notably pgtables_bytes (the next field in this
> structure) will share the line.
> 
> It might all be less dodgy if you stick these here fields in their own
> structure, a little like mm_mm_cid or so.
> 

Got it, will do.
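
For illustration, the layout issue can be reproduced in userspace. This is
only a sketch, not the kernel's mm_struct: the struct names, the 64-byte
line size, and the wrapper type sched_cache_stats are assumptions, and it
relies on the GCC/Clang aligned attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64	/* assumed line size; the kernel uses SMP_CACHE_BYTES */

/* Layout as in the patch: only the start of the field is aligned. */
struct mm_like_shared {
	int mm_sched_cpu;
	uint64_t nr_running_avg __attribute__((aligned(CACHELINE)));
	long pgtables_bytes;	/* lands in the same line as nr_running_avg */
};

/* Hypothetical wrapper: aligning the type also pads its size to a full line. */
struct sched_cache_stats {
	uint64_t nr_running_avg;
} __attribute__((aligned(CACHELINE)));

struct mm_like_isolated {
	int mm_sched_cpu;
	struct sched_cache_stats sc;
	long pgtables_bytes;	/* pushed onto the following line */
};

/* Index of the cacheline an offset falls into, relative to the struct base. */
static inline size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE;
}

/* Field alignment alone: the neighbour shares the line. */
static_assert(offsetof(struct mm_like_shared, nr_running_avg) / CACHELINE ==
	      offsetof(struct mm_like_shared, pgtables_bytes) / CACHELINE,
	      "neighbour shares the line");

/* Aligned wrapper type: the neighbour starts on the next line. */
static_assert(offsetof(struct mm_like_isolated, sc) / CACHELINE !=
	      offsetof(struct mm_like_isolated, pgtables_bytes) / CACHELINE,
	      "neighbour is on another line");
```

Aligning the wrapper type rather than the member rounds sizeof() up to the
alignment, which is what keeps the following field off the line.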

>>   #endif
>>   
>>   #ifdef CONFIG_MMU
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 580a967efdac..2f38ad82688f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1421,11 +1421,11 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
>>   
>>   static void __no_profile task_cache_work(struct callback_head *work)
>>   {
>> -	struct task_struct *p = current;
>> +	struct task_struct *p = current, *cur;
>>   	struct mm_struct *mm = p->mm;
>>   	unsigned long m_a_occ = 0;
>>   	unsigned long curr_m_a_occ = 0;
>> -	int cpu, m_a_cpu = -1;
>> +	int cpu, m_a_cpu = -1, nr_running = 0;
>>   	cpumask_var_t cpus;
>>   
>>   	WARN_ON_ONCE(work != &p->cache_work);
>> @@ -1458,6 +1458,12 @@ static void __no_profile task_cache_work(struct callback_head *work)
>>   					m_occ = occ;
>>   					m_cpu = i;
>>   				}
> 
> 	guard(rcu)();
> 

OK.
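
As a rough userspace sketch of what the guard form buys in the counting
loop: the lock is dropped automatically at the end of the scope, including
on early exits. Everything prefixed fake_/FAKE_ is a hypothetical stand-in
(a depth counter instead of RCU, a single skip flag instead of
PF_EXITING | PF_KTHREAD), and it uses the GCC/Clang cleanup attribute
directly rather than the kernel's guard() macro.

```c
#include <stddef.h>

static int fake_rcu_depth;	/* stand-in for the RCU read-side nesting depth */

static void fake_rcu_read_lock(void)   { fake_rcu_depth++; }
static void fake_rcu_read_unlock(void) { fake_rcu_depth--; }

/* Cleanup handler: runs when the guard variable goes out of scope. */
static void fake_rcu_guard_exit(int *unused)
{
	(void)unused;
	fake_rcu_read_unlock();
}

/* Take the lock now, release it automatically at the end of the scope. */
#define FAKE_RCU_GUARD() \
	int fake_guard __attribute__((cleanup(fake_rcu_guard_exit), unused)) = \
		(fake_rcu_read_lock(), 0)

#define FAKE_PF_SKIP 0x1	/* stand-in for PF_EXITING | PF_KTHREAD */

struct fake_task {
	const void *mm;
	unsigned int flags;
};

/* Count the per-CPU "current" tasks that share the given mm. */
static int count_matching(const struct fake_task *curr[], int nr_cpus,
			  const void *mm)
{
	int nr_running = 0;

	for (int i = 0; i < nr_cpus; i++) {
		FAKE_RCU_GUARD();
		const struct fake_task *cur = curr[i];

		if (!cur || (cur->flags & FAKE_PF_SKIP))
			continue;	/* the guard still unlocks on this path */
		if (cur->mm == mm)
			nr_running++;
	}	/* unlocked at the end of every iteration */

	return nr_running;
}
```

After count_matching() returns, fake_rcu_depth is back to its initial
value, with no explicit unlock call in the loop body.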

>> +				rcu_read_lock();
>> +				cur = rcu_dereference(cpu_rq(i)->curr);
>> +				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
>> +				    cur->mm == mm)
>> +					nr_running++;
>> +				rcu_read_unlock();
>>   			}
>>   
>>   			/*
>> @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>>   		mm->mm_sched_cpu = m_a_cpu;
>>   	}
>>   
>> +	update_avg(&mm->nr_running_avg, nr_running);
>>   	free_cpumask_var(cpus);
>>   }
> 
> It's a wee bit weird to introduce nr_running_avg without its user. Makes
> it hard to see what's what.

OK, will introduce nr_running_avg together with its user.
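
For readers following along, the averaging applied to nr_running can be
sketched in userspace as below. This assumes the scheduler's update_avg()
helper moves the average by one eighth of the difference on each sample,
which is my understanding of it; update_avg_sketch is a hypothetical name,
not the kernel function.

```c
#include <stdint.h>

/*
 * Assumed behaviour: move the average 1/8 of the way toward the new
 * sample. The difference is taken as signed so the average can also
 * decay; integer division truncates toward zero.
 */
static inline void update_avg_sketch(uint64_t *avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff / 8;
}
```

One consequence of the integer division in this sketch is that a sample
differing from the average by less than 8 does not move it at all.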

thanks,
Chenyu