Message-ID: <20251210165119.GY3707891@noisy.programming.kicks-ass.net>
Date: Wed, 10 Dec 2025 17:51:19 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Tim Chen <tim.c.chen@...ux.intel.com>
Cc: Ingo Molnar <mingo@...hat.com>,
	K Prateek Nayak <kprateek.nayak@....com>,
	"Gautham R . Shenoy" <gautham.shenoy@....com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Chen Yu <yu.c.chen@...el.com>, Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
	Hillf Danton <hdanton@...a.com>,
	Shrikanth Hegde <sshegde@...ux.ibm.com>,
	Jianyong Wu <jianyong.wu@...look.com>,
	Yangyu Chen <cyy@...self.name>,
	Tingyin Duan <tingyin.duan@...il.com>,
	Vern Hao <vernhao@...cent.com>, Vern Hao <haoxing990@...il.com>,
	Len Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>,
	Zhao Liu <zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>,
	Adam Li <adamli@...amperecomputing.com>,
	Aaron Lu <ziqianlu@...edance.com>, Tim Chen <tim.c.chen@...el.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 17/23] sched/cache: Record the number of active
 threads per process for cache-aware scheduling

On Wed, Dec 03, 2025 at 03:07:36PM -0800, Tim Chen wrote:
> From: Chen Yu <yu.c.chen@...el.com>
> 
> A performance regression was observed by Prateek when running hackbench
> with many threads per process (high fd count). To avoid this, processes
> with a large number of active threads are excluded from cache-aware
> scheduling.
> 
> With sched_cache enabled, record the number of active threads in each
> process during the periodic task_cache_work(). While iterating over
> CPUs, if the currently running task belongs to the same process as the
> task that launched task_cache_work(), increment the active thread count.
> 
> This number will be used by a subsequent patch to inhibit cache-aware
> load balancing.
> 
> Suggested-by: K Prateek Nayak <kprateek.nayak@....com>
> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
> Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
> ---
> 
> Notes:
>     v1->v2: No change.
> 
>  include/linux/mm_types.h |  1 +
>  kernel/sched/fair.c      | 11 +++++++++--
>  2 files changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 1ea16ef90566..04743983de4d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1043,6 +1043,7 @@ struct mm_struct {
>  		raw_spinlock_t mm_sched_lock;
>  		unsigned long mm_sched_epoch;
>  		int mm_sched_cpu;
> +		u64 nr_running_avg ____cacheline_aligned_in_smp;

This is unlikely to do what you hope it does: it will place this
variable on a new cacheline, but it will not ensure this variable is the
only one in that line. Notably pgtables_bytes (the next field in this
structure) will share the line.

It might all be less dodgy if you stick these fields in their own
structure, a little like mm_mm_cid or so.
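
Something like the below, perhaps (a sketch only; the struct and field
names are illustrative, not from the patch). With the attribute on the
type, sizeof() rounds up to a cacheline multiple, so the whole group
gets its line(s) to itself and pgtables_bytes starts on a fresh one:

	struct mm_sched {
		raw_spinlock_t	lock;
		unsigned long	epoch;
		int		cpu;
		u64		nr_running_avg;
	} ____cacheline_aligned_in_smp;

	/* in mm_struct, replacing the loose mm_sched_* fields: */
	struct mm_sched mm_sched;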

>  #endif
>  
>  #ifdef CONFIG_MMU
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 580a967efdac..2f38ad82688f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1421,11 +1421,11 @@ static void task_tick_cache(struct rq *rq, struct task_struct *p)
>  
>  static void __no_profile task_cache_work(struct callback_head *work)
>  {
> -	struct task_struct *p = current;
> +	struct task_struct *p = current, *cur;
>  	struct mm_struct *mm = p->mm;
>  	unsigned long m_a_occ = 0;
>  	unsigned long curr_m_a_occ = 0;
> -	int cpu, m_a_cpu = -1;
> +	int cpu, m_a_cpu = -1, nr_running = 0;
>  	cpumask_var_t cpus;
>  
>  	WARN_ON_ONCE(work != &p->cache_work);
> @@ -1458,6 +1458,12 @@ static void __no_profile task_cache_work(struct callback_head *work)
>  					m_occ = occ;
>  					m_cpu = i;
>  				}

	guard(rcu)();

> +				rcu_read_lock();
> +				cur = rcu_dereference(cpu_rq(i)->curr);
> +				if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
> +				    cur->mm == mm)
> +					nr_running++;
> +				rcu_read_unlock();
>  			}
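
I.e., take RCU once around the whole CPU scan instead of a per-CPU
lock/unlock pair; roughly the below (a sketch, eliding the occupancy
accounting around it):

	guard(rcu)();	/* dropped automatically at end of scope */
	for_each_cpu(i, cpus) {
		struct task_struct *cur = rcu_dereference(cpu_rq(i)->curr);

		/* count CPUs currently running a thread of this process */
		if (cur && !(cur->flags & (PF_EXITING | PF_KTHREAD)) &&
		    cur->mm == mm)
			nr_running++;
	}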
>  
>  			/*
> @@ -1501,6 +1507,7 @@ static void __no_profile task_cache_work(struct callback_head *work)
>  		mm->mm_sched_cpu = m_a_cpu;
>  	}
>  
> +	update_avg(&mm->nr_running_avg, nr_running);
>  	free_cpumask_var(cpus);
>  }

It's a wee bit weird to introduce nr_running_avg without its user. Makes
it hard to see what's what.
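
(For context: update_avg() is the scheduler's existing running-average
helper, roughly the below, so nr_running_avg moves 1/8 of the way toward
each new sample per task_cache_work() run.)

	static void update_avg(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;
		*avg += diff / 8;
	}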
