Message-ID: <cbacc6df3b197566e763739d96508aaabf01bfc0.camel@linux.intel.com>
Date: Wed, 10 Dec 2025 10:49:14 -0800
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...hat.com>, K Prateek Nayak
<kprateek.nayak@....com>, "Gautham R . Shenoy" <gautham.shenoy@....com>,
Vincent Guittot <vincent.guittot@...aro.org>, Juri Lelli
<juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel
Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, Hillf Danton
<hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>, Jianyong Wu
<jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>, Tingyin Duan
<tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>, Vern Hao
<haoxing990@...il.com>, Len Brown <len.brown@...el.com>, Aubrey Li
<aubrey.li@...el.com>, Zhao Liu <zhao1.liu@...el.com>, Chen Yu
<yu.chen.surf@...il.com>, Chen Yu <yu.c.chen@...el.com>, Adam Li
<adamli@...amperecomputing.com>, Aaron Lu <ziqianlu@...edance.com>, Tim
Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 07/23] sched/cache: Introduce per runqueue task LLC
preference counter
On Wed, 2025-12-10 at 13:51 +0100, Peter Zijlstra wrote:
> On Wed, Dec 03, 2025 at 03:07:26PM -0800, Tim Chen wrote:
>
> > +static int resize_llc_pref(void)
> > +{
> > + unsigned int *__percpu *tmp_llc_pref;
> > + int i, ret = 0;
> > +
> > + if (new_max_llcs <= max_llcs)
> > + return 0;
> > +
> > + /*
> > + * Allocate temp percpu pointer for old llc_pref,
> > + * which will be released after switching to the
> > + * new buffer.
> > + */
> > + tmp_llc_pref = alloc_percpu_noprof(unsigned int *);
> > + if (!tmp_llc_pref)
> > + return -ENOMEM;
> > +
> > + for_each_present_cpu(i)
> > + *per_cpu_ptr(tmp_llc_pref, i) = NULL;
> > +
> > + /*
> > + * Resize the per rq nr_pref_llc buffer and
> > + * switch to this new buffer.
> > + */
> > + for_each_present_cpu(i) {
> > + struct rq_flags rf;
> > + unsigned int *new;
> > + struct rq *rq;
> > +
> > + rq = cpu_rq(i);
> > + new = alloc_new_pref_llcs(rq->nr_pref_llc, per_cpu_ptr(tmp_llc_pref, i));
> > + if (!new) {
> > + ret = -ENOMEM;
> > +
> > + goto release_old;
> > + }
> > +
> > + /*
> > + * Locking rq ensures that rq->nr_pref_llc values
> > + * don't change with new task enqueue/dequeue
> > + * when we repopulate the newly enlarged array.
> > + */
> > + rq_lock_irqsave(rq, &rf);
> > + populate_new_pref_llcs(rq->nr_pref_llc, new);
> > + rq->nr_pref_llc = new;
> > + rq_unlock_irqrestore(rq, &rf);
> > + }
> > +
> > +release_old:
> > + /*
> > + * Load balance is done under rcu_lock.
> > + * Wait for load balance before and during resizing to
> > + * be done. They may refer to old nr_pref_llc[]
> > + * that hasn't been resized.
> > + */
> > + synchronize_rcu();
> > + for_each_present_cpu(i)
> > + kfree(*per_cpu_ptr(tmp_llc_pref, i));
> > +
> > + free_percpu(tmp_llc_pref);
> > +
> > + /* succeed and update */
> > + if (!ret)
> > + max_llcs = new_max_llcs;
> > +
> > + return ret;
> > +}
>
> Would it perhaps be easier to stick this thing in rq->sd rather than in
> rq->nr_pref_llc. That way it automagically switches with the 'new'
> domain. And then, with a bit of care, a single load-balance pass should
> see a consistent view (there should not be reloads of rq->sd -- which
> will be a bit of an audit I suppose).
We need nr_pref_llc information at the runqueue level because the load balancer
must identify which specific rq has the largest number of tasks that
prefer a given destination LLC. If we move the counter to the LLC’s sd
level, we would only know the aggregate number of tasks in the entire LLC
that prefer that destination—not which rq they reside on. Without per-rq
counts, we would not be able to select the correct source rq to pull tasks from.
The only way this could work at the LLC-sd level is if all CPUs within
the LLC shared a single runqueue, which is not the case today.
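To make the usage concrete, the per-rq counters are what allow a selection
like the sketch below (simplified, and find_src_rq_for_llc() is a made-up
helper name just for illustration, not something in the series):

	/*
	 * Hypothetical, simplified helper: among the candidate source CPUs,
	 * pick the rq with the largest count of tasks that prefer the
	 * destination LLC (dst_llc).
	 */
	static struct rq *find_src_rq_for_llc(const struct cpumask *src_cpus,
					      int dst_llc)
	{
		struct rq *busiest = NULL;
		unsigned int max_pref = 0;
		int cpu;

		for_each_cpu(cpu, src_cpus) {
			struct rq *rq = cpu_rq(cpu);
			unsigned int pref = READ_ONCE(rq->nr_pref_llc[dst_llc]);

			if (pref > max_pref) {
				max_pref = pref;
				busiest = rq;
			}
		}

		return busiest;
	}

An aggregate count at the LLC's sd level could tell us that such tasks exist
somewhere in the LLC, but not which rq to pull them from.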
Let me know if I've understood your comments correctly.
Tim