Message-ID: <09ba5932-a256-4cdd-94dc-4f2b6569c855@intel.com>
Date: Mon, 31 Mar 2025 14:25:32 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Hillf Danton <hdanton@...a.com>
CC: <vincent.guittot@...aro.org>, <linux-kernel@...r.kernel.org>,
<kprateek.nayak@....com>, <yu.chen.surf@...mail.com>, Peter Zijlstra
<peterz@...radead.org>
Subject: Re: [RFC][PATCH] sched: Cache aware load-balancing
On 3/27/2025 7:20 PM, Hillf Danton wrote:
> On Wed, Mar 26, 2025 at 11:25:53AM +0100, Peter Zijlstra wrote:
>> On Wed, Mar 26, 2025 at 10:38:41AM +0100, Peter Zijlstra wrote:
>>
>>> Nah, the saner thing to do is to preserve the topology averages and look
>>> at those instead of the per-cpu values.
>>>
>>> Eg. have task_cache_work() compute and store averages in the
>>> sched_domain structure and then use those.
>>
>> A little something like so perhaps ?
>>
> My $.02 followup with the assumption that l2 cache temperature can not
> make sense without comparing. Just for idea show.
>
> Hillf
>
> --- m/include/linux/sched.h
> +++ n/include/linux/sched.h
> @@ -1355,6 +1355,11 @@ struct task_struct {
> unsigned long numa_pages_migrated;
> #endif /* CONFIG_NUMA_BALANCING */
>
> +#ifdef CONFIG_SCHED_CACHE
> +#define LXC_SIZE 64 /* should be setup by parsing topology */
> + unsigned long lxc_temp[LXC_SIZE]; /* x > 1, l2 cache temperature for instance */
> +#endif
> +
> #ifdef CONFIG_RSEQ
> struct rseq __user *rseq;
> u32 rseq_len;
> --- m/kernel/sched/fair.c
> +++ n/kernel/sched/fair.c
> @@ -7953,6 +7953,22 @@ static int select_idle_sibling(struct ta
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> +#ifdef CONFIG_SCHED_CACHE
> + /*
> + * 2, lxc temp can not make sense without comparing
> + *
> + * target can be any cpu if lxc is cold
> + */
> + if ((unsigned int)prev_aff < nr_cpumask_bits)
> + if (p->lxc_temp[per_cpu(sd_share_id, (unsigned int)prev_aff)] >
> + p->lxc_temp[per_cpu(sd_share_id, target)])
> + target = prev_aff;
> + if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
> + if (p->lxc_temp[per_cpu(sd_share_id, (unsigned int)recent_used_cpu)] >
> + p->lxc_temp[per_cpu(sd_share_id, target)])
> + target = recent_used_cpu;
> + p->lxc_temp[per_cpu(sd_share_id, target)] += 1;
> +#else
> /*
> * For cluster machines which have lower sharing cache like L2 or
> * LLC Tag, we tend to find an idle CPU in the target's cluster
> @@ -7963,6 +7979,7 @@ static int select_idle_sibling(struct ta
> return prev_aff;
> if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
> return recent_used_cpu;
> +#endif
>
> return target;
> }
> @@ -13059,6 +13076,13 @@ static void task_tick_fair(struct rq *rq
> if (static_branch_unlikely(&sched_numa_balancing))
> task_tick_numa(rq, curr);
>
> +#ifdef CONFIG_SCHED_CACHE
> + /*
> + * 0, lxc is defined cold after 2-second nap
> + * 1, task migrate across NUMA node makes lxc cold
> + */
> + curr->lxc_temp[per_cpu(sd_share_id, rq->cpu)] += 5;
If lxc_temp is per task, this goes in a different direction: it tracks each
task's activity rather than the whole process's activity. I think the idea is
applicable for overwriting target with another CPU when the latter sits in a
cache-hot LLC, so that select_idle_cpu() can then search for an idle CPU
within that hot LLC.
thanks,
Chenyu