linux-kernel - RE: [PATCH] sched/rt: optimize cpupri

Message-ID: <BL1PR11MB60032009A9BE7D802782A1139674A@BL1PR11MB6003.namprd11.prod.outlook.com>
Date: Thu, 12 Jun 2025 03:11:04 +0000
From: "Deng, Pan" <pan.deng@...el.com>
To: "Deng, Pan" <pan.deng@...el.com>, "peterz@...radead.org"
	<peterz@...radead.org>, "mingo@...nel.org" <mingo@...nel.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Li,
 Tianyou" <tianyou.li@...el.com>, "tim.c.chen@...ux.intel.com"
	<tim.c.chen@...ux.intel.com>
Subject: RE: [PATCH] sched/rt: optimize cpupri_vec layout

As an alternative, with a slightly more complicated change, we can separate
the counts and masks into two vectors inlined in struct cpupri (counts[] and
masks[]) and add two paddings:
1. Between counts[0] and counts[1], since counts[0] is updated more frequently
than the others: it changes whenever an rt task is enqueued on an empty runq
or dequeued from a non-overloaded runq.
2. Between the two vectors, since counts[] is read-write while masks[] is only
read when it stores pointers.
The alternative approach costs 31+/21- LoC of changes and achieves the same
performance as the simple patch. At the same time, it reduces the size of
struct cpupri from 26 cache lines to 21 cache lines.
The alternative patch is also prepared and can be sent out if you are interested.
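
For illustration only, here is a rough userspace model of the alternative
layout. It is not the prepared patch. The type and field names are
hypothetical stand-ins, and it assumes 64-byte cache lines, 101 priority
levels, and pointer-sized masks (8 bytes). Under those assumptions the model
occupies 21 cache lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */
#define NR_PRIO    101  /* assumption: number of cpupri priority levels */

/* Hypothetical stand-ins for the kernel types, for size modelling only. */
typedef struct { int32_t counter; } atomic_model_t;  /* models atomic_t */
typedef uint64_t cpumask_ptr_model_t;                /* models a pointer-sized mask */

struct cpupri_alt_model {
	/* counts[0] gets a cache line of its own. */
	atomic_model_t count0;
	/* Padding 1: counts[1..] start on the next cache line. */
	_Alignas(CACHE_LINE) atomic_model_t counts_rest[NR_PRIO - 1];
	/* Padding 2: the read-only masks start on a fresh cache line. */
	_Alignas(CACHE_LINE) cpumask_ptr_model_t masks[NR_PRIO];
	/* Models the trailing cpu_to_pri pointer. */
	uint64_t cpu_to_pri;
};

static_assert(offsetof(struct cpupri_alt_model, counts_rest) == CACHE_LINE,
	      "counts[1..] must not share a line with counts[0]");
static_assert(offsetof(struct cpupri_alt_model, masks) % CACHE_LINE == 0,
	      "masks[] must start on a cache line boundary");

/* Round a byte size up to whole cache lines. */
static size_t cache_lines(size_t bytes)
{
	return (bytes + CACHE_LINE - 1) / CACHE_LINE;
}

/* Size of the modelled struct in cache lines. */
size_t cpupri_alt_model_lines(void)
{
	return cache_lines(sizeof(struct cpupri_alt_model));
}
```

In this model, counts[0] takes 1 line, counts[1..100] take 7 lines, and
masks[] plus the trailing pointer take 13 lines.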

Best Regards
Pan

> -----Original Message-----
> From: Pan Deng <pan.deng@...el.com>
> Sent: Thursday, June 12, 2025 11:12 AM
> To: peterz@...radead.org; mingo@...nel.org
> Cc: linux-kernel@...r.kernel.org; Li, Tianyou <tianyou.li@...el.com>;
> tim.c.chen@...ux.intel.com; Deng, Pan <pan.deng@...el.com>
> Subject: [PATCH] sched/rt: optimize cpupri_vec layout
> 
> When running a multi-instance ffmpeg transcoding workload that uses rt
> threads on a high core count system, updates to cpupri_vec->count contend
> with reads of mask in the same cache line in cpupri_find_fitness() and
> cpupri_set().
> This change places the count and mask of each vector in different cache
> lines, using the cache-aligned attribute, to avoid the false sharing.
> Tested on a 2-socket machine with 240 physical cores and 480 logical cores,
> running 60 ffmpeg transcoding instances. With the change, kernel cycles% is
> reduced from ~20% to ~12%, and the fps metric improves by ~11%.
> The side effect of this change is that the size of struct cpupri increases
> from 26 cache lines to 203 cache lines.
> 
> Signed-off-by: Pan Deng <pan.deng@...el.com>
> Signed-off-by: Tianyou Li <tianyou.li@...el.com>
> Reviewed-by: Tim Chen <tim.c.chen@...ux.intel.com>
> ---
>  kernel/sched/cpupri.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index d6cba0020064..245b0fa626be 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -9,7 +9,7 @@
> 
>  struct cpupri_vec {
>  	atomic_t		count;
> -	cpumask_var_t		mask;
> +	cpumask_var_t		mask	____cacheline_aligned;
>  };
> 
>  struct cpupri {
> --
> 2.43.5
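
The size figures quoted in the patch description can be sanity-checked with a
small userspace model. This is only a sketch, not kernel code. The struct and
function names are hypothetical, and it assumes 64-byte cache lines, 101
priority levels, a 4-byte count, and an 8-byte pointer-sized mask, followed by
one trailing 8-byte pointer in struct cpupri.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */
#define NR_PRIO    101  /* assumption: number of cpupri priority levels */

/* Model of cpupri_vec before the patch: count and mask share a line. */
struct vec_before_model {
	int32_t count;              /* models atomic_t */
	_Alignas(8) uint64_t mask;  /* models a pointer-sized cpumask_var_t */
};

/* Model of cpupri_vec after the patch: mask moves to its own line. */
struct vec_after_model {
	int32_t count;
	_Alignas(CACHE_LINE) uint64_t mask;
};

struct cpupri_before_model {
	struct vec_before_model pri_to_cpu[NR_PRIO];
	uint64_t cpu_to_pri;        /* models the trailing pointer */
};

struct cpupri_after_model {
	struct vec_after_model pri_to_cpu[NR_PRIO];
	uint64_t cpu_to_pri;
};

/* Each patched vector spans two cache lines in this model. */
static_assert(sizeof(struct vec_after_model) == 2 * CACHE_LINE,
	      "count and mask should sit on separate cache lines");

/* Round a byte size up to whole cache lines. */
static size_t cache_lines(size_t bytes)
{
	return (bytes + CACHE_LINE - 1) / CACHE_LINE;
}

size_t cpupri_lines_before(void)
{
	return cache_lines(sizeof(struct cpupri_before_model));
}

size_t cpupri_lines_after(void)
{
	return cache_lines(sizeof(struct cpupri_after_model));
}
```

Under these assumptions the model gives 26 cache lines before the change
(101 * 16 + 8 bytes, rounded up) and 203 after (101 * 2 lines plus one line
for the trailing pointer), matching the numbers in the description.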