Message-ID: <BL1PR11MB60032009A9BE7D802782A1139674A@BL1PR11MB6003.namprd11.prod.outlook.com>
Date: Thu, 12 Jun 2025 03:11:04 +0000
From: "Deng, Pan" <pan.deng@...el.com>
To: "Deng, Pan" <pan.deng@...el.com>, "peterz@...radead.org"
<peterz@...radead.org>, "mingo@...nel.org" <mingo@...nel.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Li,
Tianyou" <tianyou.li@...el.com>, "tim.c.chen@...ux.intel.com"
<tim.c.chen@...ux.intel.com>
Subject: RE: [PATCH] sched/rt: optimize cpupri_vec layout
As an alternative, at the cost of a slightly more complicated change, we can
split the counts and masks into two vectors embedded in struct cpupri
(counts[] and masks[]) and add two paddings:
1. Between counts[0] and counts[1], since counts[0] is updated more frequently
than the others: it changes whenever an rt task is enqueued on an empty runq
or dequeued from a non-overloaded runq.
2. Between the two vectors, since counts[] is read-write while masks[] is only
read when it stores pointers.
The alternative approach is a larger change (31+/21- LoC), but it achieves the
same performance as the simple one, and at the same time it reduces the size
of struct cpupri from 26 cache lines to 21 cache lines.
The alternative patch is also prepared and can be sent out if you are interested.
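For illustration only, the three layouts can be modeled in userspace C as
below. This is a rough sketch, not the prepared patch: the struct names, the
count0/counts_rest split, CACHE_LINE = 64, and NR_PRIO = 101 (standing in for
CPUPRI_NR_PRIORITIES) are all assumptions, as is an LP64 target where the mask
is stored as an 8-byte pointer. Plain int and void * stand in for atomic_t and
cpumask_var_t.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */
#define NR_PRIO    101  /* assumption: stands in for CPUPRI_NR_PRIORITIES */

/* Current layout: count and mask pointer share a cache line. */
struct vec_base {
	int   count;   /* stands in for atomic_t */
	void *mask;    /* stands in for cpumask_var_t stored as a pointer */
};

struct cpupri_base {
	struct vec_base pri_to_cpu[NR_PRIO];
	int *cpu_to_pri;
};

/* Simple patch: the mask starts a new cache line in every vec. */
struct vec_aligned {
	int count;
	alignas(CACHE_LINE) void *mask;
};

struct cpupri_aligned {
	struct vec_aligned pri_to_cpu[NR_PRIO];
	int *cpu_to_pri;
};

/*
 * Alternative (hypothetical field names): two vectors, with padding after
 * counts[0] and between the counts and the masks.
 */
struct cpupri_split {
	alignas(CACHE_LINE) int   count0;                    /* counts[0], hot */
	alignas(CACHE_LINE) int   counts_rest[NR_PRIO - 1];  /* counts[1..] */
	alignas(CACHE_LINE) void *masks[NR_PRIO];            /* read-mostly */
	int *cpu_to_pri;
};

/* Number of cache lines needed to hold the given number of bytes. */
static size_t cache_lines(size_t bytes)
{
	return (bytes + CACHE_LINE - 1) / CACHE_LINE;
}

/* counts[0] must not share a line with the remaining counts. */
static_assert(offsetof(struct cpupri_split, counts_rest) == CACHE_LINE,
	      "counts[0] is expected to own the first cache line");
```

Under these assumptions, cache_lines(sizeof(...)) gives 26 for cpupri_base,
203 for cpupri_aligned and 21 for cpupri_split, which is consistent with the
sizes quoted in this thread.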
Best Regards
Pan
> -----Original Message-----
> From: Pan Deng <pan.deng@...el.com>
> Sent: Thursday, June 12, 2025 11:12 AM
> To: peterz@...radead.org; mingo@...nel.org
> Cc: linux-kernel@...r.kernel.org; Li, Tianyou <tianyou.li@...el.com>;
> tim.c.chen@...ux.intel.com; Deng, Pan <pan.deng@...el.com>
> Subject: [PATCH] sched/rt: optimize cpupri_vec layout
>
> When running a multi-instance ffmpeg transcoding workload which uses rt
> threads on a high core count system, updates of cpupri_vec->count contend
> with reads of mask in the same cache line in the functions
> cpupri_find_fitness and cpupri_set.
> This change separates each count and mask into different cache lines with the
> cache-aligned attribute to avoid the false sharing.
> Tested on a 2-socket machine with 240 physical cores and 480 logical cores,
> running 60 ffmpeg transcoding instances. With the change, the kernel cycles%
> is reduced from ~20% to ~12%, and the fps metric is improved by ~11%.
> The side effect of this change is that the size of struct cpupri increases
> from 26 cache lines to 203 cache lines.
>
> Signed-off-by: Pan Deng <pan.deng@...el.com>
> Signed-off-by: Tianyou Li <tianyou.li@...el.com>
> Reviewed-by: Tim Chen <tim.c.chen@...ux.intel.com>
> ---
> kernel/sched/cpupri.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index d6cba0020064..245b0fa626be 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -9,7 +9,7 @@
>
> struct cpupri_vec {
> atomic_t count;
> - cpumask_var_t mask;
> + cpumask_var_t mask ____cacheline_aligned;
> };
>
> struct cpupri {
> --
> 2.43.5