Message-ID: <BL1PR11MB6003FCF9DB3F0BFD72B0F70E96FCA@BL1PR11MB6003.namprd11.prod.outlook.com>
Date: Mon, 27 Oct 2025 05:38:08 +0000
From: "Deng, Pan" <pan.deng@...el.com>
To: "peterz@...radead.org" <peterz@...radead.org>, "mingo@...nel.org"
<mingo@...nel.org>, "rostedt@...dmis.org" <rostedt@...dmis.org>,
"juri.lelli@...hat.com" <juri.lelli@...hat.com>, "vincent.guittot@...aro.org"
<vincent.guittot@...aro.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"dietmar.eggemann@....com" <dietmar.eggemann@....com>, "bsegall@...gle.com"
<bsegall@...gle.com>, "mgorman@...e.de" <mgorman@...e.de>,
"vschneid@...hat.com" <vschneid@...hat.com>, "Li, Tianyou"
<tianyou.li@...el.com>, "Chen, Yu C" <yu.c.chen@...el.com>,
"tim.c.chen@...ux.intel.com" <tim.c.chen@...ux.intel.com>
Subject: RE: [PATCH v3] sched/rt: Optimize cpupri_vec layout to mitigate cache
line contention
> -----Original Message-----
> From: Deng, Pan <pan.deng@...el.com>
> Sent: Thursday, September 4, 2025 10:46 AM
> To: peterz@...radead.org; mingo@...nel.org; juri.lelli@...hat.com;
> vincent.guittot@...aro.org; rostedt@...dmis.org
> Cc: linux-kernel@...r.kernel.org; dietmar.eggemann@....com;
> bsegall@...gle.com; mgorman@...e.de; vschneid@...hat.com; Li, Tianyou
> <tianyou.li@...el.com>; Chen, Yu C <yu.c.chen@...el.com>;
> tim.c.chen@...ux.intel.com; Deng, Pan <pan.deng@...el.com>
> Subject: [PATCH v3] sched/rt: Optimize cpupri_vec layout to mitigate cache line
> contention
>
> When running a multi-instance FFmpeg workload on an HCC (high core
> count) system, significant cache line contention is observed around
> `cpupri_vec->count` and `mask` in struct root_domain (definitions
> quoted below for reference).
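>
> For reference, the contended structures as currently defined in
> kernel/sched/cpupri.h (quoted here for context; CPUPRI_NR_PRIORITIES
> is MAX_RT_PRIO + 1, i.e. 101 vecs of 16 bytes each):
>
>   struct cpupri_vec {
>           atomic_t        count;
>           cpumask_var_t   mask;
>   };
>
>   struct cpupri {
>           struct cpupri_vec       pri_to_cpu[CPUPRI_NR_PRIORITIES];
>           int                     *cpu_to_pri;
>   };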
>
> The SUT is a 2-socket machine with 240 physical cores and 480 logical
> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads run at RT priority
> 99 with FIFO scheduling, and FPS is used as the score.
>
> The perf c2c tool reveals:
> root_domain cache line 3:
> - `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
> and contends with other fields, since counts[0] is updated more
> frequently than the others: it changes whenever an RT task enqueues
> onto an empty runqueue or dequeues from a non-overloaded runqueue
> (see the hot-path sketch below).
> - cycles per load: ~10K to 59K
>
> cpupri's last cache line:
> - `cpupri_vec->count` and `mask` contend with each other. The
> transcoding threads use RT priority 99, so the contention lands on
> the vec at the end of the structure.
> - cycles per load: ~1.5K to 10.5K
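>
> To see why `count` and `mask` sharing a cache line hurts, here is a
> simplified sketch of the writer-side hot path, condensed from
> cpupri_set() in kernel/sched/cpupri.c (error handling and the
> cpu_to_pri update omitted):
>
>   /* Runs whenever a runqueue's effective RT priority changes */
>   if (likely(newpri != CPUPRI_INVALID)) {
>           struct cpupri_vec *vec = &cp->pri_to_cpu[newpri];
>
>           cpumask_set_cpu(cpu, vec->mask);       /* store to mask */
>           smp_mb__before_atomic();
>           atomic_inc(&(vec)->count);             /* RMW on count  */
>   }
>   if (likely(oldpri != CPUPRI_INVALID)) {
>           struct cpupri_vec *vec = &cp->pri_to_cpu[oldpri];
>
>           atomic_dec(&(vec)->count);
>           smp_mb__after_atomic();
>           cpumask_clear_cpu(cpu, vec->mask);
>   }
>
> On the reader side, __cpupri_find() does atomic_read(&vec->count)
> followed by a scan of vec->mask, so when both fields share a line,
> every writer update invalidates the line the readers are polling.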
>
> This change mitigates the `cpupri_vec->count`/`mask` contention by
> separating each count and mask into different cache lines.
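>
> With mask forced onto its own cache-line boundary, the expected
> per-vec layout becomes roughly the following, assuming 64-byte cache
> lines (offsets are illustrative, not pahole output):
>
>   struct cpupri_vec {
>           atomic_t        count;          /* offset  0, own line */
>           /* ~60 bytes padding up to the next cache line         */
>           cpumask_var_t   mask;           /* offset 64, own line */
>           /* size: 128 bytes, i.e. 2 cache lines per vec         */
>   };
>
> The trade-off is size: struct cpupri grows from ~1.6KB to roughly
> 101 * 128 bytes (~12.7KB) with 64-byte lines.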
>
> As a result:
> - FPS improves by ~11%
> - Kernel cycles% drops from ~20% to ~11%
> - `count`/`mask` related cache line contention is mitigated: perf c2c
> shows root_domain cache line 3 `cycles per load` drops from ~10K-59K
> to ~0.5K-8K, and cpupri's last cache line no longer appears in the
> report.
> - stress-ng cyclic benchmark improves by ~31.4%, command:
> stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
> --timeout 30 --minimize --metrics
> - rt-tests/pi_stress improves by ~76.5%, command:
> rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))
> - SCHED_NORMAL tests hackbench and schbench are unaffected.
>
> Appendix:
> 1. Current layout of contended data structure:
> struct root_domain {
> ...
> struct irq_work rto_push_work; /* 120 32 */
> /* --- cacheline 2 boundary (128 bytes) was 24 bytes ago --- */
> raw_spinlock_t rto_lock; /* 152 4 */
> int rto_loop; /* 156 4 */
> int rto_cpu; /* 160 4 */
> atomic_t rto_loop_next; /* 164 4 */
> atomic_t rto_loop_start; /* 168 4 */
> /* XXX 4 bytes hole, try to pack */
> cpumask_var_t rto_mask; /* 176 8 */
> /* --- cacheline 3 boundary (192 bytes) was 8 bytes hence --- */
> struct cpupri cpupri; /* 184 1624 */
> /* --- cacheline 28 boundary (1792 bytes) was 16 bytes ago --- */
> struct perf_domain * pd; /* 1808 8 */
> /* size: 1816, cachelines: 29, members: 21 */
> /* sum members: 1802, holes: 3, sum holes: 14 */
> /* forced alignments: 1 */
> /* last cacheline: 24 bytes */
> } __attribute__((__aligned__(8)));
>
> 2. Perf c2c report of root_domain cache line 3:
> ------- ------- ------ ------ ------ ------ ------------------------
> Rmt Lcl Store Data Load Total Symbol
> Hitm% Hitm% L1 Hit% offset cycles records
> ------- ------- ------ ------ ------ ------ ------------------------
> 353 44 62 0xff14d42c400e3880
> ------- ------- ------ ------ ------ ------ ------------------------
> 0.00% 2.27% 0.00% 0x0 21683 6 __flush_smp_call_function_
> 0.00% 2.27% 0.00% 0x0 22294 5 __flush_smp_call_function_
> 0.28% 0.00% 0.00% 0x0 0 2 irq_work_queue_on
> 0.28% 0.00% 0.00% 0x0 27824 4 irq_work_single
> 0.00% 0.00% 1.61% 0x0 28151 6 irq_work_queue_on
> 0.57% 0.00% 0.00% 0x18 21822 8 native_queued_spin_lock_sl
> 0.28% 2.27% 0.00% 0x18 16101 10 native_queued_spin_lock_sl
> 0.57% 0.00% 0.00% 0x18 33199 5 native_queued_spin_lock_sl
> 0.00% 0.00% 1.61% 0x18 10908 32 _raw_spin_lock
> 0.00% 0.00% 1.61% 0x18 59770 2 _raw_spin_lock
> 0.00% 0.00% 1.61% 0x18 0 1 _raw_spin_unlock
> 1.42% 0.00% 0.00% 0x20 12918 20 pull_rt_task
> 0.85% 0.00% 25.81% 0x24 31123 199 pull_rt_task
> 0.85% 0.00% 3.23% 0x24 38218 24 pull_rt_task
> 0.57% 4.55% 19.35% 0x28 30558 207 pull_rt_task
> 0.28% 0.00% 0.00% 0x28 55504 10 pull_rt_task
> 18.70% 18.18% 0.00% 0x30 26438 291 dequeue_pushable_task
> 17.28% 22.73% 0.00% 0x30 29347 281 enqueue_pushable_task
> 1.70% 2.27% 0.00% 0x30 12819 31 enqueue_pushable_task
> 0.28% 0.00% 0.00% 0x30 17726 18 dequeue_pushable_task
> 34.56% 29.55% 0.00% 0x38 25509 527 cpupri_find_fitness
> 13.88% 11.36% 24.19% 0x38 30654 342 cpupri_set
> 3.12% 2.27% 0.00% 0x38 18093 39 cpupri_set
> 1.70% 0.00% 0.00% 0x38 37661 52 cpupri_find_fitness
> 1.42% 2.27% 19.35% 0x38 31110 211 cpupri_set
> 1.42% 0.00% 1.61% 0x38 45035 31 cpupri_set
>
> 3. Perf c2c report of cpupri's last cache line
> ------- ------- ------ ------ ------ ------ ------------------------
> Rmt Lcl Store Data Load Total Symbol
> Hitm% Hitm% L1 Hit% offset cycles records
> ------- ------- ------ ------ ------ ------ ------------------------
> 149 43 41 0xff14d42c400e3ec0
> ------- ------- ------ ------ ------ ------ ------------------------
> 8.72% 11.63% 0.00% 0x8 2001 165 cpupri_find_fitness
> 1.34% 2.33% 0.00% 0x18 1456 151 cpupri_find_fitness
> 8.72% 9.30% 58.54% 0x28 1744 263 cpupri_set
> 2.01% 4.65% 41.46% 0x28 1958 301 cpupri_set
> 1.34% 0.00% 0.00% 0x28 10580 6 cpupri_set
> 69.80% 67.44% 0.00% 0x30 1754 347 cpupri_set
> 8.05% 4.65% 0.00% 0x30 2144 256 cpupri_set
>
> Signed-off-by: Pan Deng <pan.deng@...el.com>
> Signed-off-by: Tianyou Li <tianyou.li@...el.com>
> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
> Reviewed-by: Tim Chen <tim.c.chen@...ux.intel.com>
> ---
> Changes since v1:
> - Use ____cacheline_aligned_in_smp instead of ____cacheline_aligned to
> avoid wasting memory on UP systems.
>
> kernel/sched/cpupri.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> index 6f562088c056..f6bb74517fd6 100644
> --- a/kernel/sched/cpupri.h
> +++ b/kernel/sched/cpupri.h
> @@ -12,7 +12,7 @@
>
> struct cpupri_vec {
> atomic_t count;
> - cpumask_var_t mask;
> + cpumask_var_t mask ____cacheline_aligned_in_smp;
> };
>
> struct cpupri {
> --
> 2.43.5
@peterz, @mingo, @rostedt,
could you please take a look at this patch? Any comments would be much
appreciated.

Thanks,
Pan