Message-ID: <BL1PR11MB6003E8FFBFA2712F4E61FAB7965CA@BL1PR11MB6003.namprd11.prod.outlook.com>
Date: Tue, 22 Jul 2025 14:46:55 +0000
From: "Deng, Pan" <pan.deng@...el.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "Li,
Tianyou" <tianyou.li@...el.com>, "tim.c.chen@...ux.intel.com"
<tim.c.chen@...ux.intel.com>, "peterz@...radead.org" <peterz@...radead.org>,
"mingo@...nel.org" <mingo@...nel.org>
Subject: RE: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node
to reduce contention
> -----Original Message-----
> From: Chen, Yu C <yu.c.chen@...el.com>
> Sent: Monday, July 21, 2025 7:24 PM
> To: Deng, Pan <pan.deng@...el.com>
> Cc: linux-kernel@...r.kernel.org; Li, Tianyou <tianyou.li@...el.com>;
> tim.c.chen@...ux.intel.com; peterz@...radead.org; mingo@...nel.org
> Subject: Re: [PATCH 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA
> node to reduce contention
>
> On 7/7/2025 10:35 AM, Pan Deng wrote:
> > When running a multi-instance FFmpeg workload on an HCC system,
> > significant contention is observed on the bitmap of `cpupri_vec->cpumask`.
> >
> > The SUT is a 2-socket machine with 240 physical cores and 480 logical
> > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical
> > cores (8 logical CPUs) for transcoding tasks. Sub-threads use RT
> > priority 99 with FIFO scheduling. FPS is used as the score.
> >
> > The perf c2c tool reveals, for the cpumask (bitmap) cache line of
> > `cpupri_vec->mask`:
> > - bits are loaded during cpupri_find()
> > - bits are stored during cpupri_set()
> > - cycles per load: ~2.2K to 8.7K
> >
> > This change splits `cpupri_vec->cpumask` into per-NUMA-node data to
> > mitigate false sharing.
> >
> > As a result:
> > - FPS improves by ~3.8%
> > - Kernel cycles% drops from ~20% to ~18.7%
> > - Cache line contention is mitigated: perf c2c shows cycles per load
> >   drop from ~2.2K-8.7K to ~0.5K-2.2K
> >
>
> This brings a noticeable improvement for the RT workload, and it would
> be even more convincing if we could also try normal task workloads, to
> at least confirm there is no regression (schbench/hackbench, etc.).
>
Thanks Yu, hackbench and schbench data will be provided later.
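
For reference, the structural change is roughly the sketch below. This is
a simplified illustration: the per-node array layout and the helper names
(cpupri_vec_set_cpu / cpupri_vec_collect) are mine for exposition, not
necessarily what the patch itself uses.

struct cpupri_vec {
        atomic_t        count;
        /*
         * One bitmap per NUMA node instead of a single shared bitmap.
         * Each node's mask is a separate allocation, so a store from
         * cpupri_set() on one node no longer invalidates the cache
         * line that cpupri_find() loads on another node.
         */
        cpumask_var_t   node_mask[MAX_NUMNODES];
};

/* set path: only the CPU's own node bitmap is written */
static inline void cpupri_vec_set_cpu(struct cpupri_vec *vec, int cpu)
{
        cpumask_set_cpu(cpu, vec->node_mask[cpu_to_node(cpu)]);
}

/* find path: OR the per-node bitmaps instead of loading one hot line */
static inline void cpupri_vec_collect(struct cpupri_vec *vec,
                                      struct cpumask *out)
{
        int node;

        cpumask_clear(out);
        for_each_node(node)
                cpumask_or(out, out, vec->node_mask[node]);
}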
> thanks,
> Chenyu
>
> > Note: CONFIG_CPUMASK_OFFSTACK=n remains unchanged.
> >
>
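As for the sanity runs asked for above, typical invocations would look
like the following (illustrative parameters only; exact flags vary by
benchmark version, and the actual runs will be posted with the data):

        hackbench -g 16 -l 20000    # fork/messaging scheduler stress
        schbench -m 2 -t 16 -r 30   # wakeup-latency benchmark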