Message-ID: <CABk29Nt_LmXCDDZevgcNvStetVRM6L=p-6c+99jXaU=CpuSPvw@mail.gmail.com>
Date: Wed, 4 Feb 2026 15:16:41 -0800
From: Josh Don <joshdon@...gle.com>
To: Peter Zijlstra <peterz@...radead.org>, Zecheng Li <zli94@...u.edu>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, Rik van Riel <riel@...riel.com>, Chris Mason <clm@...com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, Xu Liu <xliuprof@...gle.com>,
Blake Jones <blakejones@...gle.com>, Nilay Vaish <nilayvaish@...gle.com>,
K Prateek Nayak <kprateek.nayak@....com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v8 0/3] sched/fair: Optimize cfs_rq and sched_entity
allocation for better data locality

On Wed, Jan 21, 2026 at 12:34 PM Zecheng Li <zli94@...u.edu> wrote:
>
> Hi all,
>
> This patch series improves CFS cache performance by co-allocating each
> task group's cfs_rq and sched_entity from the per-cpu allocator, which
> allows the pointer arrays in task_group to be replaced with a single
> per-cpu offset.
>
> Accessing cfs_rq and sched_entity instances incurs many cache misses,
> and this series aims to reduce them. There is one struct cfs_rq per
> CPU per task_group: every task_group instance (and the root runqueue)
> holds a cfs_rq for each CPU, and every cfs_rq except the root's has a
> corresponding struct sched_entity. Currently, cfs_rq and sched_entity
> instances are allocated from NUMA-local memory with kzalloc_node, and
> tg->cfs_rq and tg->se are arrays of pointers to them.
>
> Original memory layout:
>
> tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
> tg->se = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);
>
> +----+ +-----------------+
> | tg | ----> | cfs_rq pointers |
> +----+ +-----------------+
> | | |
> v v v
> cfs_rq cfs_rq cfs_rq
>
> +----+ +--------------------+
> | tg | ----> | sched_entity ptrs |
> +----+ +--------------------+
> | | |
> v v v
> se se se
>
> Layout after Optimization:
>
> +--------+ | CPU 0 | | CPU 1 | | CPU 2 |
> | tg | | percpu | | percpu | | percpu |
> | | ... ... ...
> | percpu | -> | cfs_rq | | cfs_rq | | cfs_rq |
> | offset | | se | | se | | se |
> +--------+ +--------+ +--------+ +--------+
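
This looks like a nice simplification. Just to check that I'm picturing
the new layout correctly, I imagine the allocation ends up roughly like
the below (rough sketch only; the struct/field names are made up by me,
not taken from the patch):

    struct cfs_rq_with_se {
            struct cfs_rq           cfs_rq;
            struct sched_entity     se;     /* entity on the parent rq */
    };

    /* in place of the two kcalloc()s plus per-CPU kzalloc_node()s: */
    int i;

    tg->cfs_rq_pcpu = alloc_percpu(struct cfs_rq_with_se);
    if (!tg->cfs_rq_pcpu)
            goto err;

    for_each_possible_cpu(i) {
            struct cfs_rq_with_se *combined = per_cpu_ptr(tg->cfs_rq_pcpu, i);

            init_cfs_rq(&combined->cfs_rq);
            init_entity_runnable_average(&combined->se);
    }
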
>
> The optimization includes two parts:
>
> 1) Co-allocate cfs_rq and sched_entity for non-root task groups.
>
> - This speeds up loading the sched_entity that represents a cfs_rq on
> its parent runqueue. Currently that load requires pointer chasing
> through cfs_rq->tg->se[cpu]; after co-location, the sched_entity
> fields can be reached with a simple offset computation from the
> cfs_rq.
>
> 2) Allocate the combined cfs_rq/se struct with the percpu allocator.
>
> - Hot-path accesses to cfs_rq instances mostly iterate over multiple
> task_groups for the same CPU. With the new percpu layout, these
> accesses can reuse the per-CPU base pointer, and the accessed data is
> more likely to be in the CPU cache than the per-task_group pointer
> arrays it replaces.
>
> - This optimization also saves the memory previously used for the
> arrays of pointers.
>
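
If I follow, the two points above then become something like this
(illustrative helper names, not the patch's actual code):

    /*
     * 1) Parent-entity lookup: instead of chasing cfs_rq->tg->se[cpu],
     *    the se of a non-root cfs_rq sits at a fixed offset from it.
     */
    static inline struct sched_entity *cfs_rq_se(struct cfs_rq *cfs_rq)
    {
            return &container_of(cfs_rq, struct cfs_rq_with_se, cfs_rq)->se;
    }

    /*
     * 2) Per-CPU lookup: walking many task_groups on the same CPU
     *    reuses the same per-CPU offset, rather than loading a
     *    different tg->cfs_rq[] pointer array for every group.
     */
    static inline struct cfs_rq *tg_cfs_rq(struct task_group *tg, int cpu)
    {
            return &per_cpu_ptr(tg->cfs_rq_pcpu, cpu)->cfs_rq;
    }

Is that roughly the shape of it?
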
> To measure the impact of the patch series, we construct a tree-shaped
> cgroup hierarchy, with “width” and “depth” parameters controlling the
> number of children per node and the depth of the tree. Each leaf
> cgroup runs a schbench workload and is given a quota of 80% of the
> total CPU capacity divided by the number of leaf cgroups (in other
> words, the target CPU load is 80%), so that the throttling paths are
> exercised. The bandwidth control period is set to 10ms. We run the
> benchmark on Intel and AMD machines; each machine has hundreds of
> hardware threads.
>
> Tests were conducted on 6.15.
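
Just to check the quota arithmetic here: with depth 3 / width 10 that
is (if I count right) 1000 leaf cgroups, so on e.g. a 256-CPU machine
(a number I'm making up, you only say "hundreds of threads") each leaf
would get roughly

    0.8 * 256 CPUs * 10ms / 1000 leaves ~= 2ms

of quota per 10ms period, i.e. each leaf is throttled after about 2ms
of runtime per period. Is that the setup?
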
>
> | Kernel LLC Misses | depth 3 width 10 | depth 5 width 4 |
> +-------------------+---------------------+---------------------+
> | AMD-orig | [2218.98, 2241.89]M | [2599.80, 2645.16]M |
> | AMD-opt | [1957.62, 1981.55]M | [2380.47, 2431.86]M |
> | Change | -11.69% | -8.248% |
> | Intel-orig | [1580.53, 1604.90]M | [2125.37, 2208.68]M |
> | Intel-opt | [1066.94, 1100.19]M | [1543.77, 1570.83]M |
> | Change | -31.96% | -28.13% |
>
> There's also a 25% improvement in kernel IPC on the AMD system. On
> Intel, the IPC improvement is only 3% despite the larger LLC miss
> reduction.

Peter, any thoughts on this? The results seem promising.

> Other workloads without CPU share limits, while also running in a cgroup
> hierarchy with O(1000) instances, show no obvious regression:
>
> sysbench, hackbench - lower is better; ebizzy - higher is better.
>
> workload | base | opt | metric
> ----------+-----------------------+-----------------------+------------
> sysbench | 63.55, [63.04, 64.05] | 64.36, [62.97, 65.75] | avg latency
> hackbench | 36.95, [35.45, 38.45] | 37.12, [35.81, 38.44] | time
> ebizzy | 610.7, [569.8, 651.6] | 613.5, [592.1, 635.0] | record/s

Zecheng, am I reading those benchmark stats wrong, or is the 'opt'
version slightly worse than 'base'?