Message-ID: <20250604195846.193159-1-zecheng@google.com>
Date: Wed, 4 Jun 2025 19:58:40 +0000
From: Zecheng Li <zecheng@...gle.com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>, Xu Liu <xliuprof@...gle.com>,
Blake Jones <blakejones@...gle.com>, Josh Don <joshdon@...gle.com>, linux-kernel@...r.kernel.org,
Zecheng Li <zecheng@...gle.com>
Subject: [RFC PATCH 0/3] sched/fair: Optimize cfs_rq and sched_entity
allocation for better data locality

Accessing cfs_rq and sched_entity instances incurs many cache misses;
this series aims to reduce them. A struct cfs_rq instance exists per
CPU and per task_group: each task_group instance (and the root
runqueue) holds one cfs_rq instance per CPU, and each cfs_rq instance
(except the root) has a corresponding struct sched_entity. Currently,
both cfs_rq and sched_entity instances are allocated from NUMA-local
memory with kzalloc_node(), and tg->cfs_rq and tg->se are arrays of
pointers to them.

Original memory layout:

  tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
  tg->se     = kcalloc(nr_cpu_ids, sizeof(se), GFP_KERNEL);

  +----+       +-----------------+
  | tg | ----> | cfs_rq pointers |
  +----+       +-----------------+
                  |      |      |
                  v      v      v
               cfs_rq cfs_rq cfs_rq

  +----+       +-------------------+
  | tg | ----> | sched_entity ptrs |
  +----+       +-------------------+
                   |     |     |
                   v     v     v
                  se    se    se
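
For context, the per-CPU objects behind these pointer arrays are
currently set up in alloc_fair_sched_group(), roughly as follows (a
simplified sketch; error handling and the init calls are omitted):

  for_each_possible_cpu(i) {
          cfs_rq = kzalloc_node(sizeof(*cfs_rq), GFP_KERNEL, cpu_to_node(i));
          se = kzalloc_node(sizeof(*se), GFP_KERNEL, cpu_to_node(i));

          tg->cfs_rq[i] = cfs_rq;   /* one NUMA-local heap object per CPU... */
          tg->se[i] = se;           /* ...reachable only via the pointer array */
  }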

Layout after optimization:

  +--------+    | CPU 0  |   | CPU 1  |   | CPU 2  |
  |   tg   |    | percpu |   | percpu |   | percpu |
  |        |       ...          ...          ...
  | percpu | -> | cfs_rq |   | cfs_rq |   | cfs_rq |
  | offset |    |   se   |   |   se   |   |   se   |
  +--------+    +--------+   +--------+   +--------+

The optimization includes two parts:

1) Embed sched_entity into cfs_rq.

- This speeds up loading the sched_entity of the parent runqueue,
  which currently requires pointer chasing through cfs_rq->tg->se[cpu].
  After embedding, the sched_entity fields can be reached with a simple
  offset computation (see the sketch below). As a tradeoff, the root
  task_group now also has to allocate memory for its sched_entity
  instances, which was not needed before. In the worst case this adds
  #CPU * sizeof(sched_entity) of RAM usage, which is small.

2) Allocate cfs_rq using the percpu allocator.

- Accesses to cfs_rq instances in hot paths mostly iterate through
  multiple task_groups for the same CPU. With the percpu layout, these
  accesses reuse the per-CPU base offset, which is more likely to stay
  in the CPU cache than the per-task_group pointer arrays.
- This also reduces the memory needed for the arrays of pointers.
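
To make the intended layout concrete, here is a minimal sketch of the
resulting data structures and access path. The field names and the
tg_se() helper are illustrative assumptions for this cover letter, not
necessarily the exact definitions used in the patches:

  /*
   * Illustrative sketch only; names are assumptions for exposition.
   */
  struct cfs_rq {
          /* ... existing cfs_rq fields ... */
          struct sched_entity se;         /* 1) entity embedded in its cfs_rq */
  };

  struct task_group {
          /* ... */
          struct cfs_rq __percpu *cfs_rq; /* 2) allocated with alloc_percpu() */
          /* the per-CPU tg->se pointer array becomes unnecessary */
  };

  /* Hot-path access: one percpu offset instead of chasing tg->se[cpu] */
  static inline struct sched_entity *tg_se(struct task_group *tg, int cpu)
  {
          return &per_cpu_ptr(tg->cfs_rq, cpu)->se;
  }

With this shape, walking the task_group hierarchy for one CPU touches
objects reachable from the same percpu offset, which is what the LLC
miss numbers below try to quantify.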

To measure the impact of the patch series, we construct a tree-shaped
hierarchy of cgroups, with “width” and “depth” parameters controlling
the number of children per node and the depth of the tree. Each leaf
cgroup runs a schbench workload and is given 80% of the total CPU
quota divided by the number of leaf cgroups (in other words, the
target CPU load is set to 80%) so that the throttling functions are
exercised. The bandwidth control period is set to 10ms. We run the
benchmark on Intel and AMD machines, each with hundreds of hardware
threads. Tests were conducted on kernel 6.15.
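
As a purely hypothetical illustration of that quota setup (the numbers
here are not taken from the tested machines): with 128 CPUs, 100 leaf
cgroups and a 10ms period, each leaf cgroup would receive
0.8 * 128 * 10ms / 100 = 10.24ms of quota per period.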

| Kernel LLC Misses | depth 3 width 10    | depth 5 width 4     |
+-------------------+---------------------+---------------------+
| AMD-orig          | [2218.98, 2241.89]M | [2599.80, 2645.16]M |
| AMD-opt           | [1957.62, 1981.55]M | [2380.47, 2431.86]M |
| Change            | -11.69%             | -8.248%             |
| Intel-orig        | [1580.53, 1604.90]M | [2125.37, 2208.68]M |
| Intel-opt         | [1066.94, 1100.19]M | [1543.77, 1570.83]M |
| Change            | -31.96%             | -28.13%             |

There is also a 25% improvement in kernel IPC on the AMD system. On
Intel, the improvement is 3%, despite the larger reduction in LLC
misses.

Other workloads without CPU share limits, also running in a cgroup
hierarchy with O(1000) instances, show no obvious regression
(sysbench and hackbench: lower is better; ebizzy: higher is better):

workload  | base                  | opt                   | metric
----------+-----------------------+-----------------------+------------
sysbench  | 63.55, [63.04, 64.05] | 64.36, [62.97, 65.75] | avg latency
hackbench | 36.95, [35.45, 38.45] | 37.12, [35.81, 38.44] | time
ebizzy    | 610.7, [569.8, 651.6] | 613.5, [592.1, 635.0] | records/s

Zecheng Li (3):
  sched/fair: Embed sched_entity into cfs_rq
  sched/fair: Remove task_group->se pointer
  sched/fair: Allocate cfs_rq structs per-cpu

 kernel/sched/core.c  | 40 ++++++++-------------
 kernel/sched/debug.c |  2 +-
 kernel/sched/fair.c  | 83 ++++++++++++++++----------------------------
 kernel/sched/sched.h | 40 ++++++++++++++++-----
 4 files changed, 76 insertions(+), 89 deletions(-)

base-commit: 0ff41df1cb268fc69e703a08a57ee14ae967d0ca
--
2.50.0