Message-ID: <20250602180544.3626909-1-zecheng@google.com>
Date: Mon,  2 Jun 2025 18:05:40 +0000
From: Zecheng Li <zecheng@...gle.com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, Xu Liu <xliuprof@...gle.com>, 
	Blake Jones <blakejones@...gle.com>, Josh Don <joshdon@...gle.com>, 
	Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, linux-kernel@...r.kernel.org, 
	Zecheng Li <zecheng@...gle.com>
Subject: [RFC PATCH v2 0/3] sched/fair: Reorder scheduling related structs to
 reduce cache misses

Reorder the fields within `struct cfs_rq` and `struct sched_entity` to
improve cache locality. Both structs are heavily accessed, but their
layouts have not been optimized for it. Reordering reduces cache
misses and improves performance in CFS scheduling-related operations,
particularly for servers with hundreds of cores and ~1000 cgroups.

The reordering is based on kernel data-type profiling
(https://lwn.net/Articles/955709/), which identifies hot fields and
fields that are frequently accessed together in real workloads.

This reordering aims to optimize cache utilization and improve the
performance of scheduling-related functions, particularly
`tg_throttle_down`, `tg_unthrottle_up`, `__update_load_avg_cfs_rq`, and
`sched_balance_update_blocked_averages`. The reordering mainly targets
the case where `CONFIG_FAIR_GROUP_SCHED` is enabled. When it is
disabled, there is no CFS bandwidth control and only a single `cfs_rq`
exists per CPU, so its layout has little impact on performance.
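
As an illustration of the technique (a hypothetical sketch, not the
actual `cfs_rq`/`sched_entity` layout from these patches): fields that
profiling shows are accessed together on the hot path are packed into
a cacheline-aligned group, while rarely touched fields move to the
tail of the struct:

#include <linux/cache.h>	/* ____cacheline_aligned */
#include <linux/list.h>
#include <linux/types.h>

struct example_entity;

/* Hypothetical struct, for illustration only. */
struct example_rq {
	/* hot: touched on every enqueue/dequeue and PELT update */
	unsigned long		load_weight;
	unsigned int		nr_queued;
	u64			last_update_time;
	struct example_entity	*curr;

	/* cold: init/teardown, throttling bookkeeping, debug */
	int			init_done;
	struct list_head	leaf_list;
} ____cacheline_aligned;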

We use a benchmark with multiple cgroup levels to simulate real server
load. The benchmark constructs a tree-structured cgroup hierarchy, with
“width” and “depth” parameters controlling the number of children per
node and the depth of the tree. Each leaf cgroup runs a `schbench`
workload and is given 80% of the total CPU quota divided by the number
of leaf cgroups (in other words, the target CPU load is 80%) in order
to exercise the throttling functions. The bandwidth control period is
set to 10ms. We run the benchmark on Intel and AMD machines, each with
hundreds of hardware threads.
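
For concreteness, here is a small user-space sketch of how the per-leaf
quota follows from these parameters, assuming the cgroup v2 `cpu.max`
"quota period" format and an example machine with 256 hardware threads
(the actual machines simply have hundreds of threads):

#include <stdio.h>

int main(void)
{
	const unsigned int nr_cpus = 256;	/* assumed example figure */
	const unsigned int period_us = 10000;	/* 10ms bandwidth period */
	const unsigned int depth = 3, width = 10;
	unsigned int nr_leaves = 1;
	unsigned int i;
	unsigned long quota_us;

	for (i = 0; i < depth; i++)
		nr_leaves *= width;		/* d3 w10 -> 1000 leaves */

	/* 80% of the machine's CPU time, split evenly across the leaves */
	quota_us = (unsigned long)nr_cpus * period_us * 8 / 10 / nr_leaves;

	/* prints "cpu.max per leaf: 2048 10000" for this configuration */
	printf("cpu.max per leaf: %lu %u\n", quota_us, period_us);
	return 0;
}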

Kernel LLC load misses were measured over 30 seconds. "d3 w10" (wider
tree) denotes a cgroup hierarchy of 3 levels with 10 children per
level, totaling 1000 leaf cgroups; "d5 w4" is a deeper tree with more
levels (1024 leaf cgroups). Each benchmark is run 10 times. The table
shows 95% confidence intervals of kernel LLC misses, in millions.

| Kernel LLC Misses | d3 w10            | d5 w4             |
+-------------------+-------------------+-------------------+
| AMD-orig          | [3025.5, 3344.1]M | [3382.4, 3607.8]M |
| AMD-opt           | [2410.7, 2556.9]M | [2565.4, 2931.2]M |
| Change            | -22.01%           | -21.37%           |
| Intel-orig        | [1157.2, 1249.0]M | [1343.7, 1630.7]M |
| Intel-opt         | [960.2, 1023.0]M  | [1092.7, 1350.7]M |
| Change            | -17.59%           | -17.86%           |

Since the benchmark caps the CPU quota, the RPS reported by `schbench`
shows no statistically significant improvement; throughput is limited
by the quota and therefore does not reflect the reduction in kernel
overhead.

Perf data shows the reduction in the kernel's share of LLC misses for
the depth 5, width 4 workload. The symbols listed are the union of the
top 10 symbols in the original and optimized profiles.

| Symbol                                | Intel-orig | Intel-opt |
+---------------------------------------+------------+-----------+
| worker_thread                         | 75.41%     | 78.95%    |
| tg_unthrottle_up                      | 3.21%      | 1.61%     |
| tg_throttle_down                      | 2.42%      | 1.77%     |
| __update_load_avg_cfs_rq              | 1.95%      | 1.60%     |
| walk_tg_tree_from                     | 1.23%      | 0.91%     |
| sched_balance_update_blocked_averages | 1.09%      | 1.13%     |
| sched_balance_rq                      | 1.03%      | 1.08%     |
| _raw_spin_lock                        | 1.01%      | 1.23%     |
| task_mm_cid_work                      | 0.87%      | 1.09%     |
| __update_load_avg_se                  | 0.78%      | 0.48%     |

| Symbol                                | AMD-orig | AMD-opt |
+---------------------------------------+----------+---------+
| worker_thread                         | 53.97%   | 61.49%  |
| sched_balance_update_blocked_averages | 3.94%    | 2.48%   |
| __update_load_avg_cfs_rq              | 3.52%    | 2.62%   |
| update_load_avg                       | 2.66%    | 2.19%   |
| tg_throttle_down                      | 1.99%    | 1.57%   |
| tg_unthrottle_up                      | 1.98%    | 1.34%   |
| __update_load_avg_se                  | 1.89%    | 1.32%   |
| walk_tg_tree_from                     | 1.79%    | 1.37%   |
| sched_clock_noinstr                   | 1.59%    | 1.01%   |
| sched_balance_rq                      | 1.53%    | 1.26%   |
| _raw_spin_lock                        | 1.47%    | 1.41%   |
| task_mm_cid_work                      | 1.34%    | 1.42%   |

Overall, the share of system LLC misses attributed to the scheduling
functions is reduced.

Other benchmarks (without CPU share limits) show no regression. They
were run in a cgroup hierarchy with the depth 5, width 4 setting (1024
instances), 10 runs each. perf profiles confirm that the targeted CFS
functions incur fewer LLC misses after reordering `cfs_rq` and
`sched_entity`.

For avg latency and time, lower is better; for record/s, higher is
better. Ranges are 95% CI.

           | base                 | opt                  | metric
-----------+----------------------+----------------------+-------------
 sysbench  | 45.05 [44.69, 45.42] | 44.51 [43.99, 45.04] | avg latency
 hackbench | 27.00 [26.39, 27.61] | 27.29 [26.38, 28.20] | time
 ebizzy    | 948.2 [848.1, 1048]  | 951.8 [840.8, 1063]  | record/s

Kernel LLC cache misses; run-to-run fluctuation is on the order of ±10%.

           | base   | opt
-----------+--------+--------
 sysbench  | 404M   | 351M
 hackbench | 3,294M | 3,169M
 ebizzy    | 2,149M | 1,956M

Although we see a noticeable reduction in LLC misses, the application
throughput improvement is negligible because only around 1% of cycles
in these benchmarks are spent accessing cfs_rq and sched_entity. The
reordering saves roughly 20% of those cycles, which is only a ~0.2%
direct saving. However, workloads with large cgroup hierarchies, where
these structures are accessed more heavily, stand to benefit the most.

v2 updates:

- Add macros to conditionally align a cache group, to avoid extra RAM
padding on architectures with cacheline sizes other than 64B (see the
sketch after this list).

- Add more benchmark results.
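
To illustrate the intent of the first patch, a rough sketch of
conditional cache-group alignment follows. It assumes the hot-field
groups are sized for 64-byte cachelines; the macro name and details
are illustrative, not the ones actually added to include/linux/cache.h:

#include <linux/cache.h>
#include <linux/types.h>

/*
 * Hypothetical sketch: force cacheline alignment of a field group only
 * when the cacheline is the assumed 64 bytes; on architectures with
 * larger cachelines, forcing the alignment would only add padding.
 */
#if SMP_CACHE_BYTES == 64
#define __example_group_aligned	____cacheline_aligned
#else
#define __example_group_aligned
#endif

struct example_avg {
	/* hot PELT-style fields, kept in one 64B group when possible */
	u64	last_update_time;
	u64	load_sum;
	u32	util_sum;
} __example_group_aligned;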

Zecheng Li (3):
  cache: conditionally align cache groups
  sched/fair: Reorder struct cfs_rq
  sched/fair: Reorder struct sched_entity

 include/linux/cache.h | 28 +++++++++++++++
 include/linux/sched.h | 39 +++++++++++----------
 kernel/sched/core.c   | 81 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h  | 81 +++++++++++++++++++++++++++++--------------
 4 files changed, 184 insertions(+), 45 deletions(-)


base-commit: a5806cd506af5a7c19bcd596e4708b5c464bfd21
-- 
2.49.0

