Message-ID: <20250402212904.8866-1-zecheng@google.com>
Date: Wed,  2 Apr 2025 21:29:00 +0000
From: Zecheng Li <zecheng@...gle.com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, Xu Liu <xliuprof@...gle.com>, 
	Blake Jones <blakejones@...gle.com>, Josh Don <joshdon@...gle.com>, linux-kernel@...r.kernel.org, 
	Zecheng Li <zecheng@...gle.com>
Subject: [RFC PATCH 0/2] sched/fair: Reorder scheduling related structs to
 reduce cache misses

Reorder the fields within `struct cfs_rq` and `struct sched_entity` to
improve cache locality. This reduces cache misses and improves the
performance of CFS scheduling-related operations, particularly on
servers with hundreds of cores and ~1000 cgroups.

The reordering is based on kernel data-type profiling
(https://lwn.net/Articles/955709/), which identifies hot fields and
fields that are frequently accessed together.

The reordering aims to improve cache utilization and the performance of
scheduling-related functions, particularly `tg_throttle_down`,
`tg_unthrottle_up`, and `__update_load_avg_cfs_rq`. It mainly targets
the case where `CONFIG_FAIR_GROUP_SCHED` is enabled; when it is
disabled, there is no CFS bandwidth control and only a single `cfs_rq`
exists per CPU, so its layout would not significantly impact
performance.
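
As an illustration of the idea (a minimal userspace sketch with
hypothetical field names, not the actual layout applied in the
patches), fields that the hot paths access together are packed into the
leading cache line, while state that is only touched on
throttle/unthrottle or for debugging is pushed behind it:

  #include <stddef.h>
  #include <stdint.h>
  #include <stdio.h>

  /*
   * Hypothetical example only: keep fields accessed together on the
   * fast path at the front of the struct, keep cold fields at the
   * back, and align the object to a (typical) 64-byte cache line.
   */
  struct example_rq {
          /* hot: read/written on every enqueue/dequeue and load update */
          unsigned int nr_running;
          uint64_t     min_vruntime;
          uint64_t     load_weight;

          /* cold: touched only when bandwidth throttling kicks in */
          int          throttled;
          uint64_t     throttled_clock;
          uint64_t     debug_stats[8];
  } __attribute__((aligned(64)));

  int main(void)
  {
          /* hot fields land in the first cache line; cold ones follow */
          printf("nr_running at %zu, throttled at %zu, size %zu\n",
                 offsetof(struct example_rq, nr_running),
                 offsetof(struct example_rq, throttled),
                 sizeof(struct example_rq));
          return 0;
  }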

We use a benchmark with multiple cgroup levels to simulate real server
load. The benchmark constructs a tree-structured hierarchy of cgroups,
with "width" and "depth" parameters controlling the number of children
per node and the depth of the tree. Each leaf cgroup runs a `schbench`
workload and is given a quota of 80% of the total CPU time divided by
the number of leaf cgroups (in other words, the target CPU load is 80%)
to exercise the throttling functions. The bandwidth control period is
set to 10 ms. We run the benchmark on Intel and AMD machines; each
machine has hundreds of hardware threads.
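
As a rough sketch of the per-leaf quota computation (the CPU count
below is an assumption for illustration; the actual machines simply
have hundreds of hardware threads):

  #include <stdio.h>

  int main(void)
  {
          /* assumed machine size, for illustration only */
          const unsigned long long nr_cpus   = 256;
          const unsigned long long nr_leaves = 1000;   /* d3 w10 -> 10^3 leaves */
          const unsigned long long period_us = 10000;  /* 10 ms bandwidth period */

          /* 80% of the machine's CPU time per period, split evenly over leaves */
          unsigned long long quota_us = nr_cpus * period_us * 80 / 100 / nr_leaves;

          /* the value each leaf's cpu.max would be set to: "<quota> <period>" */
          printf("%llu %llu\n", quota_us, period_us);
          return 0;
  }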

The tables below show kernel LLC load misses over 30 seconds. "d3 w10"
(wider tree) means a cgroup hierarchy of depth 3 with 10 children per
level, totaling 1000 leaf cgroups; "d5 w4" represents a deeper tree
with more levels. Each benchmark is run 10 times; the table shows 95%
confidence intervals of the kernel LLC misses, in millions.


| Kernel LLC Misses | d3 w10            | d5 w4             |
+-------------------+-------------------+-------------------+
| AMD-orig          | [3025.5, 3344.1]M | [3382.4, 3607.8]M |
| AMD-opt           | [2410.7, 2556.9]M | [2565.4, 2931.2]M |
| Change            | -22.01%           | -21.37%           |
| Intel-orig        | [1157.2, 1249.0]M | [1343.7, 1630.7]M |
| Intel-opt         | [960.2, 1023.0]M  | [1092.7, 1350.7]M |
| Change            | -17.59%           | -17.86%           |
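
Assuming the Change rows are derived from the midpoints of the
confidence intervals (the table is consistent with this), for example
for AMD, d3 w10:

  #include <stdio.h>

  int main(void)
  {
          /* AMD d3 w10 interval endpoints from the table, in millions */
          double orig_mid = (3025.5 + 3344.1) / 2.0;
          double opt_mid  = (2410.7 + 2556.9) / 2.0;

          /* relative change of the midpoints; prints -22.01% */
          printf("%.2f%%\n", (opt_mid / orig_mid - 1.0) * 100.0);
          return 0;
  }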

Since the benchmark limits the CPU quota, the RPS reported by
`schbench` does not show a statistically significant improvement, as it
does not reflect the reduction in kernel overhead.

Perf data shows the reduction in the percentage of LLC misses within
the kernel for the depth-5, width-4 workload. The symbols listed are
the union of the top 10 symbols in the original and optimized profiles.

| Symbol                                | Intel-orig | Intel-opt |
+---------------------------------------+------------+-----------+
| worker_thread                         | 75.41%     | 78.95%    |
| tg_unthrottle_up                      | 3.21%      | 1.61%     |
| tg_throttle_down                      | 2.42%      | 1.77%     |
| __update_load_avg_cfs_rq              | 1.95%      | 1.60%     |
| walk_tg_tree_from                     | 1.23%      | 0.91%     |
| sched_balance_update_blocked_averages | 1.09%      | 1.13%     |
| sched_balance_rq                      | 1.03%      | 1.08%     |
| _raw_spin_lock                        | 1.01%      | 1.23%     |
| task_mm_cid_work                      | 0.87%      | 1.09%     |
| __update_load_avg_se                  | 0.78%      | 0.48%     |

| Symbol                                | AMD-orig | AMD-opt |
+---------------------------------------+----------+---------+
| worker_thread                         | 53.97%   | 61.49%  |
| sched_balance_update_blocked_averages | 3.94%    | 2.48%   |
| __update_load_avg_cfs_rq              | 3.52%    | 2.62%   |
| update_load_avg                       | 2.66%    | 2.19%   |
| tg_throttle_down                      | 1.99%    | 1.57%   |
| tg_unthrottle_up                      | 1.98%    | 1.34%   |
| __update_load_avg_se                  | 1.89%    | 1.32%   |
| walk_tg_tree_from                     | 1.79%    | 1.37%   |
| sched_clock_noinstr                   | 1.59%    | 1.01%   |
| sched_balance_rq                      | 1.53%    | 1.26%   |
| _raw_spin_lock                        | 1.47%    | 1.41%   |
| task_mm_cid_work                      | 1.34%    | 1.42%   |

The scheduling-related functions' share of LLC misses in the system is
reduced.

Zecheng Li (2):
  sched/fair: Reorder struct cfs_rq
  sched/fair: Reorder struct sched_entity

 include/linux/sched.h | 37 +++++++++++---------
 kernel/sched/core.c   | 81 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h  | 70 +++++++++++++++++++++++--------------
 3 files changed, 144 insertions(+), 44 deletions(-)


base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
-- 
2.49.0

