Message-ID: <dad0822a-5e0b-4518-a246-a3820787ed87@linux.ibm.com>
Date: Thu, 10 Apr 2025 13:49:31 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Zecheng Li <zecheng@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
        Xu Liu <xliuprof@...gle.com>, Blake Jones <blakejones@...gle.com>,
        Josh Don <joshdon@...gle.com>, linux-kernel@...r.kernel.org,
        Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [RFC PATCH 0/2] sched/fair: Reorder scheduling related structs to
 reduce cache misses

Hi Zecheng Li,

On 03/04/25 02:59, Zecheng Li wrote:
> Reorder the fields within the `struct cfs_rq` and `struct sched_entity`
> to improve cache locality. This can reduce cache misses to improve
> performance in CFS scheduling-related operations, particularly for
> servers with hundreds of cores and ~1000 cgroups.
> 
> The reordering is based on the kernel data-type profiling
> (https://lwn.net/Articles/955709/) indicating hot fields and fields
> that are frequently accessed together.

This patch is based on optimizations from reordering for systems with
64-byte cache lines. On systems with a 128-byte L1 D-cache line, like
Power10, this might or might not be beneficial. Moreover, a lot of space
(almost half) would be wasted on the cache line due to APIs like
`__cacheline_group_begin_aligned` and `__cacheline_group_end_aligned`
that may restrict the group size to 64 bytes.

Since this is in generic code, any ideas on how to make sure that
other architectures with a different cache line size don't suffer?
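
To make the line-size concern concrete, here is a small userspace sketch
(my own illustration, not the code from this patch): the rq_like_* structs
and the CACHE_LINE_SIZE macro are made-up stand-ins for the real cfs_rq
fields and the kernel's SMP_CACHE_BYTES. Building it with, say,
gcc -DCACHE_LINE_SIZE=128 shows how a hard-coded 64-byte group boundary
behaves on a 128-byte line compared to aligning the group to the actual
line size:

#include <stddef.h>
#include <stdio.h>

/* Assumed stand-in for the kernel's SMP_CACHE_BYTES. */
#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 128	/* e.g. Power10; 64 on x86 */
#endif

/* Variant A: the "hot" group is closed on a hard-coded 64-byte boundary. */
struct rq_like_64 {
	unsigned long	load;		/* hot fields meant to share a line */
	unsigned long	runnable;
	unsigned int	nr_running;
	/* next group starts at the next 64-byte boundary */
	unsigned long	cold_stats[8] __attribute__((aligned(64)));
};

/* Variant B: the group boundary follows the build's real line size. */
struct rq_like_arch {
	unsigned long	load;
	unsigned long	runnable;
	unsigned int	nr_running;
	unsigned long	cold_stats[8] __attribute__((aligned(CACHE_LINE_SIZE)));
};

int main(void)
{
	printf("64B grouping:  cold_stats at %zu, sizeof %zu\n",
	       offsetof(struct rq_like_64, cold_stats),
	       sizeof(struct rq_like_64));
	printf("arch grouping: cold_stats at %zu, sizeof %zu\n",
	       offsetof(struct rq_like_arch, cold_stats),
	       sizeof(struct rq_like_arch));
	/*
	 * With a 128-byte line, variant A puts cold_stats at offset 64,
	 * so the hot fields and the cold array can land on the same line
	 * and the ~44 bytes of padding buy no isolation. Variant B keeps
	 * the groups isolated, at the cost of a larger struct.
	 */
	return 0;
}

One possibility this hints at is aligning the group boundaries to
SMP_CACHE_BYTES rather than a literal 64, which would keep the hot group
on its own line on 128-byte machines, at the cost of more padding there.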

[..snip..]

> 
> 
> | Kernel LLC Misses | d3 w10            | d5 w4             |
> +-------------------+-------------------+-------------------+
> | AMD-orig          | [3025.5, 3344.1]M | [3382.4, 3607.8]M |
> | AMD-opt           | [2410.7, 2556.9]M | [2565.4, 2931.2]M |
> | Change            | -22.01%           | -21.37%           |
> | Intel-orig        | [1157.2, 1249.0]M | [1343.7, 1630.7]M |
> | Intel-opt         | [960.2, 1023.0]M  | [1092.7, 1350.7]M |
> | Change            | -17.59%           | -17.86%           |
> 
> Since the benchmark limits the CPU quota, the RPS results reported by
> `schbench` did not show a statistically significant improvement, as
> they do not reflect the kernel overhead reduction.
> 
> Perf data shows the reduction in the LLC miss percentage within the
> kernel for the depth 5, width 4 workload. The symbols are taken from
> the union of the top 10 symbols in both the original and optimized
> profiles.
> 
> | Symbol                                | Intel-orig | Intel-opt |
> +---------------------------------------+------------+-----------+
> | worker_thread                         | 75.41%     | 78.95%    |
> | tg_unthrottle_up                      | 3.21%      | 1.61%     |
> | tg_throttle_down                      | 2.42%      | 1.77%     |
> | __update_load_avg_cfs_rq              | 1.95%      | 1.60%     |
> | walk_tg_tree_from                     | 1.23%      | 0.91%     |
> | sched_balance_update_blocked_averages | 1.09%      | 1.13%     |
> | sched_balance_rq                      | 1.03%      | 1.08%     |
> | _raw_spin_lock                        | 1.01%      | 1.23%     |
> | task_mm_cid_work                      | 0.87%      | 1.09%     |
> | __update_load_avg_se                  | 0.78%      | 0.48%     |
> 
> | Symbol                                | AMD-orig | AMD-opt |
> +---------------------------------------+----------+---------+
> | worker_thread                         | 53.97%   | 61.49%  |
> | sched_balance_update_blocked_averages | 3.94%    | 2.48%   |
> | __update_load_avg_cfs_rq              | 3.52%    | 2.62%   |
> | update_load_avg                       | 2.66%    | 2.19%   |
> | tg_throttle_down                      | 1.99%    | 1.57%   |
> | tg_unthrottle_up                      | 1.98%    | 1.34%   |
> | __update_load_avg_se                  | 1.89%    | 1.32%   |
> | walk_tg_tree_from                     | 1.79%    | 1.37%   |
> | sched_clock_noinstr                   | 1.59%    | 1.01%   |
> | sched_balance_rq                      | 1.53%    | 1.26%   |
> | _raw_spin_lock                        | 1.47%    | 1.41%   |
> | task_mm_cid_work                      | 1.34%    | 1.42%   |
> 
> The percentage of the LLC misses in the system is reduced.

Due to the reordering of the fields, there might be some workloads
that could take a hit. Maybe try running workloads of different
kinds (latency- and throughput-oriented) and make sure that any
regression is not high.

Thanks,
Madadi Vineeth Reddy

> 
> Zecheng Li (2):
>   sched/fair: Reorder struct cfs_rq
>   sched/fair: Reorder struct sched_entity
> 
>  include/linux/sched.h | 37 +++++++++++---------
>  kernel/sched/core.c   | 81 ++++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/sched.h  | 70 +++++++++++++++++++++++--------------
>  3 files changed, 144 insertions(+), 44 deletions(-)
> 
> 
> base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557

