Message-ID: <dad0822a-5e0b-4518-a246-a3820787ed87@linux.ibm.com>
Date: Thu, 10 Apr 2025 13:49:31 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Zecheng Li <zecheng@...gle.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Xu Liu <xliuprof@...gle.com>, Blake Jones <blakejones@...gle.com>,
Josh Don <joshdon@...gle.com>, linux-kernel@...r.kernel.org,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [RFC PATCH 0/2] sched/fair: Reorder scheduling related structs to
reduce cache misses
Hi Zecheng Li,
On 03/04/25 02:59, Zecheng Li wrote:
> Reorder the fields within the `struct cfs_rq` and `struct sched_entity`
> to improve cache locality. This can reduce cache misses to improve
> performance in CFS scheduling-related operations, particularly for
> servers with hundreds of cores and ~1000 cgroups.
>
> The reordering is based on kernel data-type profiling
> (https://lwn.net/Articles/955709/), which indicates hot fields and
> fields that are frequently accessed together.
This patch optimizes the field ordering for systems with 64-byte cache
lines. On systems with a 128-byte L1 D-cache line, like Power10, this
might or might not be beneficial. Moreover, a lot of space (almost half
of a cache line) could be wasted due to APIs like
`__cacheline_group_begin_aligned` and `__cacheline_group_end_aligned`
that may restrict the group size to 64 bytes.
Since this is generic code, any ideas on how to make sure that other
architectures with a different cache line size don't suffer?
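To make the concern concrete, here is a minimal sketch of the
cacheline-group pattern (the struct and field names below are
placeholders, not the layout proposed in the patch). The group
alignment itself follows SMP_CACHE_BYTES and is therefore
per-architecture, but a set of fields chosen to fill 64 bytes still
leaves roughly half of a 128-byte line unused:

#include <linux/cache.h>
#include <linux/rbtree.h>
#include <linux/types.h>

/* Illustrative only -- not the layout proposed in the patch. */
struct cfs_rq_sketch {
	/*
	 * __cacheline_group_begin_aligned() aligns the group to
	 * SMP_CACHE_BYTES (64 on x86, 128 on Power10) unless an
	 * explicit alignment is passed.
	 */
	__cacheline_group_begin_aligned(hot);
	unsigned int		nr_running;	/* hot: read on every pick */
	u64			min_vruntime;	/* hot: updated with the tree */
	struct rb_root_cached	tasks_timeline;	/* hot: enqueue/dequeue/pick */
	__cacheline_group_end_aligned(hot);

	/* colder fields (throttling, stats, ...) would follow here */
	u64			throttled_clock;
};

A CACHELINE_ASSERT_GROUP_SIZE() style check could at least make the
64-byte assumption explicit at build time, though that documents the
assumption rather than removing it.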
[..snip..]
>
>
> | Kernel LLC Misses | d3 w10            | d5 w4             |
> +-------------------+-------------------+-------------------+
> | AMD-orig          | [3025.5, 3344.1]M | [3382.4, 3607.8]M |
> | AMD-opt           | [2410.7, 2556.9]M | [2565.4, 2931.2]M |
> | Change            | -22.01%           | -21.37%           |
> | Intel-orig        | [1157.2, 1249.0]M | [1343.7, 1630.7]M |
> | Intel-opt         | [960.2, 1023.0]M  | [1092.7, 1350.7]M |
> | Change            | -17.59%           | -17.86%           |
>
> Since the benchmark limits the CPU quota, the RPS results reported by
> `schbench` did not show a statistically significant improvement, as
> RPS does not reflect the reduction in kernel overhead.
>
> Perf data shows the reduction in the percentage of LLC misses within
> the kernel for the depth 5, width 4 workload. The symbols are the
> union of the top 10 symbols in the original and optimized profiles.
>
> | Symbol                                | Intel-orig | Intel-opt |
> +---------------------------------------+------------+-----------+
> | worker_thread                         | 75.41%     | 78.95%    |
> | tg_unthrottle_up                      | 3.21%      | 1.61%     |
> | tg_throttle_down                      | 2.42%      | 1.77%     |
> | __update_load_avg_cfs_rq              | 1.95%      | 1.60%     |
> | walk_tg_tree_from                     | 1.23%      | 0.91%     |
> | sched_balance_update_blocked_averages | 1.09%      | 1.13%     |
> | sched_balance_rq                      | 1.03%      | 1.08%     |
> | _raw_spin_lock                        | 1.01%      | 1.23%     |
> | task_mm_cid_work                      | 0.87%      | 1.09%     |
> | __update_load_avg_se                  | 0.78%      | 0.48%     |
>
> | Symbol                                | AMD-orig | AMD-opt |
> +---------------------------------------+----------+---------+
> | worker_thread                         | 53.97%   | 61.49%  |
> | sched_balance_update_blocked_averages | 3.94%    | 2.48%   |
> | __update_load_avg_cfs_rq              | 3.52%    | 2.62%   |
> | update_load_avg                       | 2.66%    | 2.19%   |
> | tg_throttle_down                      | 1.99%    | 1.57%   |
> | tg_unthrottle_up                      | 1.98%    | 1.34%   |
> | __update_load_avg_se                  | 1.89%    | 1.32%   |
> | walk_tg_tree_from                     | 1.79%    | 1.37%   |
> | sched_clock_noinstr                   | 1.59%    | 1.01%   |
> | sched_balance_rq                      | 1.53%    | 1.26%   |
> | _raw_spin_lock                        | 1.47%    | 1.41%   |
> | task_mm_cid_work                      | 1.34%    | 1.42%   |
>
> The percentage of LLC misses in the system is reduced.
Due to the reordering of the fields, some workloads might take a hit.
Maybe try running workloads of different kinds (latency- and
throughput-oriented) and make sure that any regression is not high.
Thanks,
Madadi Vineeth Reddy
>
> Zecheng Li (2):
> sched/fair: Reorder struct cfs_rq
> sched/fair: Reorder struct sched_entity
>
> include/linux/sched.h | 37 +++++++++++---------
> kernel/sched/core.c | 81 ++++++++++++++++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 70 +++++++++++++++++++++++--------------
> 3 files changed, 144 insertions(+), 44 deletions(-)
>
>
> base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557