Message-ID: <tencent_6E51A3175F8AE0A7F684A319EE63CC56C806@qq.com>
Date: Thu, 19 Jun 2025 14:39:17 +0800
From: Yangyu Chen <cyy@...self.name>
To: Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Libo Chen <libo.chen@...cle.com>,
Abel Wu <wuyun.abel@...edance.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>,
Len Brown <len.brown@...el.com>,
linux-kernel@...r.kernel.org,
Chen Yu <yu.c.chen@...el.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
Nice work!
I've tested your patch based on commit fb4d33ab452e and found it
incredibly helpful for Verilator with large RTL simulations like
XiangShan [1] on AMD EPYC Genoa.
I've created a simple benchmark [2] using a static build of an
8-thread Verilator simulation of XiangShan. Simply clone the
repository and run `make run`.
In a statically allocated 8-CCX KVM guest (128 vCPUs in total) on an
EPYC 9T24, the simulation time before the patch was 49.348ms. This
was because the threads were spread across all CCXs, resulting in
very high core-to-core latency. After applying the patch, the entire
8-thread Verilator run is placed on a single CCX, and the simulation
time drops to 24.196ms, a remarkable 2.03x speedup. We don't need
numactl anymore!
[1] https://github.com/OpenXiangShan/XiangShan
[2] https://github.com/cyyself/chacha20-xiangshan
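For anyone who wants to verify the placement themselves, below is a
minimal sketch (not part of the benchmark repo) of how one can check
which CCX each thread ends up on via sched_getcpu(). The cpu / 8
mapping assumes 8 CPUs per CCX, as on this guest; adjust it for other
topologies.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 8

static void *worker(void *arg)
{
	long id = (long)arg;

	/* ... the real simulation work would run here ... */
	int cpu = sched_getcpu();	/* CPU this thread last ran on */
	printf("thread %ld: cpu %d (ccx %d)\n", id, cpu, cpu / 8);
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return 0;
}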
Tested-by: Yangyu Chen <cyy@...self.name>
Thanks,
Yangyu Chen
On 19/6/2025 02:27, Tim Chen wrote:
> This is the third revision of the cache aware scheduling patches,
> based on the original patch proposed by Peter[1].
> The goal of the patch series is to aggregate tasks sharing data
> to the same cache domain, thereby reducing cache bouncing and
> cache misses and improving data access efficiency. In the current
> implementation, threads within the same process are considered
> entities that potentially share resources.
> In previous versions, aggregation of tasks was done in the
> wake-up path, without making load balancing paths aware of
> LLC (Last-Level-Cache) preference. This led to the following
> problems:
> 1) Aggregation of tasks during wake up led to load imbalance
> between LLCs
> 2) Load balancing tried to even out the load between LLCs
> 3) Wake-up task aggregation happened at a faster rate and
> load balancing moved tasks in opposite directions, leading
> to continuous and excessive task migrations and regressions
> in benchmarks like schbench.
> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:
> 1) Identify tasks that prefer to run on their hottest LLC and
> move them there.
> 2) Prevent generic load balancing from moving a task out of
> its hottest LLC.
> By default, LLC task aggregation during wake-up is disabled.
> Conversely, cache-aware load balancing is enabled by default.
> For easier comparison, two scheduler features are introduced:
> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
> wake up and cache-aware load balancing, respectively. By default,
> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so task aggregation
> is only done during load balancing.
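[ At run time these features can be toggled through the scheduler
  features debugfs file; writing the feature name enables it and the
  NO_-prefixed name disables it. A small illustrative snippet,
  assuming debugfs is mounted at /sys/kernel/debug: ]

#include <stdio.h>

int main(void)
{
	/*
	 * Enable cache-aware load balancing (the default per above);
	 * a separate write of "NO_SCHED_CACHE_LB" would disable it,
	 * and "SCHED_CACHE_WAKE" would enable wake-up aggregation.
	 */
	FILE *f = fopen("/sys/kernel/debug/sched/features", "w");

	if (!f) {
		perror("sched features");
		return 1;
	}
	fputs("SCHED_CACHE_LB", f);
	fclose(f);
	return 0;
}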
> With the above default settings, task migrations occur less frequently
> and no longer happen in the latency-sensitive wake-up path.
> The load balancing and migration policy are now implemented in
> a single location within the function _get_migrate_hint().
> Debugfs knobs are also introduced to fine-tune the
> _get_migrate_hint() function. Please refer to patch 7 for
> details.
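[ A rough, self-contained sketch of the kind of decision described
  above; the names, signature and structure here are illustrative
  guesses, not the actual _get_migrate_hint() from patch 7: ]

enum llc_migrate_hint {
	LLC_HINT_NONE,	/* no cache-aware preference either way */
	LLC_HINT_PULL,	/* task prefers the destination LLC: pull it */
	LLC_HINT_AVOID,	/* task would leave its preferred LLC: skip it */
};

static enum llc_migrate_hint migrate_hint_sketch(int src_llc, int dst_llc,
						 int preferred_llc)
{
	if (preferred_llc < 0)		/* no preference recorded */
		return LLC_HINT_NONE;

	/* Part 1: move the task towards its hottest (preferred) LLC. */
	if (dst_llc == preferred_llc && src_llc != preferred_llc)
		return LLC_HINT_PULL;

	/* Part 2: keep generic balancing from pulling it back out. */
	if (src_llc == preferred_llc && dst_llc != preferred_llc)
		return LLC_HINT_AVOID;

	return LLC_HINT_NONE;
}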
> Performance improvements for hackbench are observed in the
> lower load ranges when tested on a 2-socket Sapphire Rapids
> system with 30 cores per socket. DRAM interleaving is enabled in
> the BIOS, so the system essentially has one NUMA node with two
> last level caches. Hackbench benefits from having all the threads
> of the process running in the same LLC. There are some small
> regressions for the heavily loaded case when not all threads can
> fit in an LLC.
> Hackbench is run with one process, with pairs of threads
> ping-ponging messages off each other, using the command below with
> an increasing number of thread pairs; each test runs for 10 cycles:
> hackbench -g 1 --thread --pipe(socket) -l 1000000 -s 100 -f <pairs>
> case load baseline(std%) compare%( std%)
> threads-pipe-8 1-groups 1.00 ( 2.70) +24.51 ( 0.59)
> threads-pipe-15 1-groups 1.00 ( 1.42) +28.37 ( 0.68)
> threads-pipe-30 1-groups 1.00 ( 2.53) +26.16 ( 0.11)
> threads-pipe-45 1-groups 1.00 ( 0.48) +35.38 ( 0.18)
> threads-pipe-60 1-groups 1.00 ( 2.13) +13.46 ( 12.81)
> threads-pipe-75 1-groups 1.00 ( 1.57) +16.71 ( 0.20)
> threads-pipe-90 1-groups 1.00 ( 0.22) -0.57 ( 1.21)
> threads-sockets-8 1-groups 1.00 ( 2.82) +23.04 ( 0.83)
> threads-sockets-15 1-groups 1.00 ( 2.57) +21.67 ( 1.90)
> threads-sockets-30 1-groups 1.00 ( 0.75) +18.78 ( 0.09)
> threads-sockets-45 1-groups 1.00 ( 1.63) +18.89 ( 0.43)
> threads-sockets-60 1-groups 1.00 ( 0.66) +10.10 ( 1.91)
> threads-sockets-75 1-groups 1.00 ( 0.44) -14.49 ( 0.43)
> threads-sockets-90 1-groups 1.00 ( 0.15) -8.03 ( 3.88)
> Similar tests were also run with schbench on the same system.
> Overall latency improvements are observed when underloaded and
> regressions when overloaded. The regressions are significantly
> smaller than in the previous version because cache-aware aggregation
> is done in load balancing rather than in the wake-up path. Besides,
> schbench seems to have large run-to-run variance, so its results
> should only be used as a reference.
> schbench:
> baseline nowake_lb
> Lat 50.0th-qrtle-1 5.00 ( 0.00%) 5.00 ( 0.00%)
> Lat 90.0th-qrtle-1 9.00 ( 0.00%) 8.00 ( 11.11%)
> Lat 99.0th-qrtle-1 15.00 ( 0.00%) 15.00 ( 0.00%)
> Lat 99.9th-qrtle-1 32.00 ( 0.00%) 23.00 ( 28.12%)
> Lat 20.0th-qrtle-1 267.00 ( 0.00%) 266.00 ( 0.37%)
> Lat 50.0th-qrtle-2 8.00 ( 0.00%) 4.00 ( 50.00%)
> Lat 90.0th-qrtle-2 9.00 ( 0.00%) 7.00 ( 22.22%)
> Lat 99.0th-qrtle-2 18.00 ( 0.00%) 11.00 ( 38.89%)
> Lat 99.9th-qrtle-2 26.00 ( 0.00%) 25.00 ( 3.85%)
> Lat 20.0th-qrtle-2 535.00 ( 0.00%) 537.00 ( -0.37%)
> Lat 50.0th-qrtle-4 6.00 ( 0.00%) 4.00 ( 33.33%)
> Lat 90.0th-qrtle-4 8.00 ( 0.00%) 5.00 ( 37.50%)
> Lat 99.0th-qrtle-4 13.00 ( 0.00%) 10.00 ( 23.08%)
> Lat 99.9th-qrtle-4 20.00 ( 0.00%) 14.00 ( 30.00%)
> Lat 20.0th-qrtle-4 1066.00 ( 0.00%) 1050.00 ( 1.50%)
> Lat 50.0th-qrtle-8 5.00 ( 0.00%) 4.00 ( 20.00%)
> Lat 90.0th-qrtle-8 7.00 ( 0.00%) 5.00 ( 28.57%)
> Lat 99.0th-qrtle-8 11.00 ( 0.00%) 8.00 ( 27.27%)
> Lat 99.9th-qrtle-8 17.00 ( 0.00%) 18.00 ( -5.88%)
> Lat 20.0th-qrtle-8 2140.00 ( 0.00%) 2156.00 ( -0.75%)
> Lat 50.0th-qrtle-16 6.00 ( 0.00%) 4.00 ( 33.33%)
> Lat 90.0th-qrtle-16 7.00 ( 0.00%) 6.00 ( 14.29%)
> Lat 99.0th-qrtle-16 11.00 ( 0.00%) 11.00 ( 0.00%)
> Lat 99.9th-qrtle-16 18.00 ( 0.00%) 18.00 ( 0.00%)
> Lat 20.0th-qrtle-16 4296.00 ( 0.00%) 4216.00 ( 1.86%)
> Lat 50.0th-qrtle-32 6.00 ( 0.00%) 4.00 ( 33.33%)
> Lat 90.0th-qrtle-32 7.00 ( 0.00%) 5.00 ( 28.57%)
> Lat 99.0th-qrtle-32 11.00 ( 0.00%) 9.00 ( 18.18%)
> Lat 99.9th-qrtle-32 17.00 ( 0.00%) 14.00 ( 17.65%)
> Lat 20.0th-qrtle-32 8496.00 ( 0.00%) 8624.00 ( -1.51%)
> Lat 50.0th-qrtle-64 5.00 ( 0.00%) 5.00 ( 0.00%)
> Lat 90.0th-qrtle-64 7.00 ( 0.00%) 7.00 ( 0.00%)
> Lat 99.0th-qrtle-64 11.00 ( 0.00%) 11.00 ( 0.00%)
> Lat 99.9th-qrtle-64 17.00 ( 0.00%) 18.00 ( -5.88%)
> Lat 20.0th-qrtle-64 17120.00 ( 0.00%) 15728.00 ( 8.13%)
> Lat 50.0th-qrtle-128 6.00 ( 0.00%) 6.00 ( 0.00%)
> Lat 90.0th-qrtle-128 9.00 ( 0.00%) 8.00 ( 11.11%)
> Lat 99.0th-qrtle-128 13.00 ( 0.00%) 14.00 ( -7.69%)
> Lat 99.9th-qrtle-128 20.00 ( 0.00%) 26.00 ( -30.00%)
> Lat 20.0th-qrtle-128 19488.00 ( 0.00%) 18784.00 ( 3.61%)
> Lat 50.0th-qrtle-239 8.00 ( 0.00%) 8.00 ( 0.00%)
> Lat 90.0th-qrtle-239 16.00 ( 0.00%) 14.00 ( 12.50%)
> Lat 99.0th-qrtle-239 45.00 ( 0.00%) 41.00 ( 8.89%)
> Lat 99.9th-qrtle-239 137.00 ( 0.00%) 225.00 ( -64.23%)
> Lat 20.0th-qrtle-239 30432.00 ( 0.00%) 29920.00 ( 1.68%)
> An AMD Milan system is also tested. It has 4 nodes and 32 CPUs per
> node. Each node has 4 CCXs (shared LLCs) and each CCX has 8 CPUs.
> Hackbench with the 1-group test scenario benefits from cache-aware
> load balancing too:
> hackbench (1 group, with -f ranging over [1,6]):
> case load baseline(std%) compare%( std%)
> threads-pipe-1 1-groups 1.00 ( 1.22) +2.84 ( 0.51)
> threads-pipe-2 1-groups 1.00 ( 5.82) +42.82 ( 43.61)
> threads-pipe-3 1-groups 1.00 ( 3.49) +17.33 ( 18.68)
> threads-pipe-4 1-groups 1.00 ( 2.49) +12.49 ( 5.89)
> threads-pipe-5 1-groups 1.00 ( 1.46) +8.62 ( 4.43)
> threads-pipe-6 1-groups 1.00 ( 2.83) +12.73 ( 8.94)
> threads-sockets-1 1-groups 1.00 ( 1.31) +28.68 ( 2.25)
> threads-sockets-2 1-groups 1.00 ( 5.17) +34.84 ( 36.90)
> threads-sockets-3 1-groups 1.00 ( 1.57) +9.15 ( 5.52)
> threads-sockets-4 1-groups 1.00 ( 1.99) +16.51 ( 6.04)
> threads-sockets-5 1-groups 1.00 ( 2.39) +10.88 ( 2.17)
> threads-sockets-6 1-groups 1.00 ( 1.62) +7.22 ( 2.00)
> Besides a single instance of hackbench, four instances of hackbench
> were also tested on Milan. The results show that the different
> hackbench instances are aggregated to dedicated LLCs, and performance
> improvements are observed.
> schbench mmtests(unstable)
> baseline nowake_lb
> Lat 50.0th-qrtle-1 9.00 ( 0.00%) 8.00 ( 11.11%)
> Lat 90.0th-qrtle-1 12.00 ( 0.00%) 10.00 ( 16.67%)
> Lat 99.0th-qrtle-1 16.00 ( 0.00%) 14.00 ( 12.50%)
> Lat 99.9th-qrtle-1 22.00 ( 0.00%) 21.00 ( 4.55%)
> Lat 20.0th-qrtle-1 759.00 ( 0.00%) 759.00 ( 0.00%)
> Lat 50.0th-qrtle-2 9.00 ( 0.00%) 7.00 ( 22.22%)
> Lat 90.0th-qrtle-2 12.00 ( 0.00%) 12.00 ( 0.00%)
> Lat 99.0th-qrtle-2 16.00 ( 0.00%) 15.00 ( 6.25%)
> Lat 99.9th-qrtle-2 22.00 ( 0.00%) 21.00 ( 4.55%)
> Lat 20.0th-qrtle-2 1534.00 ( 0.00%) 1510.00 ( 1.56%)
> Lat 50.0th-qrtle-4 8.00 ( 0.00%) 9.00 ( -12.50%)
> Lat 90.0th-qrtle-4 12.00 ( 0.00%) 12.00 ( 0.00%)
> Lat 99.0th-qrtle-4 15.00 ( 0.00%) 16.00 ( -6.67%)
> Lat 99.9th-qrtle-4 21.00 ( 0.00%) 23.00 ( -9.52%)
> Lat 20.0th-qrtle-4 3076.00 ( 0.00%) 2860.00 ( 7.02%)
> Lat 50.0th-qrtle-8 10.00 ( 0.00%) 9.00 ( 10.00%)
> Lat 90.0th-qrtle-8 12.00 ( 0.00%) 13.00 ( -8.33%)
> Lat 99.0th-qrtle-8 17.00 ( 0.00%) 17.00 ( 0.00%)
> Lat 99.9th-qrtle-8 22.00 ( 0.00%) 24.00 ( -9.09%)
> Lat 20.0th-qrtle-8 6232.00 ( 0.00%) 5896.00 ( 5.39%)
> Lat 50.0th-qrtle-16 9.00 ( 0.00%) 9.00 ( 0.00%)
> Lat 90.0th-qrtle-16 13.00 ( 0.00%) 13.00 ( 0.00%)
> Lat 99.0th-qrtle-16 17.00 ( 0.00%) 18.00 ( -5.88%)
> Lat 99.9th-qrtle-16 23.00 ( 0.00%) 26.00 ( -13.04%)
> Lat 20.0th-qrtle-16 10096.00 ( 0.00%) 10352.00 ( -2.54%)
> Lat 50.0th-qrtle-32 15.00 ( 0.00%) 15.00 ( 0.00%)
> Lat 90.0th-qrtle-32 25.00 ( 0.00%) 26.00 ( -4.00%)
> Lat 99.0th-qrtle-32 49.00 ( 0.00%) 50.00 ( -2.04%)
> Lat 99.9th-qrtle-32 945.00 ( 0.00%) 1005.00 ( -6.35%)
> Lat 20.0th-qrtle-32 11600.00 ( 0.00%) 11632.00 ( -0.28%)
> Netperf/Tbench have not been tested yet, as they are single-process
> benchmarks that are not the target of this cache-aware scheduling.
> Additionally, client and server components should be tested on
> different machines or bound to different nodes. Otherwise,
> cache-aware scheduling might harm their performance: placing client
> and server in the same LLC could yield higher throughput due to
> improved cache locality in the TCP/IP stack, whereas cache-aware
> scheduling aims to place them in dedicated LLCs.
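[ For the "bound to different nodes" setup, something along these
  lines can be used to launch the client and the server pinned to
  different NUMA nodes (a sketch using libnuma; the node numbers and
  the launched commands are just examples): ]

#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Usage: ./on_node <node> <command> [args...]
 * e.g.   ./on_node 0 netserver
 *        ./on_node 1 netperf -H <server>
 */
int main(int argc, char **argv)
{
	if (argc < 3 || numa_available() < 0)
		return 1;

	/* Restrict this process (and the exec'ed command) to one node. */
	if (numa_run_on_node(atoi(argv[1])) < 0) {
		perror("numa_run_on_node");
		return 1;
	}
	execvp(argv[2], &argv[2]);
	perror("execvp");
	return 1;
}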
> This patch set is applied on v6.15 kernel.
> There is some further work needed for future versions of this
> patch set. We will need to align NUMA balancing with LLC aggregation
> such that LLC aggregation aligns with the preferred NUMA node.
> Comments and tests are much appreciated.
> [1] https://lore.kernel.org/all/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> The patches are grouped as follows:
> Patch 1: Peter's original patch.
> Patch 2-5: Various fixes and tuning of the original v1 patch.
> Patch 6-12: Infrastructure and helper functions for load balancing to be cache aware.
> Patch 13-18: Add logic to load balancing for preferred LLC aggregation.
> Patch 19: Add process LLC aggregation in load balancing sched feature.
> Patch 20: Add process LLC aggregation in wake up sched feature (turned off by default).
> v1:
> https://lore.kernel.org/lkml/20250325120952.GJ36322@noisy.programming.kicks-ass.net/
> v2:
> https://lore.kernel.org/lkml/cover.1745199017.git.yu.c.chen@intel.com/
> Chen Yu (3):
> sched: Several fixes for cache aware scheduling
> sched: Avoid task migration within its preferred LLC
> sched: Save the per LLC utilization for better cache aware scheduling
> K Prateek Nayak (1):
> sched: Avoid calculating the cpumask if the system is overloaded
> Peter Zijlstra (1):
> sched: Cache aware load-balancing
> Tim Chen (15):
> sched: Add hysteresis to switch a task's preferred LLC
> sched: Add helper function to decide whether to allow cache aware
> scheduling
> sched: Set up LLC indexing
> sched: Introduce task preferred LLC field
> sched: Calculate the number of tasks that have LLC preference on a
> runqueue
> sched: Introduce per runqueue task LLC preference counter
> sched: Calculate the total number of preferred LLC tasks during load
> balance
> sched: Tag the sched group as llc_balance if it has tasks prefer other
> LLC
> sched: Introduce update_llc_busiest() to deal with groups having
> preferred LLC tasks
> sched: Introduce a new migration_type to track the preferred LLC load
> balance
> sched: Consider LLC locality for active balance
> sched: Consider LLC preference when picking tasks from busiest queue
> sched: Do not migrate task if it is moving out of its preferred LLC
> sched: Introduce SCHED_CACHE_LB to control cache aware load balance
> sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
> up
> include/linux/mm_types.h | 44 ++
> include/linux/sched.h | 8 +
> include/linux/sched/topology.h | 3 +
> init/Kconfig | 4 +
> init/init_task.c | 3 +
> kernel/fork.c | 5 +
> kernel/sched/core.c | 25 +-
> kernel/sched/debug.c | 4 +
> kernel/sched/fair.c | 859 ++++++++++++++++++++++++++++++++-
> kernel/sched/features.h | 3 +
> kernel/sched/sched.h | 23 +
> kernel/sched/topology.c | 29 ++
> 12 files changed, 982 insertions(+), 28 deletions(-)