Message-ID: <c9328e19-3b18-4ea3-a692-9cb02534e5c9@intel.com>
Date: Sun, 22 Jun 2025 08:39:38 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, Tim Chen
<tim.c.chen@...ux.intel.com>, Peter Zijlstra <peterz@...radead.org>, "Ingo
Molnar" <mingo@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>, Abel Wu
<wuyun.abel@...edance.com>, Hillf Danton <hdanton@...a.com>, Len Brown
<len.brown@...el.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
On 6/21/2025 3:25 AM, Madadi Vineeth Reddy wrote:
> Hi Tim,
>
> On 18/06/25 23:57, Tim Chen wrote:
>> This is the third revision of the cache aware scheduling patches,
>> based on the original patch proposed by Peter[1].
>>
>> The goal of the patch series is to aggregate tasks sharing data
>> to the same cache domain, thereby reducing cache bouncing and
>> cache misses, and improving data access efficiency. In the current
>> implementation, threads within the same process are considered
>> as entities that potentially share resources.
>>
>> In previous versions, aggregation of tasks was done in the
>> wake up path, without making load balancing paths aware of
>> LLC (Last-Level-Cache) preference. This led to the following
>> problems:
>>
>> 1) Aggregation of tasks during wake up led to load imbalance
>> between LLCs
>> 2) Load balancing tried to even out the load between LLCs
>> 3) Wake-up task aggregation happened at a faster rate and
>> load balancing moved tasks in opposite directions, leading
>> to continuous and excessive task migrations and regressions
>> in benchmarks like schbench.
>>
>> In this version, load balancing is made cache-aware. The main
>> idea of cache-aware load balancing consists of two parts:
>>
>> 1) Identify tasks that prefer to run on their hottest LLC and
>> move them there.
>> 2) Prevent generic load balancing from moving a task out of
>> its hottest LLC.
>>
>> By default, LLC task aggregation during wake-up is disabled.
>> Conversely, cache-aware load balancing is enabled by default.
>> For easier comparison, two scheduler features are introduced:
>> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
>> wake up and cache-aware load balancing, respectively. By default,
>> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so task aggregation
>> is only done during load balancing.
>
> Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
> LLC on this platform spans 4 threads.
>
> schbench:
> baseline (sd%) baseline+cacheaware (sd%) %change
> Lat 50.0th-worker-1 6.33 (24.12%) 6.00 (28.87%) 5.21%
> Lat 90.0th-worker-1 7.67 ( 7.53%) 7.67 (32.83%) 0.00%
> Lat 99.0th-worker-1 8.67 ( 6.66%) 9.33 (37.63%) -7.61%
> Lat 99.9th-worker-1 21.33 (63.99%) 12.33 (28.47%) 42.19%
>
> Lat 50.0th-worker-2 4.33 (13.32%) 5.67 (10.19%) -30.95%
> Lat 90.0th-worker-2 5.67 (20.38%) 7.67 ( 7.53%) -35.27%
> Lat 99.0th-worker-2 7.33 ( 7.87%) 8.33 ( 6.93%) -13.64%
> Lat 99.9th-worker-2 11.67 (24.74%) 10.33 (11.17%) 11.48%
>
> Lat 50.0th-worker-4 5.00 ( 0.00%) 7.00 ( 0.00%) -40.00%
> Lat 90.0th-worker-4 7.00 ( 0.00%) 9.67 ( 5.97%) -38.14%
> Lat 99.0th-worker-4 8.00 ( 0.00%) 11.33 (13.48%) -41.62%
> Lat 99.9th-worker-4 10.33 ( 5.59%) 14.00 ( 7.14%) -35.53%
>
> Lat 50.0th-worker-8 4.33 (13.32%) 5.67 (10.19%) -30.95%
> Lat 90.0th-worker-8 6.33 (18.23%) 8.67 ( 6.66%) -36.99%
> Lat 99.0th-worker-8 7.67 ( 7.53%) 10.33 ( 5.59%) -34.69%
> Lat 99.9th-worker-8 10.00 (10.00%) 12.33 ( 4.68%) -23.30%
>
> Lat 50.0th-worker-16 4.00 ( 0.00%) 5.00 ( 0.00%) -25.00%
> Lat 90.0th-worker-16 6.33 ( 9.12%) 7.67 ( 7.53%) -21.21%
> Lat 99.0th-worker-16 8.00 ( 0.00%) 10.33 ( 5.59%) -29.13%
> Lat 99.9th-worker-16 12.00 ( 8.33%) 13.33 ( 4.33%) -11.08%
>
> Lat 50.0th-worker-32 5.00 ( 0.00%) 5.33 (10.83%) -6.60%
> Lat 90.0th-worker-32 7.00 ( 0.00%) 8.67 (17.63%) -23.86%
> Lat 99.0th-worker-32 10.67 (14.32%) 12.67 ( 4.56%) -18.75%
> Lat 99.9th-worker-32 14.67 ( 3.94%) 19.00 (13.93%) -29.49%
>
> Lat 50.0th-worker-64 5.33 (10.83%) 6.67 ( 8.66%) -25.14%
> Lat 90.0th-worker-64 10.00 (17.32%) 14.33 ( 4.03%) -43.30%
> Lat 99.0th-worker-64 14.00 ( 7.14%) 16.67 ( 3.46%) -19.07%
> Lat 99.9th-worker-64 55.00 (56.69%) 47.00 (61.92%) 14.55%
>
> Lat 50.0th-worker-128 8.00 ( 0.00%) 8.67 (13.32%) -8.38%
> Lat 90.0th-worker-128 13.33 ( 4.33%) 14.33 ( 8.06%) -7.50%
> Lat 99.0th-worker-128 16.00 ( 0.00%) 20.00 ( 8.66%) -25.00%
> Lat 99.9th-worker-128 2258.33 (83.80%) 2974.67 (21.82%) -31.72%
>
> Lat 50.0th-worker-256 47.67 ( 2.42%) 45.33 ( 3.37%) 4.91%
> Lat 90.0th-worker-256 3470.67 ( 1.88%) 3558.67 ( 0.47%) -2.54%
> Lat 99.0th-worker-256 9040.00 ( 2.76%) 9050.67 ( 0.41%) -0.12%
> Lat 99.9th-worker-256 13824.00 (20.07%) 13104.00 ( 6.84%) 5.21%
>
> The above data shows mostly regressions in both the lower and
> higher load cases.
>
>
> Hackbench pipe:
>
> Pairs Baseline Avg (s) (Std%) Patched Avg (s) (Std%) % Change
> 2 2.987 (1.19%) 2.414 (17.99%) 24.06%
> 4 7.702 (12.53%) 7.228 (18.37%) 6.16%
> 8 14.141 (1.32%) 13.109 (1.46%) 7.29%
> 15 27.571 (6.53%) 29.460 (8.71%) -6.84%
> 30 65.118 (4.49%) 61.352 (4.00%) 5.78%
> 45 105.086 (9.75%) 97.970 (4.26%) 6.77%
> 60 149.221 (6.91%) 154.176 (4.17%) -3.32%
> 75 199.278 (1.21%) 198.680 (1.37%) 0.30%
>
> A lot of run-to-run variation is seen in the hackbench runs, so it is
> hard to judge the performance, but it looks better than schbench.
May I know if the CPU frequency was set at a fixed level and deep
CPU idle states were disabled (I assume on Power systems these are
called stop states)?
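For example (just a sketch, assuming cpupower is installed; the sysfs
paths below are the generic cpuidle ones):

  # pin the frequency governor and drop deep idle states for the runs
  cpupower frequency-set -g performance
  cpupower idle-set -D 0
  # or inspect/disable individual states by hand (per CPU)
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
  echo 1 > /sys/devices/system/cpu/cpu0/cpuidle/state1/disable

Deep idle exit latency and frequency ramping can easily dominate the
schbench tail latencies at low worker counts.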
>
> In Power10 and Power11, the LLC size is relatively small (4 CPUs)
> compared to platforms like Sapphire Rapids and Milan. I haven't gone
> through this series yet. I will go through it and try to understand why
> schbench is not happy on Power systems.
>
> Meanwhile, I wanted to know your thoughts on how smaller LLC
> sizes are impacted by this patch.
>
Task aggregation on a smaller LLC domain (both in terms of the
number of CPUs and the size of the LLC) might bring cache contention
and hurt performance, IMO. May I know what the cache size is on
your system:
lscpu | grep "L3 cache"
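The LLC size and span can also be read from sysfs, for example
(index3 is usually the L3 slice, but the index number may differ by
platform):

  cat /sys/devices/system/cpu/cpu0/cache/index3/size
  cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list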
May I know if you tested it with:
echo NO_SCHED_CACHE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_LB > /sys/kernel/debug/sched/features
vs
echo SCHED_CACHE > /sys/kernel/debug/sched/features
echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
echo SCHED_CACHE_LB > /sys/kernel/debug/sched/features
Also, could you help check whether setting /sys/kernel/debug/sched/llc_aggr_cap
from 50 to a smaller value (25, etc.) helps?
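For example (a sketch; the workload line is a placeholder for the same
schbench invocation you used above):

  # with SCHED_CACHE and SCHED_CACHE_LB enabled as above
  cat /sys/kernel/debug/sched/llc_aggr_cap    # 50 by default
  echo 25 > /sys/kernel/debug/sched/llc_aggr_cap
  # ... re-run the same schbench workload and compare the tails ...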
thanks,
Chenyu
> Thanks,
> Madadi Vineeth Reddy
>
>
>>
>> With the above default settings, task migrations occur less frequently
>> and no longer happen in the latency-sensitive wake-up path.
>>
>
> [..snip..]
>
>>
>> Chen Yu (3):
>> sched: Several fixes for cache aware scheduling
>> sched: Avoid task migration within its preferred LLC
>> sched: Save the per LLC utilization for better cache aware scheduling
>>
>> K Prateek Nayak (1):
>> sched: Avoid calculating the cpumask if the system is overloaded
>>
>> Peter Zijlstra (1):
>> sched: Cache aware load-balancing
>>
>> Tim Chen (15):
>> sched: Add hysteresis to switch a task's preferred LLC
>> sched: Add helper function to decide whether to allow cache aware
>> scheduling
>> sched: Set up LLC indexing
>> sched: Introduce task preferred LLC field
>> sched: Calculate the number of tasks that have LLC preference on a
>> runqueue
>> sched: Introduce per runqueue task LLC preference counter
>> sched: Calculate the total number of preferred LLC tasks during load
>> balance
>> sched: Tag the sched group as llc_balance if it has tasks prefer other
>> LLC
>> sched: Introduce update_llc_busiest() to deal with groups having
>> preferred LLC tasks
>> sched: Introduce a new migration_type to track the preferred LLC load
>> balance
>> sched: Consider LLC locality for active balance
>> sched: Consider LLC preference when picking tasks from busiest queue
>> sched: Do not migrate task if it is moving out of its preferred LLC
>> sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>> sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>> up
>>
>> include/linux/mm_types.h | 44 ++
>> include/linux/sched.h | 8 +
>> include/linux/sched/topology.h | 3 +
>> init/Kconfig | 4 +
>> init/init_task.c | 3 +
>> kernel/fork.c | 5 +
>> kernel/sched/core.c | 25 +-
>> kernel/sched/debug.c | 4 +
>> kernel/sched/fair.c | 859 ++++++++++++++++++++++++++++++++-
>> kernel/sched/features.h | 3 +
>> kernel/sched/sched.h | 23 +
>> kernel/sched/topology.c | 29 ++
>> 12 files changed, 982 insertions(+), 28 deletions(-)
>>
>