Message-ID: <8c98fff7-fef3-494a-98a3-4b6d4cc2e6d1@linux.ibm.com>
Date: Sat, 21 Jun 2025 00:55:08 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Libo Chen <libo.chen@...cle.com>, Abel Wu <wuyun.abel@...edance.com>,
Hillf Danton <hdanton@...a.com>, Len Brown <len.brown@...el.com>,
linux-kernel@...r.kernel.org, Chen Yu <yu.c.chen@...el.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
Hi Tim,
On 18/06/25 23:57, Tim Chen wrote:
> This is the third revision of the cache aware scheduling patches,
> based on the original patch proposed by Peter[1].
>
> The goal of the patch series is to aggregate tasks sharing data
> onto the same cache domain, thereby reducing cache bouncing and
> cache misses and improving data access efficiency. In the current
> implementation, threads within the same process are considered
> entities that potentially share resources.
>
> In previous versions, aggregation of tasks was done in the
> wake-up path, without making the load balancing paths aware of
> LLC (Last-Level-Cache) preference. This led to the following
> problems:
>
> 1) Aggregation of tasks during wake up led to load imbalance
> between LLCs
> 2) Load balancing tried to even out the load between LLCs
> 3) Wake-up task aggregation happened at a faster rate, and
>    load balancing moved tasks in the opposite direction, leading
>    to continuous and excessive task migrations and regressions
>    in benchmarks like schbench.
>
> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:
>
> 1) Identify tasks that prefer to run on their hottest LLC and
> move them there.
> 2) Prevent generic load balancing from moving a task out of
> its hottest LLC.
>
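To check my understanding of point 2, here is a rough sketch of what
I imagine the guard in the generic load-balance path looks like
(the helper names are mine, not code from the series):

	/*
	 * Hypothetical sketch, not the actual patch code: skip pulling
	 * a task whose preferred LLC is the one it currently runs in.
	 * task_preferred_llc() and llc_id_of() are made-up helpers;
	 * sched_feat() and SCHED_CACHE_LB are from the series.
	 */
	static inline bool keep_in_preferred_llc(struct task_struct *p, int src_cpu)
	{
		int pref = task_preferred_llc(p);	/* -1 if no preference */

		return pref != -1 && pref == llc_id_of(src_cpu);
	}

	/* in a can_migrate_task()-style check: */
	if (sched_feat(SCHED_CACHE_LB) && keep_in_preferred_llc(p, env->src_cpu))
		return 0;	/* keep the task in its hot LLC */
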
> By default, LLC task aggregation during wake-up is disabled.
> Conversely, cache-aware load balancing is enabled by default.
> For easier comparison, two scheduler features are introduced:
> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
> wake up and cache-aware load balancing, respectively. By default,
> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so task aggregation
> is only done during load balancing.
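For my own reference, I assume the new entries in
kernel/sched/features.h look roughly like this (a sketch based on the
defaults stated above, not copied from the patches), and that they can
be flipped at runtime through /sys/kernel/debug/sched/features:

	/* cache-aware load balancing: on by default */
	SCHED_FEAT(SCHED_CACHE_LB, true)
	/* cache-aware wake-up aggregation: off by default */
	SCHED_FEAT(SCHED_CACHE_WAKE, false)
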
Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
LLC on this platform spans 4 threads.
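(The 4-thread LLC span was confirmed from the cache topology in sysfs;
a minimal userspace sketch of that check is below. The index3 path is
an assumption for this machine; scanning the indexN/level files would
be more robust.)

	#include <stdio.h>

	int main(void)
	{
		char buf[256];
		/* CPUs sharing the last-level cache with CPU0 */
		FILE *f = fopen("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list", "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		if (fgets(buf, sizeof(buf), f))
			printf("CPUs sharing CPU0's LLC: %s", buf);
		fclose(f);
		return 0;
	}
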
schbench:
baseline (sd%) baseline+cacheaware (sd%) %change
Lat 50.0th-worker-1 6.33 (24.12%) 6.00 (28.87%) 5.21%
Lat 90.0th-worker-1 7.67 ( 7.53%) 7.67 (32.83%) 0.00%
Lat 99.0th-worker-1 8.67 ( 6.66%) 9.33 (37.63%) -7.61%
Lat 99.9th-worker-1 21.33 (63.99%) 12.33 (28.47%) 42.19%
Lat 50.0th-worker-2 4.33 (13.32%) 5.67 (10.19%) -30.95%
Lat 90.0th-worker-2 5.67 (20.38%) 7.67 ( 7.53%) -35.27%
Lat 99.0th-worker-2 7.33 ( 7.87%) 8.33 ( 6.93%) -13.64%
Lat 99.9th-worker-2 11.67 (24.74%) 10.33 (11.17%) 11.48%
Lat 50.0th-worker-4 5.00 ( 0.00%) 7.00 ( 0.00%) -40.00%
Lat 90.0th-worker-4 7.00 ( 0.00%) 9.67 ( 5.97%) -38.14%
Lat 99.0th-worker-4 8.00 ( 0.00%) 11.33 (13.48%) -41.62%
Lat 99.9th-worker-4 10.33 ( 5.59%) 14.00 ( 7.14%) -35.53%
Lat 50.0th-worker-8 4.33 (13.32%) 5.67 (10.19%) -30.95%
Lat 90.0th-worker-8 6.33 (18.23%) 8.67 ( 6.66%) -36.99%
Lat 99.0th-worker-8 7.67 ( 7.53%) 10.33 ( 5.59%) -34.69%
Lat 99.9th-worker-8 10.00 (10.00%) 12.33 ( 4.68%) -23.30%
Lat 50.0th-worker-16 4.00 ( 0.00%) 5.00 ( 0.00%) -25.00%
Lat 90.0th-worker-16 6.33 ( 9.12%) 7.67 ( 7.53%) -21.21%
Lat 99.0th-worker-16 8.00 ( 0.00%) 10.33 ( 5.59%) -29.13%
Lat 99.9th-worker-16 12.00 ( 8.33%) 13.33 ( 4.33%) -11.08%
Lat 50.0th-worker-32 5.00 ( 0.00%) 5.33 (10.83%) -6.60%
Lat 90.0th-worker-32 7.00 ( 0.00%) 8.67 (17.63%) -23.86%
Lat 99.0th-worker-32 10.67 (14.32%) 12.67 ( 4.56%) -18.75%
Lat 99.9th-worker-32 14.67 ( 3.94%) 19.00 (13.93%) -29.49%
Lat 50.0th-worker-64 5.33 (10.83%) 6.67 ( 8.66%) -25.14%
Lat 90.0th-worker-64 10.00 (17.32%) 14.33 ( 4.03%) -43.30%
Lat 99.0th-worker-64 14.00 ( 7.14%) 16.67 ( 3.46%) -19.07%
Lat 99.9th-worker-64 55.00 (56.69%) 47.00 (61.92%) 14.55%
Lat 50.0th-worker-128 8.00 ( 0.00%) 8.67 (13.32%) -8.38%
Lat 90.0th-worker-128 13.33 ( 4.33%) 14.33 ( 8.06%) -7.50%
Lat 99.0th-worker-128 16.00 ( 0.00%) 20.00 ( 8.66%) -25.00%
Lat 99.9th-worker-128 2258.33 (83.80%) 2974.67 (21.82%) -31.72%
Lat 50.0th-worker-256 47.67 ( 2.42%) 45.33 ( 3.37%) 4.91%
Lat 90.0th-worker-256 3470.67 ( 1.88%) 3558.67 ( 0.47%) -2.54%
Lat 99.0th-worker-256 9040.00 ( 2.76%) 9050.67 ( 0.41%) -0.12%
Lat 99.9th-worker-256 13824.00 (20.07%) 13104.00 ( 6.84%) 5.21%
The above data mostly shows regressions, in both the low and
high load cases.
Hackbench pipe:
Pairs Baseline Avg (s) (Std%) Patched Avg (s) (Std%) % Change
2 2.987 (1.19%) 2.414 (17.99%) 24.06%
4 7.702 (12.53%) 7.228 (18.37%) 6.16%
8 14.141 (1.32%) 13.109 (1.46%) 7.29%
15 27.571 (6.53%) 29.460 (8.71%) -6.84%
30 65.118 (4.49%) 61.352 (4.00%) 5.78%
45 105.086 (9.75%) 97.970 (4.26%) 6.77%
60 149.221 (6.91%) 154.176 (4.17%) -3.32%
75 199.278 (1.21%) 198.680 (1.37%) 0.30%
There is a lot of run-to-run variation in the hackbench runs, so it is
hard to draw firm conclusions on performance, but the results look
better than schbench.
On Power10 and Power11, the LLC is relatively small (4 CPUs) compared
to platforms like Sapphire Rapids and Milan. I haven't gone through
this series yet; I will do so and try to understand why schbench is
unhappy on Power systems.
Meanwhile, I wanted to know your thoughts on how systems with a
smaller LLC are impacted by this patch set.
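For illustration, the kind of cut-off I am wondering about (purely
hypothetical, the helper names are mine): with only 4 CPUs per LLC,
any aggregation guard along these lines would trigger almost
immediately, leaving little room for cache-aware placement.

	/*
	 * Hypothetical illustration, not from the series:
	 * llc_nr_running() and llc_weight() are made-up helpers.
	 * On Power11 llc_weight() would be 4, so an LLC fills up
	 * after only a few runnable threads.
	 */
	static inline bool llc_has_room(int llc)
	{
		return llc_nr_running(llc) < llc_weight(llc);
	}
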
Thanks,
Madadi Vineeth Reddy
>
> With the above default settings, task migrations occur less frequently
> and no longer happen in the latency-sensitive wake-up path.
>
[..snip..]
>
> Chen Yu (3):
> sched: Several fixes for cache aware scheduling
> sched: Avoid task migration within its preferred LLC
> sched: Save the per LLC utilization for better cache aware scheduling
>
> K Prateek Nayak (1):
> sched: Avoid calculating the cpumask if the system is overloaded
>
> Peter Zijlstra (1):
> sched: Cache aware load-balancing
>
> Tim Chen (15):
> sched: Add hysteresis to switch a task's preferred LLC
> sched: Add helper function to decide whether to allow cache aware
> scheduling
> sched: Set up LLC indexing
> sched: Introduce task preferred LLC field
> sched: Calculate the number of tasks that have LLC preference on a
> runqueue
> sched: Introduce per runqueue task LLC preference counter
> sched: Calculate the total number of preferred LLC tasks during load
> balance
> sched: Tag the sched group as llc_balance if it has tasks prefer other
> LLC
> sched: Introduce update_llc_busiest() to deal with groups having
> preferred LLC tasks
> sched: Introduce a new migration_type to track the preferred LLC load
> balance
> sched: Consider LLC locality for active balance
> sched: Consider LLC preference when picking tasks from busiest queue
> sched: Do not migrate task if it is moving out of its preferred LLC
> sched: Introduce SCHED_CACHE_LB to control cache aware load balance
> sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
> up
>
> include/linux/mm_types.h | 44 ++
> include/linux/sched.h | 8 +
> include/linux/sched/topology.h | 3 +
> init/Kconfig | 4 +
> init/init_task.c | 3 +
> kernel/fork.c | 5 +
> kernel/sched/core.c | 25 +-
> kernel/sched/debug.c | 4 +
> kernel/sched/fair.c | 859 ++++++++++++++++++++++++++++++++-
> kernel/sched/features.h | 3 +
> kernel/sched/sched.h | 23 +
> kernel/sched/topology.c | 29 ++
> 12 files changed, 982 insertions(+), 28 deletions(-)
>