Message-ID: <02a4da67-f681-425d-b3dd-3ddf10265a64@linux.ibm.com>
Date: Tue, 24 Jun 2025 23:17:43 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
Vincent Guittot
<vincent.guittot@...aro.org>,
Libo Chen <libo.chen@...cle.com>, Abel Wu <wuyun.abel@...edance.com>,
Hillf Danton <hdanton@...a.com>, Len Brown <len.brown@...el.com>,
linux-kernel@...r.kernel.org,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra
<peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
Subject: Re: [RFC patch v3 00/20] Cache aware scheduling
Hi Chen,

On 22/06/25 06:09, Chen, Yu C wrote:
> On 6/21/2025 3:25 AM, Madadi Vineeth Reddy wrote:
>> Hi Tim,
>>
>> On 18/06/25 23:57, Tim Chen wrote:
>>> This is the third revision of the cache aware scheduling patches,
>>> based on the original patch proposed by Peter[1].
>>> The goal of the patch series is to aggregate tasks sharing data
>>> to the same cache domain, thereby reducing cache bouncing and
>>> cache misses, and improve data access efficiency. In the current
>>> implementation, threads within the same process are considered
>>> as entities that potentially share resources.
>>> In previous versions, aggregation of tasks was done in the
>>> wake up path, without making load balancing paths aware of
>>> LLC (Last-Level-Cache) preference. This led to the following
>>> problems:
>>>
>>> 1) Aggregation of tasks during wake up led to load imbalance
>>> between LLCs
>>> 2) Load balancing tried to even out the load between LLCs
>>> 3) Wake-up task aggregation happened at a faster rate and
>>> load balancing moved tasks in opposite directions, leading
>>> to continuous and excessive task migrations and regressions
>>> in benchmarks like schbench.
>>>
>>> In this version, load balancing is made cache-aware. The main
>>> idea of cache-aware load balancing consists of two parts:
>>>
>>> 1) Identify tasks that prefer to run on their hottest LLC and
>>> move them there.
>>> 2) Prevent generic load balancing from moving a task out of
>>> its hottest LLC.
>>>
>>> By default, LLC task aggregation during wake-up is disabled.
>>> Conversely, cache-aware load balancing is enabled by default.
>>> For easier comparison, two scheduler features are introduced:
>>> SCHED_CACHE_WAKE and SCHED_CACHE_LB, which control cache-aware
>>> wake up and cache-aware load balancing, respectively. By default,
>>> NO_SCHED_CACHE_WAKE and SCHED_CACHE_LB are set, so task aggregation
>>> is only done during load balancing.
>>
>> Tested this patch series on a Power11 system with 28 cores and 224 CPUs.
>> LLC on this platform spans 4 threads.
>>
>> schbench:
>>                           baseline (sd%)  baseline+cacheaware (sd%)   %change
>> Lat 50.0th-worker-1        6.33 (24.12%)              6.00 (28.87%)     5.21%
>> Lat 90.0th-worker-1        7.67 ( 7.53%)              7.67 (32.83%)     0.00%
>> Lat 99.0th-worker-1        8.67 ( 6.66%)              9.33 (37.63%)    -7.61%
>> Lat 99.9th-worker-1       21.33 (63.99%)             12.33 (28.47%)    42.19%
>>
>> Lat 50.0th-worker-2        4.33 (13.32%)              5.67 (10.19%)   -30.95%
>> Lat 90.0th-worker-2        5.67 (20.38%)              7.67 ( 7.53%)   -35.27%
>> Lat 99.0th-worker-2        7.33 ( 7.87%)              8.33 ( 6.93%)   -13.64%
>> Lat 99.9th-worker-2       11.67 (24.74%)             10.33 (11.17%)    11.48%
>>
>> Lat 50.0th-worker-4        5.00 ( 0.00%)              7.00 ( 0.00%)   -40.00%
>> Lat 90.0th-worker-4        7.00 ( 0.00%)              9.67 ( 5.97%)   -38.14%
>> Lat 99.0th-worker-4        8.00 ( 0.00%)             11.33 (13.48%)   -41.62%
>> Lat 99.9th-worker-4       10.33 ( 5.59%)             14.00 ( 7.14%)   -35.53%
>>
>> Lat 50.0th-worker-8        4.33 (13.32%)              5.67 (10.19%)   -30.95%
>> Lat 90.0th-worker-8        6.33 (18.23%)              8.67 ( 6.66%)   -36.99%
>> Lat 99.0th-worker-8        7.67 ( 7.53%)             10.33 ( 5.59%)   -34.69%
>> Lat 99.9th-worker-8       10.00 (10.00%)             12.33 ( 4.68%)   -23.30%
>>
>> Lat 50.0th-worker-16       4.00 ( 0.00%)              5.00 ( 0.00%)   -25.00%
>> Lat 90.0th-worker-16       6.33 ( 9.12%)              7.67 ( 7.53%)   -21.21%
>> Lat 99.0th-worker-16       8.00 ( 0.00%)             10.33 ( 5.59%)   -29.13%
>> Lat 99.9th-worker-16      12.00 ( 8.33%)             13.33 ( 4.33%)   -11.08%
>>
>> Lat 50.0th-worker-32       5.00 ( 0.00%)              5.33 (10.83%)    -6.60%
>> Lat 90.0th-worker-32       7.00 ( 0.00%)              8.67 (17.63%)   -23.86%
>> Lat 99.0th-worker-32      10.67 (14.32%)             12.67 ( 4.56%)   -18.75%
>> Lat 99.9th-worker-32      14.67 ( 3.94%)             19.00 (13.93%)   -29.49%
>>
>> Lat 50.0th-worker-64       5.33 (10.83%)              6.67 ( 8.66%)   -25.14%
>> Lat 90.0th-worker-64      10.00 (17.32%)             14.33 ( 4.03%)   -43.30%
>> Lat 99.0th-worker-64      14.00 ( 7.14%)             16.67 ( 3.46%)   -19.07%
>> Lat 99.9th-worker-64      55.00 (56.69%)             47.00 (61.92%)    14.55%
>>
>> Lat 50.0th-worker-128      8.00 ( 0.00%)              8.67 (13.32%)    -8.38%
>> Lat 90.0th-worker-128     13.33 ( 4.33%)             14.33 ( 8.06%)    -7.50%
>> Lat 99.0th-worker-128     16.00 ( 0.00%)             20.00 ( 8.66%)   -25.00%
>> Lat 99.9th-worker-128   2258.33 (83.80%)           2974.67 (21.82%)   -31.72%
>>
>> Lat 50.0th-worker-256     47.67 ( 2.42%)             45.33 ( 3.37%)     4.91%
>> Lat 90.0th-worker-256   3470.67 ( 1.88%)           3558.67 ( 0.47%)    -2.54%
>> Lat 99.0th-worker-256   9040.00 ( 2.76%)           9050.67 ( 0.41%)    -0.12%
>> Lat 99.9th-worker-256  13824.00 (20.07%)          13104.00 ( 6.84%)     5.21%
>>
>> The data above shows regressions in most cases, at both low and
>> high load.
>>
>>
>> Hackbench pipe:
>>
>> Pairs   Baseline Avg (s) (Std%)   Patched Avg (s) (Std%)   % Change
>>     2            2.987  (1.19%)           2.414 (17.99%)     24.06%
>>     4            7.702 (12.53%)           7.228 (18.37%)      6.16%
>>     8           14.141  (1.32%)          13.109  (1.46%)      7.29%
>>    15           27.571  (6.53%)          29.460  (8.71%)     -6.84%
>>    30           65.118  (4.49%)          61.352  (4.00%)      5.78%
>>    45          105.086  (9.75%)          97.970  (4.26%)      6.77%
>>    60          149.221  (6.91%)         154.176  (4.17%)     -3.32%
>>    75          199.278  (1.21%)         198.680  (1.37%)      0.30%
>>
>> Hackbench shows a lot of run-to-run variation, so it is hard to draw
>> conclusions about performance, but it looks better than schbench.
>
> May I know if the CPU frequency was set at a fixed level and deep
> CPU idle states were disabled (I assume on Power systems they are
> called stop states)?
On a PowerVM LPAR, the deep CPU idle state is called 'cede'. I have not
disabled it.
>
>>
>> On Power 10 and Power 11, the LLC is relatively small (it spans 4 CPUs)
>> compared to platforms like Sapphire Rapids and Milan. I haven't gone
>> through this series yet. I will go through it and try to understand why
>> schbench is not happy on Power systems.
>>
>> Meanwhile, I wanted to know your thoughts on how a smaller LLC
>> is impacted by this patch series.
>>
>
> Task aggregation on a smaller LLC domain (both in terms of the
> number of CPUs and the size of the LLC) might bring cache contention
> and hurt performance IMO. May I know what the cache size is on
> your system:
> lscpu | grep "L3 cache"
L3 cache: 224 MiB (56 instances)
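For reference, that works out to 4 MiB of L3 per LLC instance, with each instance shared by 4 CPUs (224 CPUs / 56 instances). A minimal Python sketch of the arithmetic follows; the helper name `l3_per_instance_mib` is made up for illustration, and the regular expression assumes the `lscpu` summary format shown above:

```python
import re


def l3_per_instance_mib(lscpu_line):
    """Return the L3 size per instance in MiB from an lscpu summary line.

    Assumes the 'L3 cache: <total> MiB (<n> instances)' format.
    """
    m = re.search(r"L3 cache:\s*([\d.]+)\s*MiB\s*\((\d+)\s+instances?\)", lscpu_line)
    if m is None:
        raise ValueError("unrecognized lscpu line: %r" % lscpu_line)
    total_mib = float(m.group(1))
    instances = int(m.group(2))
    return total_mib / instances


line = "L3 cache: 224 MiB (56 instances)"
print(l3_per_instance_mib(line))  # MiB of L3 per LLC instance
print(224 // 56)                  # CPUs per LLC instance on this 224-CPU system
```

So any aggregation target on this machine is limited to 4 hardware threads and 4 MiB of L3.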
>
> May I know if you tested it with:
> echo NO_SCHED_CACHE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_LB > /sys/kernel/debug/sched/features
>
> vs
>
> echo SCHED_CACHE > /sys/kernel/debug/sched/features
> echo NO_SCHED_CACHE_WAKE > /sys/kernel/debug/sched/features
> echo SCHED_CACHE_LB > /sys/kernel/debug/sched/features
>
I tested with and without this patch series and did not change
any sched features, so the patched kernel was running with the default
settings:
SCHED_CACHE, NO_SCHED_CACHE_WAKE, and SCHED_CACHE_LB.
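To double-check which settings a kernel is actually running with, the contents of /sys/kernel/debug/sched/features can be parsed. This is a minimal sketch, assuming the usual debugfs format (a whitespace-separated list in which disabled features carry a NO_ prefix). `parse_sched_features` is a hypothetical helper, and the sample string is illustrative, not captured from a real system:

```python
def parse_sched_features(text):
    """Map each scheduler feature name to True (enabled) or False (disabled).

    Assumes a whitespace-separated list where a leading 'NO_' marks a
    disabled feature, as in /sys/kernel/debug/sched/features.
    """
    flags = {}
    for token in text.split():
        if token.startswith("NO_"):
            flags[token[len("NO_"):]] = False
        else:
            flags[token] = True
    return flags


# Illustrative sample only; on a real system, read the debugfs file instead:
#   text = open("/sys/kernel/debug/sched/features").read()
sample = "SCHED_CACHE NO_SCHED_CACHE_WAKE SCHED_CACHE_LB"
flags = parse_sched_features(sample)
for name in ("SCHED_CACHE", "SCHED_CACHE_WAKE", "SCHED_CACHE_LB"):
    print(name, "on" if flags.get(name) else "off")
```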
> And could you help check if lowering /sys/kernel/debug/sched/llc_aggr_cap
> from 50 to a smaller value (25, etc.) would help?
Will give it a try.
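A minimal sketch of how the knob could be changed between benchmark runs, assuming llc_aggr_cap reads and writes as a plain integer (not verified against the series). `set_llc_aggr_cap` is a hypothetical helper; the demo runs against a temporary file, so it needs neither root nor the patched kernel:

```python
import tempfile
from pathlib import Path

# Path taken from the suggestion above; only present on a patched kernel.
LLC_AGGR_CAP = Path("/sys/kernel/debug/sched/llc_aggr_cap")


def set_llc_aggr_cap(value, path=LLC_AGGR_CAP):
    """Write a new cap and return the previous one.

    Assumption: the file holds a single integer.
    """
    previous = int(path.read_text().strip())
    path.write_text("%d\n" % value)
    return previous


# Demo against a stand-in file seeded with the default of 50.
with tempfile.TemporaryDirectory() as tmp:
    fake_knob = Path(tmp) / "llc_aggr_cap"
    fake_knob.write_text("50\n")
    old = set_llc_aggr_cap(25, fake_knob)
    print(old, fake_knob.read_text().strip())
```

Returning the previous value makes it easy to restore the default once a run finishes.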
Thanks,
Madadi Vineeth Reddy
>
> thanks,
> Chenyu
>
>> Thanks,
>> Madadi Vineeth Reddy
>>
>>
>>>
>>> With above default settings, task migrations occur less frequently
>>> and no longer happen in the latency-sensitive wake-up path.
>>>
>>
>> [..snip..]
>>
>>>
>>> Chen Yu (3):
>>> sched: Several fixes for cache aware scheduling
>>> sched: Avoid task migration within its preferred LLC
>>> sched: Save the per LLC utilization for better cache aware scheduling
>>>
>>> K Prateek Nayak (1):
>>> sched: Avoid calculating the cpumask if the system is overloaded
>>>
>>> Peter Zijlstra (1):
>>> sched: Cache aware load-balancing
>>>
>>> Tim Chen (15):
>>> sched: Add hysteresis to switch a task's preferred LLC
>>> sched: Add helper function to decide whether to allow cache aware
>>> scheduling
>>> sched: Set up LLC indexing
>>> sched: Introduce task preferred LLC field
>>> sched: Calculate the number of tasks that have LLC preference on a
>>> runqueue
>>> sched: Introduce per runqueue task LLC preference counter
>>> sched: Calculate the total number of preferred LLC tasks during load
>>> balance
>>> sched: Tag the sched group as llc_balance if it has tasks prefer other
>>> LLC
>>> sched: Introduce update_llc_busiest() to deal with groups having
>>> preferred LLC tasks
>>> sched: Introduce a new migration_type to track the preferred LLC load
>>> balance
>>> sched: Consider LLC locality for active balance
>>> sched: Consider LLC preference when picking tasks from busiest queue
>>> sched: Do not migrate task if it is moving out of its preferred LLC
>>> sched: Introduce SCHED_CACHE_LB to control cache aware load balance
>>> sched: Introduce SCHED_CACHE_WAKE to control LLC aggregation on wake
>>> up
>>>
>>> include/linux/mm_types.h | 44 ++
>>> include/linux/sched.h | 8 +
>>> include/linux/sched/topology.h | 3 +
>>> init/Kconfig | 4 +
>>> init/init_task.c | 3 +
>>> kernel/fork.c | 5 +
>>> kernel/sched/core.c | 25 +-
>>> kernel/sched/debug.c | 4 +
>>> kernel/sched/fair.c | 859 ++++++++++++++++++++++++++++++++-
>>> kernel/sched/features.h | 3 +
>>> kernel/sched/sched.h | 23 +
>>> kernel/sched/topology.c | 29 ++
>>> 12 files changed, 982 insertions(+), 28 deletions(-)
>>>
>>