Message-ID: <a08e9fe6-c3be-4818-bff0-7ed350b3438a@intel.com>
Date: Wed, 15 Oct 2025 13:38:03 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Tim Chen <tim.c.chen@...ux.intel.com>, Madadi Vineeth Reddy
<vineethr@...ux.ibm.com>
CC: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>, "K
Prateek Nayak" <kprateek.nayak@....com>, "Gautham R . Shenoy"
<gautham.shenoy@....com>, Vincent Guittot <vincent.guittot@...aro.org>, "Juri
Lelli" <juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, "Mel
Gorman" <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, "Hillf
Danton" <hdanton@...a.com>, Shrikanth Hegde <sshegde@...ux.ibm.com>,
"Jianyong Wu" <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>, Len
Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>, Zhao Liu
<zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>, Libo Chen
<libo.chen@...cle.com>, Adam Li <adamli@...amperecomputing.com>, Tim Chen
<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>, Yangyu Chen
<cyy@...self.name>, <haoxing990@...il.com>
Subject: Re: [PATCH 00/19] Cache Aware Scheduling
On 10/15/2025 5:48 AM, Tim Chen wrote:
> On Tue, 2025-10-14 at 17:43 +0530, Madadi Vineeth Reddy wrote:
>> Hi Tim,
>> Thanks for the patch.
>>
>> On 11/10/25 23:54, Tim Chen wrote:
[snip]
>>> [Genoa details]
>>> [ChaCha20-xiangshan]
>>> ChaCha20-xiangshan is a simple benchmark using a static build of an
>>> 8-thread Verilator of XiangShan(RISC-V). The README file can be
>>> found here[2]. The score depends on how aggressive the user set the
>>> /sys/kernel/debug/sched/llc_aggr_tolerance. Using the default values,
>>> there is no much difference observed. While setting the
>>> /sys/kernel/debug/sched/llc_aggr_tolerance to 100, 44% improvment is
>>> observed.
>>>
>>> baseline:
>>> Host time spent: 50,868ms
>>>
>>> sched_cache:
>>> Host time spent: 28,349ms
>>>
>>> The time has been reduced by 44%.
>>
>> Milan showed no improvement across all benchmarks, which could be due to the
>> CCX topology (8 CCXs × 8 CPUs) where the LLC domain is too small for this
>> optimization to be effective. Moreover there could be overhead due to additional
>> computations.
>>
>> ChaCha20-xiangshan improvement in Genoa when llc_aggr_tolerance is set to 100 seems
>> due to having relatively lesser thread count. Please provide the numbers
>> with default values too. Would like to know numbers on varying loads.
>
> I'll ask Chen Yu who did the Xiangshan experiments if he has those numbers.
>
Madadi, do you mean the performance score or the number of active threads
when llc_aggr_tolerance is set to 1 (the default)?
The score is around with sched_cache and llc_aggr_tolerance set to 1.
The number of active threads is 128 per process, and 8 processes are
launched by the benchmark. I suppose the 128 comes from the number
of online CPUs. Please let me know if you need more data.
Cc'ed Yangyu, who is the author of this benchmark.
ls -l /proc/14460/task/ | grep -c '^d'
128
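For anyone who wants to sample the thread count from a program rather than
from the shell, here is a minimal user-space sketch of the same count. It
relies on procfs exposing one numerically named entry per thread under
/proc/<pid>/task. The helper names is_tid_name() and count_tasks() are made
up for illustration.

```c
#include <ctype.h>
#include <dirent.h>
#include <stddef.h>

/* Return 1 if name consists only of decimal digits (a TID), else 0. */
static int is_tid_name(const char *name)
{
    if (*name == '\0')
        return 0;
    for (; *name; name++)
        if (!isdigit((unsigned char)*name))
            return 0;
    return 1;
}

/*
 * Count the per-thread entries under task_dir, e.g. "/proc/14460/task".
 * Returns -1 if the directory cannot be opened (process gone, no procfs).
 */
static int count_tasks(const char *task_dir)
{
    DIR *dir = opendir(task_dir);
    struct dirent *de;
    int count = 0;

    if (!dir)
        return -1;
    /* "." and ".." are skipped because their names are not numeric. */
    while ((de = readdir(dir)) != NULL)
        if (is_tid_name(de->d_name))
            count++;
    closedir(dir);
    return count;
}
```

Calling count_tasks("/proc/14460/task") while the benchmark is running should
give the same number as the ls/grep pipeline above, as long as no thread is
created or exits in between.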
>>
>> In Power 10 and Power 11, the LLC size is 4 threads which is even smaller. Not
>> expecting improvements here but will run some workloads and share the data.
>>
>> Not gone through the entire series yet but are the situations like say in two
>> NUMA system, if a task's preferred LLC is on the wrong NUMA node for its memory,
>> which takes precedence?
>
> We take preferred NUMA node in the consideration but we do not force task to
> go to the preferred node.
>
> I remembered initially we limited the consideration to only LLCs in the
> preferred node. But we encountered regressions in hackbench and schbench,
> because when the preferred node don't have any occupancy resulting in preferred LLC
> to be set to -1 (no preference), and resulted in extra task migrations.
> And also the preferred node for hackbench and schbench was volatile
> as they have small memory footprint. Chen Yu, please chime in if there
> were other reasons you remembered.
>
Since the preferred NUMA node is per task, while the preferred LLC
is per process, scanning only the current task's preferred node
would lead to cross-node migration. This is because the process's
preferred LLC may not reside within the current task's preferred
node. Such a scenario could leave curr_m_a_occ at 0, and any LLC
with an occupancy > 0 would then trigger a preferred LLC switch.
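To make the scenario concrete, here is a simplified user-space model of the
selection, not the code from the series. struct llc_model,
pick_preferred_llc() and the plain "larger occupancy wins" comparison are
assumptions for illustration; only curr_m_a_occ is a name taken from the
discussion.

```c
#include <stddef.h>

/* Hypothetical per-LLC view of one process: LLC id, NUMA node, occupancy. */
struct llc_model {
    int id;
    int node;
    unsigned long occ;
};

/*
 * Pick the preferred LLC for a process while scanning only the LLCs in
 * scan_node (pass -1 to scan every node). cur_pref is the process's
 * current preferred LLC. Returns the new preferred LLC id.
 */
static int pick_preferred_llc(const struct llc_model *llcs, size_t n,
                              int scan_node, int cur_pref)
{
    unsigned long curr_m_a_occ = 0; /* stays 0 if cur_pref is never scanned */
    unsigned long best_occ = 0;
    int best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (scan_node != -1 && llcs[i].node != scan_node)
            continue;
        if (llcs[i].id == cur_pref)
            curr_m_a_occ = llcs[i].occ;
        if (llcs[i].occ > best_occ) {
            best_occ = llcs[i].occ;
            best = llcs[i].id;
        }
    }

    /* Assumed rule: switch when the best scanned LLC beats the current one. */
    if (best != -1 && best_occ > curr_m_a_occ)
        return best;
    return cur_pref;
}
```

With LLC 0 on node 0 at occupancy 100 as the current preference and LLC 2 on
node 1 at occupancy 5, scanning only node 1 never sees LLC 0, so
curr_m_a_occ stays 0 and the model switches to LLC 2. Scanning all nodes
keeps LLC 0.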
> We'll need to revisit this part of the code to take care of such
> corner case. I think ideally we should move tasks to the least loaded LLC
> in the preferred node (even if no LLCs have occupancy in the preferred node),
> as long as preferred NUMA node don't changes too often.
>
>
Then we might need to introduce a new member in mm_struct to store the old
occupancy, curr_m_a_occ, so that we can reliably compare the old and new
occupancy and avoid comparing against a curr_m_a_occ of 0.
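A rough sketch of that idea, as a user-space model under stated assumptions:
struct mm_sched_model stands in for the relevant part of mm_struct, and the
member pref_llc_occ and the helper update_preferred_llc() are hypothetical
names, not anything from the series.

```c
/* Hypothetical stand-in for the scheduling fields kept per mm_struct. */
struct mm_sched_model {
    int pref_llc;               /* current preferred LLC, -1 if none */
    unsigned long pref_llc_occ; /* occupancy recorded for pref_llc */
};

/*
 * Apply the result of one scan. best_llc/best_occ describe the busiest LLC
 * the scan saw (best_llc is -1 if it saw none). pref_seen says whether the
 * scan covered the preferred LLC, and pref_occ_now is its occupancy if so.
 * Returns 1 if the preferred LLC was switched, 0 otherwise.
 */
static int update_preferred_llc(struct mm_sched_model *mm, int best_llc,
                                unsigned long best_occ, int pref_seen,
                                unsigned long pref_occ_now)
{
    /* Refresh the saved value only when the preferred LLC was scanned. */
    if (pref_seen)
        mm->pref_llc_occ = pref_occ_now;

    /* Compare against the saved occupancy instead of an implicit 0. */
    if (best_llc >= 0 && best_llc != mm->pref_llc &&
        best_occ > mm->pref_llc_occ) {
        mm->pref_llc = best_llc;
        mm->pref_llc_occ = best_occ;
        return 1;
    }
    return 0;
}
```

One open question with this approach is staleness: if the preferred LLC is
not scanned for a long time, the saved occupancy may need to be decayed or
invalidated, which the sketch does not attempt.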
>>
>> Also, what about the workloads that don't share data like stress-ng?
>>
Stream is a single process that stresses memory without any shared
data, and we did not observe any difference on stream. We can run more
tests with stress-ng.
thanks,
Chenyu
>
> We can test those. Ideally the controls to prevent over aggregation to preferred LLC
> would keep stress-ng happy.
>
>> It will
>> be good to make sure that most other workloads don't suffer. As mentioned,
>> per process knob for llc_aggr_tolerance could help.
>
> Agree. We are planning to add per process knob for the next version. One thought is to use
> prctl. Any other suggestions are welcome.
>