linux-kernel - Re: [PATCH 00/19] Cache Aware Scheduling

Message-ID: <218a4324-c28c-4068-8526-3f27a55a2e70@linux.ibm.com>
Date: Wed, 15 Oct 2025 23:56:41 +0530
From: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>,
        Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
        K Prateek Nayak <kprateek.nayak@....com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
        Hillf Danton <hdanton@...a.com>,
        Shrikanth Hegde <sshegde@...ux.ibm.com>,
        Jianyong Wu <jianyong.wu@...look.com>, Yangyu Chen <cyy@...self.name>,
        Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>,
        Len Brown <len.brown@...el.com>, Aubrey Li <aubrey.li@...el.com>,
        Zhao Liu <zhao1.liu@...el.com>, Chen Yu <yu.chen.surf@...il.com>,
        Libo Chen <libo.chen@...cle.com>,
        Adam Li <adamli@...amperecomputing.com>,
        Tim Chen <tim.c.chen@...el.com>, linux-kernel@...r.kernel.org,
        haoxing990@...il.com, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [PATCH 00/19] Cache Aware Scheduling

On 15/10/25 11:08, Chen, Yu C wrote:
> On 10/15/2025 5:48 AM, Tim Chen wrote:
>> On Tue, 2025-10-14 at 17:43 +0530, Madadi Vineeth Reddy wrote:
>>> Hi Tim,
>>> Thanks for the patch.
>>>
>>> On 11/10/25 23:54, Tim Chen wrote:
> 
> [snip]
> 
>>>> [Genoa details]
>>>> [ChaCha20-xiangshan]
>>>> ChaCha20-xiangshan is a simple benchmark using a static build of an
>>>> 8-thread Verilator of XiangShan (RISC-V). The README file can be
>>>> found here[2]. The score depends on how aggressively the user sets
>>>> /sys/kernel/debug/sched/llc_aggr_tolerance. With the default value,
>>>> not much difference is observed. With
>>>> /sys/kernel/debug/sched/llc_aggr_tolerance set to 100, a 44% improvement
>>>> is observed.
>>>>
>>>> baseline:
>>>> Host time spent: 50,868ms
>>>>
>>>> sched_cache:
>>>> Host time spent: 28,349ms
>>>>
>>>> The time has been reduced by 44%.
>>>
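For reference, the 44% figure follows directly from the two host times quoted above. A minimal check in Python (the variable names are illustrative, not from the patch or the benchmark):

```python
# Host times copied from the quoted Genoa ChaCha20-xiangshan results.
baseline_ms = 50_868      # "baseline"
sched_cache_ms = 28_349   # "sched_cache" with llc_aggr_tolerance set to 100

# Fraction of host time saved relative to baseline.
reduction = (baseline_ms - sched_cache_ms) / baseline_ms
# Equivalent speedup factor.
speedup = baseline_ms / sched_cache_ms

print(f"time reduction: {reduction:.1%}")
print(f"speedup: {speedup:.2f}x")
```

That works out to roughly a 44% reduction in host time, or about a 1.79x speedup.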
>>> Milan showed no improvement across all benchmarks, which could be due to the
>>> CCX topology (8 CCXs × 8 CPUs) where the LLC domain is too small for this
>>> optimization to be effective. Moreover there could be overhead due to additional
>>> computations.
>>>
>>> The ChaCha20-xiangshan improvement on Genoa with llc_aggr_tolerance set to 100 seems
>>> to be due to the relatively low thread count. Please provide the numbers
>>> with default values too. I would also like to see numbers under varying loads.
>>
>> I'll ask Chen Yu who did the Xiangshan experiments if he has those numbers.
>>
> 
> Madadi, do you mean the performance score number or the active thread number
> when llc_aggr_tolerance is set to 1 (default)?
> The score is around with sched_cache and llc_aggr_tolerance set to 1.
> The active number is 128 per process, and there are 8 processes when
> launching the benchmark. I suppose the 128 comes from the number
> of online CPUs. Please let me know if you need more data.
> 
> Cced Yangyu who's the author of this benchmark.

I mean the benchmark result with the default value of llc_aggr_tolerance on Genoa,
compared to the baseline. Knowing the number of threads also helps in understanding
the impact.

> 
> ls -l /proc/14460/task/ | grep -c '^d'
> 128
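The same per-process thread count can be obtained programmatically. Below is a minimal sketch, assuming a Linux /proc filesystem; the helper name count_threads is made up for illustration:

```python
import os

def count_threads(pid: int) -> int:
    """Count the threads of a process on Linux.

    Each thread appears as a numeric directory entry under
    /proc/<pid>/task, which is what the ls | grep pipeline counts.
    """
    task_dir = f"/proc/{pid}/task"
    return sum(1 for name in os.listdir(task_dir) if name.isdigit())
```

For example, count_threads(14460) would report the thread count of the process inspected above, provided that PID still exists.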
> 
>>>
>>> On Power 10 and Power 11, the LLC domain size is 4 threads, which is even smaller. I am
>>> not expecting improvements here but will run some workloads and share the data.
>>>
>>> I have not gone through the entire series yet, but consider, say, a two-node NUMA
>>> system: if a task's preferred LLC is on the wrong NUMA node for its memory,
>>> which takes precedence?
>>
>> We take the preferred NUMA node into consideration, but we do not force the task to
>> go to the preferred node.
>>
>> I remember that initially we limited the consideration to only LLCs in the
>> preferred node. But we encountered regressions in hackbench and schbench:
>> when the preferred node did not have any occupancy, the preferred LLC
>> was set to -1 (no preference), which resulted in extra task migrations.
>> Also, the preferred node for hackbench and schbench was volatile
>> as they have a small memory footprint.  Chen Yu, please chime in if there
>> were other reasons you remember.
>>
> 
> Since the preferred NUMA node is per task, while the preferred LLC
> is per process, scanning only the current task's preferred node
> would lead to cross-node migration. This is because the process's
> preferred LLC may not reside within the current task's preferred
> node. Such a scenario could leave curr_m_a_occ at 0, and any LLC
> with an occupancy > 0 would then trigger a preferred LLC switch.

Understood. Thanks for the context.
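
To make the scenario above concrete, here is a toy Python model of it. This is only a sketch of the reasoning in the quoted explanation, not the code from the series: the function, its parameters, and the selection rule are assumptions made for illustration, and only the name curr_m_a_occ comes from the thread.

```python
def pick_preferred_llc(occ_by_llc, llc_node, current_pref, scan_nodes):
    """Toy model: pick a process's preferred LLC among LLCs on scan_nodes.

    occ_by_llc:   {llc_id: occupancy} for the whole process
    llc_node:     {llc_id: numa_node}
    current_pref: currently preferred LLC id
    scan_nodes:   set of NUMA nodes whose LLCs are scanned
    """
    scanned = [llc for llc in occ_by_llc if llc_node[llc] in scan_nodes]
    # Occupancy of the current preferred LLC as seen by this scan. It stays
    # at 0 when that LLC lies outside the scanned nodes (assumed behavior).
    curr_m_a_occ = occ_by_llc[current_pref] if current_pref in scanned else 0
    best, best_occ = current_pref, curr_m_a_occ
    for llc in scanned:
        if occ_by_llc[llc] > best_occ:
            best, best_occ = llc, occ_by_llc[llc]
    return best

# The process mostly runs on LLC 0 (node 0); a few threads sit on LLC 2 (node 1).
occ = {0: 90, 1: 0, 2: 5, 3: 0}
llc_node = {0: 0, 1: 0, 2: 1, 3: 1}

# A task whose preferred node is 1 scans only node 1: LLC 0 is not seen,
# curr_m_a_occ is 0, and the lightly used LLC 2 wins.
print(pick_preferred_llc(occ, llc_node, current_pref=0, scan_nodes={1}))
# Scanning all nodes keeps the heavily occupied LLC 0 as the preference.
print(pick_preferred_llc(occ, llc_node, current_pref=0, scan_nodes={0, 1}))
```

In this model the node-restricted scan switches the preference from LLC 0 to LLC 2, while the full scan keeps LLC 0.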

> 
>> We'll need to revisit this part of the code to take care of such
>> corner cases. I think ideally we should move tasks to the least loaded LLC
>> in the preferred node (even if no LLCs have occupancy in the preferred node),
>> as long as the preferred NUMA node doesn't change too often.
>>
>>
> 
> Then we might need to introduce a new member in mm_struct to store the old
> occupancy, curr_m_a_occ, so that we can reliably compare the old and new
> occupancy - to avoid the 0 value of curr_m_a_occ.
> 
>>>
>>> Also, what about the workloads that don't share data like stress-ng?
>>>
> 
> stream is a single process that stresses memory without any shared
> data, and we did not observe any difference on stream. We can run more
> tests with stress-ng.
> 

That would be helpful.

Thanks,
Madadi Vineeth Reddy

> thanks,
> Chenyu
>
>> We can test those.  Ideally the controls to prevent over aggregation to preferred LLC
>> would keep stress-ng happy.
>>
>>> It would
>>> be good to make sure that most other workloads don't suffer. As mentioned,
>>> a per-process knob for llc_aggr_tolerance could help.
>>
>> Agreed. We are planning to add a per-process knob in the next version.  One thought is to use
>> prctl. Any other suggestions are welcome.
>>
>