Message-ID: <0c0bd184-6926-424b-9ef2-f3910be18073@intel.com>
Date: Wed, 9 Jul 2025 22:41:38 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Libo Chen <libo.chen@...cle.com>
CC: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Abel Wu <wuyun.abel@...edance.com>, "Madadi
Vineeth Reddy" <vineethr@...ux.ibm.com>, Hillf Danton <hdanton@...a.com>,
"Len Brown" <len.brown@...el.com>, <linux-kernel@...r.kernel.org>, Tim Chen
<tim.c.chen@...ux.intel.com>, Peter Zijlstra <peterz@...radead.org>, "Ingo
Molnar" <mingo@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
Subject: Re: [RFC patch v3 07/20] sched: Add helper function to decide whether
to allow cache aware scheduling
On 7/9/2025 1:22 AM, Libo Chen wrote:
>
>
> On 7/8/25 01:29, Chen, Yu C wrote:
>> On 7/8/2025 8:41 AM, Libo Chen wrote:
>>> Hi Tim and Chenyu,
>>>
>>>
>>> On 6/18/25 11:27, Tim Chen wrote:
>>>> Cache-aware scheduling is designed to aggregate threads into their
>>>> preferred LLC, either via the task wake up path or the load balancing
>>>> path. One side effect is that when the preferred LLC is saturated,
>>>> more threads will continue to be stacked on it, degrading the workload's
>>>> latency. A strategy is needed to prevent this aggregation from going so
>>>> far that the preferred LLC becomes overloaded.
>>>>
>>>> Introduce helper function _get_migrate_hint() to implement the LLC
>>>> migration policy:
>>>>
>>>> 1) A task is aggregated to its preferred LLC if both the source and
>>>> destination LLCs are not too busy (<50% utilization, tunable), or if
>>>> the preferred LLC will not become too far out of balance with the
>>>> non-preferred LLC (>20% utilization, tunable, close to imbalance_pct
>>>> of the LLC domain).
>>>> 2) Allow a task to be moved from the preferred LLC to the
>>>> non-preferred one if the non-preferred LLC will not become too far
>>>> out of balance with the preferred one, which would prompt an
>>>> aggregation task migration later. We are still experimenting with the
>>>> aggregation and migration policy. Other possibilities include a
>>>> policy based on the LLC's load or the average number of running
>>>> tasks; those could be tried out by tweaking _get_migrate_hint().
>>>>
>>>> The function _get_migrate_hint() returns migration suggestions for the upper-level [...]
>>>> +__read_mostly unsigned int sysctl_llc_aggr_cap = 50;
>>>> +__read_mostly unsigned int sysctl_llc_aggr_imb = 20;
>>>> +
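(For anyone skimming the thread: the two rules above roughly translate into
the sketch below. This is only an illustration of the policy as described,
not the code from the patch; the helper name, its signature and the exact
comparisons are made up for explanation, and the real code reasons about the
post-migration utilization.)

#include <stdbool.h>	/* bool; in the kernel this comes from linux/types.h */

static unsigned int sysctl_llc_aggr_cap = 50;	/* utilization cap, in % */
static unsigned int sysctl_llc_aggr_imb = 20;	/* allowed imbalance, in % */

/*
 * Would moving a task from an LLC at @src_util to one at @dst_util be
 * acceptable?  @cap is the LLC capacity; utilizations are absolute values
 * on the same scale as @cap.
 */
static bool llc_migrate_ok(unsigned long src_util, unsigned long dst_util,
			   unsigned long cap, bool dst_is_preferred)
{
	bool both_idle = src_util * 100 < cap * sysctl_llc_aggr_cap &&
			 dst_util * 100 < cap * sysctl_llc_aggr_cap;
	unsigned long imb = dst_util > src_util ? dst_util - src_util : 0;
	bool imb_ok = imb * 100 < cap * sysctl_llc_aggr_imb;

	if (dst_is_preferred)	/* rule 1: aggregate toward the preferred LLC */
		return both_idle || imb_ok;

	return imb_ok;		/* rule 2: leave the preferred LLC only if balanced */
}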
>>>
>>>
>>> I think this patch has great potential.
>>>
>>> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
>>> per-task llc_aggr_imb which defaults to the sysctl one? Tasks have different
>>> preferences for llc stacking, and they can all be running in the same system
>>> at the same time. This way you can offer a greater degree of optimization
>>> without much burden to others.
>>
>> Yes, this is doable. It can be evaluated after the global generic strategy
>> has been verified to work, like NUMA balancing :)
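Something like the sketch below is how I would picture the per-task knob
(purely illustrative; no such field or helper exists in the posted series):

/*
 * Hypothetical helper, assuming a new "llc_aggr_imb" field were added to
 * struct task_struct; 0 would mean "use the global sysctl default".
 */
static inline unsigned int task_llc_aggr_imb(struct task_struct *p)
{
	return p->llc_aggr_imb ? p->llc_aggr_imb : sysctl_llc_aggr_imb;
}

_get_migrate_hint() would then consult task_llc_aggr_imb(p) instead of the
global value, and the per-task field could be set via, say, sched_setattr()
or a per-process interface.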
>>
>
> I will run some real-world workloads and get back to you (may take some time)
>
Thanks. It seems that there are pros and cons for different workloads, and
we are evaluating adding per-process RSS/active nr_running to deal with
different types of workloads.
>>>
>>> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE?
>>
>> Do you mean the SCHED_CACHE_WAKE or SCHED_CACHE_LB?
>>
>
> Ah, I was thinking sysctl_llc_aggr_imb alone could help reduce overstacking on
> the target LLC caused by a few hyperactive wakees (we may consider rate-limiting
> those wakees as a solution), but I just realized this affects lb as well and
> doesn't really reduce the overhead from frequent wakeups (no good idea off the
> top of my head, but we should find a better solution than a sched_feat to
> address the overhead issue).
>
>
>
>>> Does setting sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
>>>
>>
>> My understanding is that, if sysctl_llc_aggr_imb is 0, the task aggregation
>> might still consider other aspects, like whether the target LLC's utilization
>> has exceeded 50%.
>>
>
> which can be controlled by sysctl_llc_aggr_cap, right? Okay so if both LLCs have
> <$(sysctl_llc_aggr_cap)% utilization, should sysctl_llc_aggr_cap be the only
> determining factor here barring NUMA balancing?
>
If both LLCs are under (sysctl_llc_aggr_cap)%, then the strategy is still to
allow the task to be aggregated into its preferred LLC, either by not letting
the task be pulled out of its preferred LLC or by migrating it to its
preferred LLC, as decided in _get_migrate_hint().
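As a made-up numeric example with the defaults: if the preferred LLC sits at
30% utilization and the non-preferred one at 40%, both are below the 50% cap,
so the hint is to keep the task on (or move it toward) its preferred LLC.
Once the preferred LLC climbs above the cap and pulling the task there would
leave it more than ~20% busier than the other LLC, aggregation is no longer
suggested and regular load balancing takes over.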
Thanks,
Chenyu