Message-ID: <f102210a-66b1-45da-b553-f68a33360736@oracle.com>
Date: Wed, 9 Jul 2025 14:31:43 -0700
From: Libo Chen <libo.chen@...cle.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
Vincent Guittot
<vincent.guittot@...aro.org>,
Abel Wu <wuyun.abel@...edance.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Len Brown <len.brown@...el.com>,
linux-kernel@...r.kernel.org, Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
Subject: Re: [RFC patch v3 07/20] sched: Add helper function to decide whether
to allow cache aware scheduling
On 7/9/25 07:41, Chen, Yu C wrote:
> On 7/9/2025 1:22 AM, Libo Chen wrote:
>>
>>
>> On 7/8/25 01:29, Chen, Yu C wrote:
>>> On 7/8/2025 8:41 AM, Libo Chen wrote:
>>>> Hi Tim and Chenyu,
>>>>
>>>>
>>>> On 6/18/25 11:27, Tim Chen wrote:
>>>>> Cache-aware scheduling is designed to aggregate threads into their
>>>>> preferred LLC, either via the task wake up path or the load balancing
>>>>> path. One side effect is that when the preferred LLC is saturated,
>>>>> more threads will continue to be stacked on it, degrading the workload's
>>>>> latency. A strategy is needed to prevent this aggregation from going too
>>>>> far such that the preferred LLC is too overloaded.
>>>>>
>>>>> Introduce helper function _get_migrate_hint() to implement the LLC
>>>>> migration policy:
>>>>>
>>>>> 1) A task is aggregated into its preferred LLC if both the source
>>>>> and destination LLCs are not too busy (<50% utilization, tunable),
>>>>> or if the preferred LLC will not become too out of balance with
>>>>> the non-preferred LLC (>20% utilization, tunable, close to the
>>>>> imbalance_pct of the LLC domain).
>>>>> 2) A task is allowed to move from the preferred LLC to the
>>>>> non-preferred one if the non-preferred LLC will not become so out
>>>>> of balance with the preferred one that it prompts an aggregation
>>>>> migration back later. We are still experimenting with the
>>>>> aggregation and migration policy. Other possibilities are policies
>>>>> based on the LLC's load or the average number of running tasks;
>>>>> those could be tried out by tweaking _get_migrate_hint().
>>>>>
>>>>> The function _get_migrate_hint() returns migration suggestions to its upper-level callers:
>>>>> +__read_mostly unsigned int sysctl_llc_aggr_cap = 50;
>>>>> +__read_mostly unsigned int sysctl_llc_aggr_imb = 20;
>>>>> +
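The two-rule policy quoted above could be sketched roughly as follows in plain userspace C; the function signature, the percent-based utilization arithmetic, and the constant names are assumptions for illustration, not the actual kernel implementation:

```c
#include <stdbool.h>

enum migrate_hint { MIG_ALLOW, MIG_DENY };

#define LLC_AGGR_CAP 50 /* mirrors sysctl_llc_aggr_cap (% of LLC capacity) */
#define LLC_AGGR_IMB 20 /* mirrors sysctl_llc_aggr_imb (% imbalance)       */

/*
 * All utilization values are percent of LLC capacity; to_preferred says
 * whether the move is toward the task's preferred LLC.
 */
static enum migrate_hint get_migrate_hint(int src_util, int dst_util,
					  int task_util, bool to_preferred)
{
	int new_dst = dst_util + task_util;
	int new_src = src_util - task_util;

	if (to_preferred) {
		/* Rule 1: aggregate if both LLCs are not too busy ... */
		if (src_util < LLC_AGGR_CAP && new_dst < LLC_AGGR_CAP)
			return MIG_ALLOW;
		/* ... or if the preferred LLC stays within the imbalance cap */
		if (new_dst - new_src <= LLC_AGGR_IMB)
			return MIG_ALLOW;
		return MIG_DENY;
	}

	/*
	 * Rule 2: leave the preferred LLC only if the non-preferred side
	 * won't end up so loaded that it triggers an aggregation
	 * migration back later.
	 */
	if (new_dst - new_src <= LLC_AGGR_IMB)
		return MIG_ALLOW;
	return MIG_DENY;
}
```

With these numbers, a move toward a lightly loaded preferred LLC is allowed, while stacking onto an already-busy one past the 20% imbalance cap is denied.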
>>>>
>>>>
>>>> I think this patch has great potential.
>>>>
>>>> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
>>>> per-task llc_aggr_imb that defaults to the sysctl one? Tasks have different
>>>> preferences for LLC stacking, and they can all be running in the same system
>>>> at the same time. This way you can offer a greater degree of optimization
>>>> without much burden on others.
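A per-task override along these lines might look like the following userspace sketch; the struct and helper names are hypothetical, not from the patch set:

```c
/* Global knob, mirroring sysctl_llc_aggr_imb from the patch. */
static unsigned int sysctl_llc_aggr_imb = 20;

/* Hypothetical per-task setting: 0 means "unset", fall back to the sysctl. */
struct task_llc_prefs {
	unsigned int llc_aggr_imb;
};

/* Effective imbalance threshold for one task. */
static unsigned int task_llc_aggr_imb(const struct task_llc_prefs *p)
{
	return p->llc_aggr_imb ? p->llc_aggr_imb : sysctl_llc_aggr_imb;
}
```

A task that never sets its own value keeps the system-wide behavior, so the per-task knob adds no burden to workloads that don't care about it.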
>>>
>>> Yes, this is doable. It can be evaluated after the global generic strategy
>>> has been verified to work, like NUMA balancing :)
>>>
>>
>> I will run some real-world workloads and get back to you (may take some time)
>>
>
> Thanks. It seems that there are pros and cons for different workloads,
> and we are evaluating adding per-process RSS/active nr_running
> to deal with different types of workloads.
>
>>>>
>>>> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE?
>>>
>>> Do you mean the SCHED_CACHE_WAKE or SCHED_CACHE_LB?
>>>
>>
>> Ah, I was thinking sysctl_llc_aggr_imb alone could help reduce overstacking
>> on the target LLC caused by a few hyperactive wakees (we may consider
>> rate-limiting those wakees as a solution), but I just realized this affects
>> load balancing as well and doesn't really reduce the overhead from frequent
>> wakeups (no good idea off the top of my head, but we should find a better
>> solution than a sched_feat to address the overhead issue).
>>
btw, just for correction: I meant wakers here, not wakees
>>
>>
>>>> Does setting sysctl_llc_aggr_imb to 0 basically say no preference for either LLC, no?
>>>>
>>>
>>> My understanding is that, if sysctl_llc_aggr_imb is 0, task aggregation
>>> might still consider other aspects, like whether the target LLC's
>>> utilization has exceeded 50%.
>>>
>>
>> which can be controlled by sysctl_llc_aggr_cap, right? Okay, so if both LLCs
>> have <$(sysctl_llc_aggr_cap)% utilization, should sysctl_llc_aggr_cap be the
>> only determining factor here, barring NUMA balancing?
>>
>
> If both LLCs are under (sysctl_llc_aggr_cap)%, then the strategy is still to
> allow the task to be aggregated into its preferred LLC, either by preventing
> the task from being pulled out of its preferred LLC or by migrating the task
> to its preferred LLC, in _get_migrate_hint().
>
Ok, got it. It looks to me like sysctl_llc_aggr_imb and sysctl_llc_aggr_cap can
have quite an impact on performance. I will play around with different values a bit.
Libo
> Thanks,
> Chenyu