Message-ID: <9e78f54c-f993-4505-814d-b8815f5c6bf0@oracle.com>
Date: Tue, 8 Jul 2025 10:22:36 -0700
From: Libo Chen <libo.chen@...cle.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
Tim Chen <tim.c.chen@...el.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Abel Wu <wuyun.abel@...edance.com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>,
Hillf Danton <hdanton@...a.com>, Len Brown <len.brown@...el.com>,
linux-kernel@...r.kernel.org, Tim Chen <tim.c.chen@...ux.intel.com>,
Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>
Subject: Re: [RFC patch v3 07/20] sched: Add helper function to decide whether
to allow cache aware scheduling
On 7/8/25 01:29, Chen, Yu C wrote:
> On 7/8/2025 8:41 AM, Libo Chen wrote:
>> Hi Tim and Chenyu,
>>
>>
>> On 6/18/25 11:27, Tim Chen wrote:
>>> Cache-aware scheduling is designed to aggregate threads into their
>>> preferred LLC, either via the task wake up path or the load balancing
>>> path. One side effect is that when the preferred LLC is saturated,
>>> more threads will continue to be stacked on it, degrading the workload's
>>> latency. A strategy is needed to prevent this aggregation from going too
>>> far such that the preferred LLC is too overloaded.
>>>
>>> Introduce helper function _get_migrate_hint() to implement the LLC
>>> migration policy:
>>>
>>> 1) A task is aggregated into its preferred LLC if both the source
>>>    and destination LLCs are not too busy (<50% utilization, tunable),
>>>    or if the preferred LLC will not become too imbalanced relative
>>>    to the non-preferred LLC (>20% utilization, tunable, close to the
>>>    imbalance_pct of the LLC domain).
>>> 2) A task is allowed to move from the preferred LLC to the
>>>    non-preferred one if doing so will not leave the non-preferred
>>>    LLC so imbalanced relative to the preferred LLC that it prompts
>>>    an aggregation migration back later. We are still experimenting
>>>    with the aggregation and migration policy. Other possibilities
>>>    are policies based on the LLC's load or the average number of
>>>    running tasks; those could be tried out by tweaking
>>>    _get_migrate_hint().
>>>
>>> The function _get_migrate_hint() returns migration suggestions to its upper-level callers:
>>> +__read_mostly unsigned int sysctl_llc_aggr_cap = 50;
>>> +__read_mostly unsigned int sysctl_llc_aggr_imb = 20;
>>> +
>>
>>
>> I think this patch has great potential.
>>
>> Since _get_migrate_hint() is tied to an individual task anyway, why not add a
>> per-task llc_aggr_imb that defaults to the sysctl value? Tasks have different
>> preferences for LLC stacking, and they can all be running on the same system
>> at the same time. This way you can offer a greater degree of optimization
>> without much burden on others.
>
> Yes, this is doable. It can be evaluated after the global generic strategy
> has been verified to work, like NUMA balancing :)
>
I will run some real-world workloads and get back to you (this may take some time).
>>
>> Also with sysctl_llc_aggr_imb, do we really need SCHED_CACHE_WAKE?
>
> Do you mean the SCHED_CACHE_WAKE or SCHED_CACHE_LB?
>
Ah, I was thinking sysctl_llc_aggr_imb alone could help reduce overstacking on
the target LLC from a few hyperactive wakees (rate-limiting those wakees may be
one solution), but I just realized this affects load balancing as well and
doesn't really reduce the overhead from frequent wakeups (no good idea off the
top of my head, but we should find a better solution than a sched_feat to
address the overhead issue).
>> Does setting sysctl_llc_aggr_imb to 0 basically mean no preference for either LLC?
>>
>
> My understanding is that even if sysctl_llc_aggr_imb is 0, task
> aggregation might still consider other aspects, such as whether the
> target LLC's utilization has exceeded 50%.
>
which can be controlled by sysctl_llc_aggr_cap, right? Okay, so if both LLCs
have <sysctl_llc_aggr_cap% utilization, should sysctl_llc_aggr_cap be the only
determining factor here, barring NUMA balancing?
Libo
> thanks,
> Chenyu
>
>> Thanks,
>> Libo
>>
>>> +static enum llc_mig_hint _get_migrate_hint(int src_cpu, int dst_cpu,
>>> +					   unsigned long tsk_util,
>>> +					   bool to_pref)
>>> +{
>>> +	unsigned long src_util, dst_util, src_cap, dst_cap;
>>> +
>>> +	if (cpus_share_cache(src_cpu, dst_cpu))
>>> +		return mig_allow;
>>> +
>>> +	if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
>>> +	    !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
>>> +		return mig_allow;
>>> +
>>> +	if (!fits_llc_capacity(dst_util, dst_cap) &&
>>> +	    !fits_llc_capacity(src_util, src_cap))
>>> +		return mig_ignore;
>>> +
>>> +	src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
>>> +	dst_util = dst_util + tsk_util;
>>> +	if (to_pref) {
>>> +		/*
>>> +		 * sysctl_llc_aggr_imb is the imbalance allowed between
>>> +		 * the preferred and non-preferred LLCs.
>>> +		 * Don't migrate if this would leave the preferred LLC
>>> +		 * too heavily loaded and much busier than the source,
>>> +		 * in which case the migration would increase the
>>> +		 * imbalance too much.
>>> +		 */
>>> +		if (!fits_llc_capacity(dst_util, dst_cap) &&
>>> +		    util_greater(dst_util, src_util))
>>> +			return mig_forbid;
>>> +	} else {
>>> +		/*
>>> +		 * Don't migrate if this would leave the preferred LLC
>>> +		 * too idle, or would bring the non-preferred LLC within
>>> +		 * sysctl_llc_aggr_imb percent of the preferred LLC,
>>> +		 * prompting a migration back to the preferred LLC later.
>>> +		 */
>>> +		if (fits_llc_capacity(src_util, src_cap) ||
>>> +		    !util_greater(src_util, dst_util))
>>> +			return mig_forbid;
>>> +	}
>>> +	return mig_allow;
>>> +}
>>
>>