Message-ID: <d2f5e1b602be1b9df09df7bc9b2d3895203dabb6.camel@linux.intel.com>
Date: Thu, 22 Jan 2026 12:43:31 -0800
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Yangyu Chen <cyy@...self.name>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Vincent Guittot
<vincent.guittot@...aro.org>
Cc: Chen Yu <yu.c.chen@...el.com>, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, Madadi Vineeth
Reddy <vineethr@...ux.ibm.com>, Hillf Danton <hdanton@...a.com>, Shrikanth
Hegde <sshegde@...ux.ibm.com>, Jianyong Wu <jianyong.wu@...look.com>,
Tingyin Duan <tingyin.duan@...il.com>, Vern Hao <vernhao@...cent.com>,
Vern Hao <haoxing990@...il.com>, Len Brown <len.brown@...el.com>, Aubrey
Li <aubrey.li@...el.com>, Zhao Liu <zhao1.liu@...el.com>, Chen Yu
<yu.chen.surf@...il.com>, Adam Li <adamli@...amperecomputing.com>, Aaron Lu
<ziqianlu@...edance.com>, Tim Chen <tim.c.chen@...el.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 03/23] sched/cache: Introduce helper functions to
enforce LLC migration policy
On Fri, 2026-01-23 at 02:13 +0800, Yangyu Chen wrote:
>
> On 4/12/2025 07:07, Tim Chen wrote:
> > From: Chen Yu <yu.c.chen@...el.com>
> >
> > Cache-aware scheduling aggregates threads onto their preferred LLC,
> > mainly through load balancing. When the preferred LLC becomes
> > saturated, more threads are still placed there, increasing latency.
> > A mechanism is needed to limit aggregation so that the preferred LLC
> > does not become overloaded.
> >
> > Introduce helper functions can_migrate_llc() and
> > can_migrate_llc_task() to enforce the LLC migration policy:
> >
> > 1. Aggregate a task to its preferred LLC if both source and
> > destination LLCs are not too busy (<50% utilization),
>
> Hi Chen Yu and Tim Chen,
>
> I would like to ask why 50% was chosen as the LLC busy threshold. For
> example, a common AMD Zen 3-5 part has 8 cores per LLC. When these
> servers run with SMT turned off, an 8-thread process cannot be scheduled
> within one LLC. I think this could be 100% by default.
At 100% you would likely get over-aggregation and contention on the LLC by
pulling everything there. We tested some workloads whose hot working set
footprint is a significant fraction of the LLC and found that 100% caused a
significant performance degradation. 50% is what we found to be a reasonable
default value. This is a tunable, so the admin can change it if desired.
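
To make the thresholds concrete, here is a rough userspace sketch (my own
illustration, not part of the patch) of how the fits_llc_capacity() and
util_greater() checks from the patch behave at the default values, assuming
an 8-core LLC with SMT off and a per-CPU capacity of 1024:

#include <stdio.h>

static unsigned int llc_overload_pct = 50;
static unsigned int llc_imb_pct = 20;

/* Same checks as in the patch below */
#define fits_llc_capacity(util, max) \
	((util) * 100 < (max) * llc_overload_pct)
#define util_greater(util1, util2) \
	((util1) * 100 > (util2) * (100 + llc_imb_pct))

int main(void)
{
	unsigned long cap = 8 * 1024;	/* 8 cores, capacity 1024 each */

	/* 3 of 8 cores busy (~37%): below 50%, aggregation still allowed */
	printf("util=3072: fits=%d\n", fits_llc_capacity(3072UL, cap));
	/* 5 of 8 cores busy (~62%): above 50%, LLC treated as overloaded */
	printf("util=5120: fits=%d\n", fits_llc_capacity(5120UL, cap));
	/* dst more than 20% busier than src: imbalance check trips */
	printf("dst=5000 src=4000: greater=%d\n",
	       util_greater(5000UL, 4000UL));
	return 0;
}

This prints fits=1, fits=0, greater=1, i.e. at the 50% default an
8-thread CPU-bound process would indeed push such an LLC past the
threshold, which is one reason the value is exposed as a tunable rather
than hard-coded.
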
Tim
>
> Thanks,
> Yangyu Chen
>
> > or if doing so will not leave the preferred LLC much more
> > imbalanced than the non-preferred one (>20% utilization
> > difference, similar to imbalance_pct of the LLC domain).
> > 2. Allow moving a task from an overloaded preferred LLC to a
> > non-preferred LLC if this will not leave the non-preferred LLC so
> > imbalanced that it would trigger a later migration back.
> > 3. If both LLCs are too busy, let generic load balancing spread
> > the tasks.
> >
> > Further (hysteresis) action could be taken in the future to prevent tasks
> > from being migrated into and out of the preferred LLC frequently (back and
> > forth): the threshold for migrating a task out of its preferred LLC should
> > be higher than that for migrating it into the LLC.
> >
> > Since aggregation tends to make the preferred LLC busier than others,
> > the imbalance tolerance is controlled by llc_imb_pct. If set to 0,
> > tasks may still aggregate to the preferred LLC as long as it is
> > not more utilized than the source LLC, preserving the preference.
> >
> > Co-developed-by: Tim Chen <tim.c.chen@...ux.intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@...el.com>
> > Signed-off-by: Tim Chen <tim.c.chen@...ux.intel.com>
> > ---
> >
> > Notes:
> > v1->v2:
> > No change.
> >
> > kernel/sched/fair.c | 153 +++++++++++++++++++++++++++++++++++++++++++
> > kernel/sched/sched.h | 5 ++
> > 2 files changed, 158 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b9f336300f14..710ed9943d27 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1205,6 +1205,9 @@ static s64 update_se(struct rq *rq, struct sched_entity *se)
> > #define EPOCH_PERIOD (HZ / 100) /* 10 ms */
> > #define EPOCH_LLC_AFFINITY_TIMEOUT 5 /* 50 ms */
> >
> > +__read_mostly unsigned int llc_overload_pct = 50;
> > +__read_mostly unsigned int llc_imb_pct = 20;
> > +
> > static int llc_id(int cpu)
> > {
> > if (cpu < 0)
> > @@ -9623,6 +9626,27 @@ static inline int task_is_ineligible_on_dst_cpu(struct task_struct *p, int dest_
> > }
> >
> > #ifdef CONFIG_SCHED_CACHE
> > +/*
> > + * The margin used when comparing LLC utilization with CPU capacity.
> > + * Parameter llc_overload_pct determines the LLC load level where
> > + * active LLC aggregation is done.
> > + * Derived from fits_capacity().
> > + *
> > + * (default: ~50%)
> > + */
> > +#define fits_llc_capacity(util, max) \
> > + ((util) * 100 < (max) * llc_overload_pct)
> > +
> > +/*
> > + * The margin used when comparing utilization:
> > + * is 'util1' noticeably greater than 'util2'?
> > + * Derived from capacity_greater().
> > + * Bias is in percentage.
> > + */
> > +/* Allows dst util to be bigger than src util by up to bias percent */
> > +#define util_greater(util1, util2) \
> > + ((util1) * 100 > (util2) * (100 + llc_imb_pct))
> > +
> > /* Called from load balancing paths with rcu_read_lock held */
> > static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
> > unsigned long *cap)
> > @@ -9638,6 +9662,135 @@ static __maybe_unused bool get_llc_stats(int cpu, unsigned long *util,
> >
> > return true;
> > }
> > +
> > +/*
> > + * Decision matrix according to the LLC utilization, used to
> > + * decide whether task aggregation across LLCs can be done.
> > + *
> > + * By default, 50% is the threshold to treat the LLC as busy,
> > + * and 20% is the utilization imbalance percentage to decide
> > + * if the preferred LLC is busier than the non-preferred LLC.
> > + * The hysteresis is used to avoid task bouncing between the
> > + * preferred LLC and the non-preferred LLC.
> > + *
> > + * 1. moving towards the preferred LLC, dst is the preferred
> > + * LLC, src is not.
> > + *
> > + * src \ dst 30% 40% 50% 60%
> > + * 30% Y Y Y N
> > + * 40% Y Y Y Y
> > + * 50% Y Y G G
> > + * 60% Y Y G G
> > + *
> > + * 2. moving out of the preferred LLC, src is the preferred
> > + * LLC, dst is not:
> > + *
> > + * src \ dst 30% 40% 50% 60%
> > + * 30% N N N N
> > + * 40% N N N N
> > + * 50% N N G G
> > + * 60% Y N G G
> > + *
> > + * src : src_util
> > + * dst : dst_util
> > + * Y : Yes, migrate
> > + * N : No, do not migrate
> > + * G : let the Generic load balance to even the load.
> > + *
> > + * The intention is that if both LLCs are quite busy, cache aware
> > + * load balance should not be performed, and generic load balance
> > + * should take effect. However, if one is busy and the other is not,
> > + * the preferred LLC capacity (50%) and imbalance criteria (20%) should
> > + * be considered to determine whether LLC aggregation should be
> > + * performed to bias the load towards the preferred LLC.
> > + */
> > +
> > +/* migration decision, 3 states are orthogonal. */
> > +enum llc_mig {
> > + mig_forbid = 0, /* N: Don't migrate task, respect LLC preference */
> > + mig_llc, /* Y: Do LLC preference based migration */
> > + mig_unrestricted /* G: Don't restrict generic load balance migration */
> > +};
> > +
> > +/*
> > + * Check if task can be moved from the source LLC to the
> > + * destination LLC without breaking cache aware preference.
> > + * src_cpu and dst_cpu are arbitrary CPUs within the source
> > + * and destination LLCs, respectively.
> > + */
> > +static enum llc_mig can_migrate_llc(int src_cpu, int dst_cpu,
> > + unsigned long tsk_util,
> > + bool to_pref)
> > +{
> > + unsigned long src_util, dst_util, src_cap, dst_cap;
> > +
> > + if (!get_llc_stats(src_cpu, &src_util, &src_cap) ||
> > + !get_llc_stats(dst_cpu, &dst_util, &dst_cap))
> > + return mig_unrestricted;
> > +
> > + if (!fits_llc_capacity(dst_util, dst_cap) &&
> > + !fits_llc_capacity(src_util, src_cap))
> > + return mig_unrestricted;
> > +
> > + src_util = src_util < tsk_util ? 0 : src_util - tsk_util;
> > + dst_util = dst_util + tsk_util;
> > + if (to_pref) {
> > + /*
> > + * llc_imb_pct is the imbalance allowed between
> > + * preferred LLC and non-preferred LLC.
> > + * Don't migrate if it would leave the preferred LLC too
> > + * heavily loaded and the dst much busier than the
> > + * src, in which case migration would increase the
> > + * imbalance too much.
> > + */
> > + if (!fits_llc_capacity(dst_util, dst_cap) &&
> > + util_greater(dst_util, src_util))
> > + return mig_forbid;
> > + } else {
> > + /*
> > + * Don't migrate if we would leave the preferred LLC
> > + * too idle, or if this migration would bring the
> > + * non-preferred LLC within llc_imb_pct percent
> > + * of the preferred LLC, triggering a later migration
> > + * back to the preferred LLC.
> > + */
> > + if (fits_llc_capacity(src_util, src_cap) ||
> > + !util_greater(src_util, dst_util))
> > + return mig_forbid;
> > + }
> > + return mig_llc;
> > +}
> > +
> > +/*
> > + * Check if task p can migrate from source LLC to
> > + * destination LLC in terms of cache aware load balance.
> > + */
> > +static __maybe_unused enum llc_mig can_migrate_llc_task(int src_cpu, int dst_cpu,
> > + struct task_struct *p)
> > +{
> > + struct mm_struct *mm;
> > + bool to_pref;
> > + int cpu;
> > +
> > + mm = p->mm;
> > + if (!mm)
> > + return mig_unrestricted;
> > +
> > + cpu = mm->mm_sched_cpu;
> > + if (cpu < 0 || cpus_share_cache(src_cpu, dst_cpu))
> > + return mig_unrestricted;
> > +
> > + if (cpus_share_cache(dst_cpu, cpu))
> > + to_pref = true;
> > + else if (cpus_share_cache(src_cpu, cpu))
> > + to_pref = false;
> > + else
> > + return mig_unrestricted;
> > +
> > + return can_migrate_llc(src_cpu, dst_cpu,
> > + task_util(p), to_pref);
> > +}
> > +
> > #else
> > static inline bool get_llc_stats(int cpu, unsigned long *util,
> > unsigned long *cap)
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 84118b522f22..bf72c5bab506 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2828,6 +2828,11 @@ extern unsigned int sysctl_numa_balancing_scan_period_max;
> > extern unsigned int sysctl_numa_balancing_scan_size;
> > extern unsigned int sysctl_numa_balancing_hot_threshold;
> >
> > +#ifdef CONFIG_SCHED_CACHE
> > +extern unsigned int llc_overload_pct;
> > +extern unsigned int llc_imb_pct;
> > +#endif
> > +
> > #ifdef CONFIG_SCHED_HRTICK
> >
> > /*
>