Message-ID: <e2b79e4e-f964-4fb6-8d23-6b9d9aeb6980@amd.com>
Date: Thu, 29 May 2025 12:09:32 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Jianyong Wu <wujianyong@...on.cn>, mingo@...hat.com,
peterz@...radead.org, juri.lelli@...hat.com, vincent.guittot@...aro.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
jianyong.wu@...look.com
Subject: Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
On 5/28/2025 12:39 PM, Jianyong Wu wrote:
> The efficiency gains from co-locating communicating tasks within the same
> LLC are well-established. However, in multi-LLC NUMA systems, the load
> balancer unintentionally sabotages this optimization.
>
> Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
> wakes the server within its initial LLC (e.g., LLC_0). The load balancer
> subsequently migrates the client to a different LLC (e.g., LLC_1). When
> the client next wakes the server, it now steers the server’s placement
> to LLC_1 (the client’s new location). The server then migrates to LLC_1,
> but the load balancer may reallocate the client to another
> LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
> perpetually chase each other across all four LLCs — a sustained
> cross-LLC ping-pong within the NUMA node.
Migration only happens if the CPU is overloaded, right? I've only seen
this happen when noise like a kworker comes in. What exactly is
causing these migrations in your case, and is it actually that bad
for iperf?
>
> Our solution: Permit controlled load imbalance between LLCs on the same
> NUMA node, prioritizing communication affinity over strict balance.
>
> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
> seconds as tasks cycled through all four LLCs. With the patch, migrations
> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
> thrashing.
Is there any improvement in iperf numbers with these changes?
>
> Signed-off-by: Jianyong Wu <wujianyong@...on.cn>
> ---
> kernel/sched/fair.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fb9bf995a47..749210e6316b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> }
> #endif
>
> + /* Allow imbalance between LLCs within a single NUMA node */
> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
> + && env->sd->parent->flags & SD_NUMA) {
This does not imply multiple LLCs in the package. SD_SHARE_LLC is
SDF_SHARED_CHILD and will be set from the SMT domain onwards. This
condition will also be true on Intel with SNC enabled despite there not
being multiple LLCs, and llc_nr will be the number of cores there.
Perhaps multiple LLCs can be detected using:
(sd->child->flags ^ sd->flags) & SD_SHARE_LLC
> + int child_weight = env->sd->child->span_weight;
> + int llc_nr = env->sd->span_weight / child_weight;
> + int imb_nr, min;
> +
> + if (llc_nr > 1) {
> + /* Let the imbalance not be greater than half of child_weight */
> + min = child_weight >= 4 ? 2 : 1;
> + imb_nr = max_t(int, min, child_weight >> 2);
Isn't this just max_t(int, child_weight >> 2, 1)?
> + if (imb_nr >= env->imbalance)
> + env->imbalance = 0;
At this point, we are trying to even out the number of idle CPUs on the
destination and the busiest LLC. sched_balance_find_src_rq() will return
NULL if it doesn't find an overloaded rq. Is waiting behind a task
more beneficial than migrating to an idler LLC?
> + }
> + }
> +
> /* Number of tasks to move to restore balance */
> env->imbalance >>= 1;
>
--
Thanks and Regards,
Prateek