Message-ID: <20250528070949.723754-1-wujianyong@hygon.cn>
Date: Wed, 28 May 2025 07:09:49 +0000
From: Jianyong Wu <wujianyong@...on.cn>
To: <mingo@...hat.com>, <peterz@...radead.org>, <juri.lelli@...hat.com>,
<vincent.guittot@...aro.org>
CC: <dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
<mgorman@...e.de>, <vschneid@...hat.com>, <linux-kernel@...r.kernel.org>,
<wujianyong@...on.cn>, <jianyong.wu@...look.com>
Subject: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
The efficiency gains from co-locating communicating tasks within the same
LLC are well established. However, on multi-LLC NUMA systems the load
balancer can unintentionally defeat this optimization.
Consider the following pattern: on a NUMA node with 4 LLCs, an iperf3
client first wakes its server within their initial LLC (e.g. LLC_0). The
load balancer then migrates the client to a different LLC (e.g. LLC_1).
The next time the client wakes the server, the wakeup path places the
server in LLC_1, the client's new location. The load balancer may later
move the client to yet another LLC (e.g. LLC_2), and the cycle repeats:
the two tasks perpetually chase each other across all four LLCs, a
sustained cross-LLC ping-pong within the NUMA node. This is easy to
reproduce with a stock iperf3 pair on loopback, as sketched below.
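An illustrative loopback invocation (an assumption about the setup; the
exact one used here is not recorded in this message, and any client/server
pairing with cross-task wakeups should behave similarly):

	# terminal 1: start the iperf3 server
	iperf3 -s
	# terminal 2: run the client against loopback for 200 seconds
	iperf3 -c 127.0.0.1 -t 200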
Our solution: permit a small, controlled load imbalance between LLCs on
the same NUMA node, prioritizing communication affinity over strict
balance. The tolerated imbalance works out as follows.
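With child_weight CPUs per LLC, the patch below clears env->imbalance
whenever it is at most max(child_weight >= 4 ? 2 : 1, child_weight / 4),
a threshold that never exceeds half an LLC:

	child_weight =  2  ->  tolerated imbalance = 1  (half)
	child_weight =  4  ->  tolerated imbalance = 2  (half)
	child_weight =  8  ->  tolerated imbalance = 2  (a quarter)
	child_weight = 16  ->  tolerated imbalance = 4  (a quarter)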
Impact: in a virtual machine with one socket and multiple NUMA nodes
(4 LLCs each), the unpatched kernel suffered 3,000+ LLC migrations in
200 seconds as the two tasks cycled through all four LLCs. With the
patch, migrations stabilize at 10 or fewer, largely suppressing the
NUMA-local LLC thrashing.
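One way to approximate such a count (an assumption about methodology,
not a record of how the numbers above were gathered) is the
sched:sched_migrate_task tracepoint; note it counts all task migrations
system-wide, not only cross-LLC ones:

	perf stat -e sched:sched_migrate_task -a -- sleep 200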
Signed-off-by: Jianyong Wu <wujianyong@...on.cn>
---
kernel/sched/fair.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..749210e6316b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	}
 #endif
 
+	/* Allow imbalance between LLCs within a single NUMA node */
+	if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC &&
+	    env->sd->parent && env->sd->parent->flags & SD_NUMA) {
+		int child_weight = env->sd->child->span_weight;
+		int llc_nr = env->sd->span_weight / child_weight;
+		int imb_nr, min;
+
+		if (llc_nr > 1) {
+			/* Cap the tolerated imbalance at half of child_weight */
+			min = child_weight >= 4 ? 2 : 1;
+			imb_nr = max_t(int, min, child_weight >> 2);
+			if (imb_nr >= env->imbalance)
+				env->imbalance = 0;
+		}
+	}
+
 	/* Number of tasks to move to restore balance */
 	env->imbalance >>= 1;
--
2.43.0