Open Source and information security mailing list archives
Message-ID: <e2b79e4e-f964-4fb6-8d23-6b9d9aeb6980@amd.com>
Date: Thu, 29 May 2025 12:09:32 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Jianyong Wu <wujianyong@...on.cn>, mingo@...hat.com,
 peterz@...radead.org, juri.lelli@...hat.com, vincent.guittot@...aro.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
 jianyong.wu@...look.com
Subject: Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA

On 5/28/2025 12:39 PM, Jianyong Wu wrote:
> The efficiency gains from co-locating communicating tasks within the same
> LLC are well-established. However, in multi-LLC NUMA systems, the load
> balancer unintentionally sabotages this optimization.
> 
> Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
> wakes the server within its initial LLC (e.g., LLC_0). The load balancer
> subsequently migrates the client to a different LLC (e.g., LLC_1). When
> the client next wakes the server, the wakeup path now steers the server’s
> placement toward LLC_1 (the client’s new location). The server then migrates to LLC_1,
> but the load balancer may reallocate the client to another
> LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
> perpetually chase each other across all four LLCs — a sustained
> cross-LLC ping-pong within the NUMA node.

Migrations should only happen if the CPU is overloaded, right? I've only
seen this when noise such as a kworker comes in. What exactly is
causing these migrations in your case, and is it actually that bad
for iperf?

> 
> Our solution: Permit controlled load imbalance between LLCs on the same
> NUMA node, prioritizing communication affinity over strict balance.
> 
> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
> seconds as tasks cycled through all four LLCs. With the patch, migrations
> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
> thrashing.

Is there any improvement in iperf numbers with these changes?

> 
> Signed-off-by: Jianyong Wu <wujianyong@...on.cn>
> ---
>   kernel/sched/fair.c | 16 ++++++++++++++++
>   1 file changed, 16 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fb9bf995a47..749210e6316b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>   		}
>   #endif
>   
> +		/* Allow imbalance between LLCs within a single NUMA node */
> +		if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
> +				&& env->sd->parent->flags & SD_NUMA) {

This does not imply multiple LLCs in the package. SD_SHARE_LLC is
SDF_SHARED_CHILD and will be set from the SMT domain onwards. This condition
will be true on Intel with SNC enabled despite there not being multiple
LLCs, and llc_nr will be the number of cores there.

Perhaps multiple LLCs can be detected using:

     !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)

> +			int child_weight = env->sd->child->span_weight;
> +			int llc_nr = env->sd->span_weight / child_weight;
> +			int imb_nr, min;
> +
> +			if (llc_nr > 1) {
> +				/* Let the imbalance not be greater than half of child_weight */
> +				min = child_weight >= 4 ? 2 : 1;
> +				imb_nr = max_t(int, min, child_weight >> 2);

Isn't this just max_t(int, child_weight >> 2, 1)?

> +				if (imb_nr >= env->imbalance)
> +					env->imbalance = 0;

At this point, we are trying to even out the number of idle CPUs on the
destination and the busiest LLC. sched_balance_find_src_rq() will return
NULL if it doesn't find an overloaded rq. Is waiting behind a task
more beneficial than migrating to an idler LLC?

> +			}
> +		}
> +
>   		/* Number of tasks to move to restore balance */
>   		env->imbalance >>= 1;
>   

-- 
Thanks and Regards,
Prateek

