Message-ID:
<SI2PR04MB49310190973DC859BBE05DE2E366A@SI2PR04MB4931.apcprd04.prod.outlook.com>
Date: Thu, 29 May 2025 18:32:01 +0800
From: Jianyong Wu <jianyong.wu@...look.com>
To: K Prateek Nayak <kprateek.nayak@....com>,
Jianyong Wu <wujianyong@...on.cn>, mingo@...hat.com, peterz@...radead.org,
juri.lelli@...hat.com, vincent.guittot@...aro.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA

Hello K Prateek Nayak, thanks for the reply.
On 5/29/2025 2:39 PM, K Prateek Nayak wrote:
> On 5/28/2025 12:39 PM, Jianyong Wu wrote:
>> The efficiency gains from co-locating communicating tasks within the same
>> LLC are well-established. However, in multi-LLC NUMA systems, the load
>> balancer unintentionally sabotages this optimization.
>>
>> Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
>> wakes the server within its initial LLC (e.g., LLC_0). The load balancer
>> subsequently migrates the client to a different LLC (e.g., LLC_1). When
>> the client next wakes the server, it now targets the server’s placement
>> to LLC_1 (the client’s new location). The server then migrates to LLC_1,
>> but the load balancer may reallocate the client to another
>> LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
>> perpetually chase each other across all four LLCs — a sustained
>> cross-LLC ping-pong within the NUMA node.
>
> Migration only happens if the CPU is overloaded right?
This will happen even when 2 tasks are confined to a cpuset of 16 CPUs
that share an LLC. I don't think the CPUs are overloaded in that case.
> I've only seen
> this happen when a noise like kworker comes in. What exactly is
> causing these migrations in your case and is it actually that bad
> for iperf?
I think it's the nohz idle balance that pulls these 2 iperf tasks apart.
But the root cause is that load balancing doesn't permit even a slight
imbalance among LLCs.
Exactly. It's easy to reproduce on multi-LLC NUMA systems like some AMD
servers.
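For reference, the kind of setup I mean is roughly the following (an
illustrative sketch only; the CPU range and exact commands are assumptions,
not the precise configuration used):

  # pin both ends of the pair to one NUMA node that spans several LLCs
  taskset -c 0-63 iperf3 -s -D
  taskset -c 0-63 iperf3 -c 127.0.0.1 -t 200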
>
>>
>> Our solution: Permit controlled load imbalance between LLCs on the same
>> NUMA node, prioritizing communication affinity over strict balance.
>>
>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>> seconds as tasks cycled through all four LLCs. With the patch, migrations
>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>> thrashing.
>
> Is there any improvement in iperf numbers with these changes?
>
I observed a slight improvement with this patch in my tests.
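One way to compare migration counts like the ones in the commit message is
to count the sched:sched_migrate_task tracepoint for the two tasks, roughly
(a sketch only, assuming perf with tracepoint support; pid handling is
simplified):

  perf stat -e sched:sched_migrate_task -p $(pgrep -d, iperf3) -- sleep 200

Cross-LLC moves can then be picked out by recording the same event and
comparing orig_cpu/dest_cpu in the perf script output against the LLC
topology.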
>>
>> Signed-off-by: Jianyong Wu <wujianyong@...on.cn>
>> ---
>> kernel/sched/fair.c | 16 ++++++++++++++++
>> 1 file changed, 16 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0fb9bf995a47..749210e6316b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>   }
>> #endif
>> + /* Allow imbalance between LLCs within a single NUMA node */
>> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
>> +     && env->sd->parent->flags & SD_NUMA) {
>
> This does not imply multiple LLC in package. SD_SHARE_LLC is
> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
> will be true on Intel with SNC enabled despite not having multiple LLC
> and llc_nr will be number of cores there.
>
> Perhaps multiple LLCs can be detected using:
>
> !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
Great! Thanks!
>
>> + int child_weight = env->sd->child->span_weight;
>> + int llc_nr = env->sd->span_weight / child_weight;
>> + int imb_nr, min;
>> +
>> + if (llc_nr > 1) {
>> + /* Let the imbalance not be greater than half of child_weight */
>> + min = child_weight >= 4 ? 2 : 1;
>> + imb_nr = max_t(int, min, child_weight >> 2);
>
> Isn't this just max_t(int, child_weight >> 2, 1)?
I expect imb_nr to be 2 when child_weight is 4, as the LLC size starts
at 4 CPUs in the multi-LLC NUMA systems I have seen. However, this may
overload those LLCs a bit; I'm not sure whether it's a good idea.
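To spell out the difference, here is a throwaway user-space snippet
(illustration only, not kernel code; max_t is mimicked with a plain
comparison):

#include <stdio.h>

/* imb_nr as in the posted patch */
static int imb_posted(int child_weight)
{
	int min = child_weight >= 4 ? 2 : 1;
	int shifted = child_weight >> 2;

	return shifted > min ? shifted : min;
}

/* imb_nr with the suggested max_t(int, child_weight >> 2, 1) */
static int imb_suggested(int child_weight)
{
	int shifted = child_weight >> 2;

	return shifted > 1 ? shifted : 1;
}

int main(void)
{
	int w;

	for (w = 2; w <= 16; w *= 2)
		printf("child_weight %2d: posted %d, suggested %d\n",
		       w, imb_posted(w), imb_suggested(w));
	return 0;
}

The two only differ for LLCs of 4 to 7 CPUs, where the posted version
allows an imbalance of 2 instead of 1.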
>
>> + if (imb_nr >= env->imbalance)
>> + env->imbalance = 0;
>
> At this point, we are trying to even out the number of idle CPUs on the
> destination and the busiest LLC. sched_balance_find_src_rq() will return
> NULL if it doesn't find an overloaded rq. Is waiting behind a task
> more beneficial than migrating to an idler LLC?
>
It seems that a small imbalance won't hurt so much that tasks end up
waiting to be scheduled, because we limit the imbalance to at most half,
and in most cases a quarter, of the LLC weight. Tolerating that imbalance
reduces the frequency of task migration and load balancing, which seems
better than enforcing a strict balance rule.
We have already done similar things between NUMA nodes, so maybe it's
reasonable to extend that to LLCs as well.
Thanks
Jianyong Wu
>> + }
>> + }
>> +
>> /* Number of tasks to move to restore balance */
>> env->imbalance >>= 1;
>