Message-ID: <db88ce98-cc24-4697-a744-01c478b7f5c8@amd.com>
Date: Fri, 30 May 2025 11:39:12 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Jianyong Wu <jianyong.wu@...look.com>, Jianyong Wu <wujianyong@...on.cn>,
 mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
 vincent.guittot@...aro.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA

Hello Jianyong,

On 5/29/2025 4:02 PM, Jianyong Wu wrote:
> 
> This will happen even when 2 tasks are located in a cpuset of 16 CPUs that share an LLC. I don't think it's overloaded in this case.

But if they are located on 2 different CPUs, sched_balance_find_src_rq()
should not return any CPU, right? It is probably just a timing thing,
with some system noise causing the CPU running the server / client to be
temporarily overloaded.

> 
>> I've only seen
>> this happen when noise like a kworker comes in. What exactly is
>> causing these migrations in your case, and is it actually that bad
>> for iperf?
> 
> I think it's the nohz idle balance that pulls these 2 iperf tasks apart. But the root cause is that load balancing doesn't permit even a slight imbalance among LLCs.
> 
> Exactly. It's easy to reproduce on multi-LLC NUMA systems like some AMD servers.
> 
>>
>>>
>>> Our solution: Permit controlled load imbalance between LLCs on the same
>>> NUMA node, prioritizing communication affinity over strict balance.
>>>
>>> Impact: In a virtual machine with one socket and multiple NUMA nodes
>>> (each with 4 LLCs), the unpatched kernel suffered 3,000+ LLC migrations
>>> in 200 seconds as tasks cycled through all four LLCs. With the patch,
>>> migrations stabilize at ≤10 instances, largely suppressing the
>>> NUMA-local LLC thrashing.
>>
>> Is there any improvement in iperf numbers with these changes?
>>
> I observe a bit of improvement with this patch in my test.

I'll also give this series a spin on my end to see if it helps.

> 
>>>
>>> Signed-off-by: Jianyong Wu <wujianyong@...on.cn>
>>> ---
>>>   kernel/sched/fair.c | 16 ++++++++++++++++
>>>   1 file changed, 16 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 0fb9bf995a47..749210e6316b 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>>           }
>>>   #endif
>>> +        /* Allow imbalance between LLCs within a single NUMA node */
>>> +        if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
>>> +                && env->sd->parent->flags & SD_NUMA) {
>>
>> This does not imply multiple LLC in package. SD_SHARE_LLC is
>> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
>> will be true on Intel with SNC enabled despite not having multiple LLC
>> and llc_nr will be number of cores there.
>>
>> Perhaps multiple LLCs can be detected using:
>>
>>      !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)

This should have been just

     (sd->child->flags ^ sd->flags) & SD_SHARE_LLC

to find the LLC boundary. Not sure why I prefixed that "!". You also
have to ensure sd itself is not a NUMA domain, which is possible on
EPYC platforms with the "L3 as NUMA" option and on Intel with SNC.
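
Something like the below, perhaps (untested sketch just to spell out
the idea; reuses the variable names from your patch):

    /*
     * Only at the LLC boundary (SD_SHARE_LLC set on the child but
     * not on sd itself) and only if sd is not a NUMA domain.
     */
    if (env->sd->child &&
        ((env->sd->child->flags ^ env->sd->flags) & SD_SHARE_LLC) &&
        !(env->sd->flags & SD_NUMA)) {
            ...
    }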

> 
> Great! Thanks!
>>> +            int child_weight = env->sd->child->span_weight;
>>> +            int llc_nr = env->sd->span_weight / child_weight;
>>> +            int imb_nr, min;
>>> +
>>> +            if (llc_nr > 1) {
>>> +                /* Let the imbalance not be greater than half of child_weight */
>>> +                min = child_weight >= 4 ? 2 : 1;
>>> +                imb_nr = max_t(int, min, child_weight >> 2);
>>
>> Isn't this just max_t(int, child_weight >> 2, 1)?
> 
> I expect imb_nr to be 2 when child_weight is 4, as I observe that the number of CPUs per LLC starts at 4 on multi-LLC NUMA systems.
> However, this may overload the LLCs a bit. I'm not sure if it's a good idea.

My bad. I misread the ">> 2" as "/ 2" here. A couple of
brain-stopped-working moments.
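
FWIW, spelling out the values from the snippet above (assuming
child_weight is the number of CPUs in the LLC domain):

    child_weight = 2:  min = 1, child_weight >> 2 = 0  ->  imb_nr = 1
    child_weight = 4:  min = 2, child_weight >> 2 = 1  ->  imb_nr = 2
    child_weight = 8:  min = 2, child_weight >> 2 = 2  ->  imb_nr = 2
    child_weight = 16: min = 2, child_weight >> 2 = 4  ->  imb_nr = 4

So max_t(int, child_weight >> 2, 1) would only differ for child_weight
of 4 to 7, where your version allows an imbalance of 2 instead of 1.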

-- 
Thanks and Regards,
Prateek

