Message-ID:
<SI2PR04MB4931A23ABF08616FD8A133D6E370A@SI2PR04MB4931.apcprd04.prod.outlook.com>
Date: Mon, 16 Jun 2025 10:22:04 +0800
From: Jianyong Wu <jianyong.wu@...look.com>
To: K Prateek Nayak <kprateek.nayak@....com>,
Jianyong Wu <wujianyong@...on.cn>, mingo@...hat.com, peterz@...radead.org,
juri.lelli@...hat.com, vincent.guittot@...aro.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
Hi Prateek,
On 5/30/2025 2:09 PM, K Prateek Nayak wrote:
> Hello Jianyong,
>
> On 5/29/2025 4:02 PM, Jianyong Wu wrote:
>>
>> This will happen even when 2 tasks are located in a cpuset of 16 CPUs
>> that share an LLC. I don't think it's overloaded in this case.
>
> But if they are located on 2 different CPUs, sched_balance_find_src_rq()
> should not return any CPU right? Probably just a timing thing with some
> system noise that causes the CPU running the server / client to be
> temporarily overloaded.
>
>>
>>> I've only seen this happen when a noise like kworker comes in. What
>>> exactly is causing these migrations in your case and is it actually
>>> that bad for iperf?
>>
>> I think it's the nohz idle balance that pulls these 2 iperf tasks
>> apart. But the root cause is that load balancing doesn't permit even
>> a slight imbalance among LLCs.
>>
>> Exactly. It's easy to reproduce on multi-LLC NUMA systems like some
>> AMD servers.
>>
>>>
>>>>
>>>> Our solution: Permit controlled load imbalance between LLCs on the same
>>>> NUMA node, prioritizing communication affinity over strict balance.
>>>>
>>>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>>>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>>>> seconds as tasks cycled through all four LLCs. With the patch,
>>>> migrations
>>>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>>>> thrashing.
>>>
>>> Is there any improvement in iperf numbers with these changes?
>>>
>> I observe a bit of improvement with this patch in my test.
>
> I'll also give this series a spin on my end to see if it helps.
Would you mind letting me know if you've had a chance to try it out, or
if there's any update on the progress?
Thanks,
Jianyong
>>
>>>>
>>>> Signed-off-by: Jianyong Wu <wujianyong@...on.cn>
>>>> ---
>>>> kernel/sched/fair.c | 16 ++++++++++++++++
>>>> 1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 0fb9bf995a47..749210e6316b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -11203,6 +11203,22 @@ static inline void
>>>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>>> }
>>>> #endif
>>>> + /* Allow imbalance between LLCs within a single NUMA node */
>>>> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC
>>>> + && env->sd->parent && env->sd->parent->flags & SD_NUMA) {
>>>
>>> This does not imply multiple LLC in package. SD_SHARE_LLC is
>>> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
>>> will be true on Intel with SNC enabled despite not having multiple LLC
>>> and llc_nr will be number of cores there.
>>>
>>> Perhaps multiple LLCs can be detected using:
>>>
>>> !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
>
> This should have been just
>
> (sd->child->flags ^ sd->flags) & SD_SHARE_LLC
>
> to find the LLC boundary. Not sure why I prefixed that "!". You also
> have to ensure sd itself is not a NUMA domain which is possible with L3
> as NUMA option EPYC platforms and Intel with SNC.
>
>>
>> Great! Thanks!
>>>> + int child_weight = env->sd->child->span_weight;
>>>> + int llc_nr = env->sd->span_weight / child_weight;
>>>> + int imb_nr, min;
>>>> +
>>>> + if (llc_nr > 1) {
>>>> + /* Let the imbalance not be greater than half of child_weight */
>>>> + min = child_weight >= 4 ? 2 : 1;
>>>> + imb_nr = max_t(int, min, child_weight >> 2);
>>>
>>> Isn't this just max_t(int, child_weight >> 2, 1)?
>>
>> I expect imb_nr to be 2 when child_weight is 4, as I observe that the
>> CPU count per LLC starts at 4 on multi-LLC NUMA systems.
>> However, this may overload the LLCs a bit. I'm not sure if it's a
>> good idea.
>
> My bad. I interpreted ">> 2" as "/ 2" here. A couple of
> brain-stopped-working moments.
>