Message-ID:
<SI2PR04MB49315B9161865FB1D319EEDAE361A@SI2PR04MB4931.apcprd04.prod.outlook.com>
Date: Fri, 30 May 2025 15:36:43 +0800
From: Jianyong Wu <jianyong.wu@...look.com>
To: K Prateek Nayak <kprateek.nayak@....com>,
Jianyong Wu <wujianyong@...on.cn>, mingo@...hat.com, peterz@...radead.org,
juri.lelli@...hat.com, vincent.guittot@...aro.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
Hi Prateek,
On 5/30/2025 2:09 PM, K Prateek Nayak wrote:
> Hello Jianyong,
>
> On 5/29/2025 4:02 PM, Jianyong Wu wrote:
>>
>> This will happen even when 2 tasks are located in a cpuset of 16 cpus
>> that share an LLC. I don't think it's overloaded in this case.
>
> But if they are located on 2 different CPUs, sched_balance_find_src_rq()
> should not return any CPU right? Probably just a timing thing with some
> system noise that causes the CPU running the server / client to be
> temporarily overloaded.
>
I think it will.
Consider this scenario: there are 2 LLCs, each associated with 4 cpus,
under a single NUMA node. Sometimes one LLC has 4 tasks, each running
on a separate cpu, while the other LLC has no task running. Should the
load balancer take action to balance the workload? Absolutely yes.
The tricky point is that the balance action will fail during the first
few attempts. Meanwhile, sd->nr_balance_failed increments until it
exceeds the threshold sd->cache_nice_tries + 2. At that point, active
balancing is triggered and the migration thread steps in to migrate a
running task. That's exactly what I've observed.
>>
>> I've only seen
>>> this happen when a noise like kworker comes in. What exactly is
>>> causing these migrations in your case and is it actually that bad
>>> for iperf?
>>
>> I think it's the nohz idle balance that pulls these 2 iperf tasks
>> apart. But the root cause is that load balancing doesn't permit even
>> a slight imbalance among LLCs.
>>
>> Exactly. It's easy to reproduce on multi-LLC NUMA systems like
>> some AMD servers.
>>
>>>
>>>>
>>>> Our solution: Permit controlled load imbalance between LLCs on the same
>>>> NUMA node, prioritizing communication affinity over strict balance.
>>>>
>>>> Impact: In a virtual machine with one socket and multiple NUMA nodes
>>>> (each with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations
>>>> in 200 seconds as tasks cycled through all four LLCs. With the patch,
>>>> migrations stabilize at ≤10 instances, largely suppressing the
>>>> NUMA-local LLC thrashing.
>>>
>>> Is there any improvement in iperf numbers with these changes?
>>>
>> I observe a bit of improvement with this patch in my test.
>
> I'll also give this series a spin on my end to see if it helps.
>
Great! Let me know how it goes on your end.
>>
>>>>
>>>> Signed-off-by: Jianyong Wu <wujianyong@...on.cn>
>>>> ---
>>>> kernel/sched/fair.c | 16 ++++++++++++++++
>>>> 1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 0fb9bf995a47..749210e6316b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>>> }
>>>> #endif
>>>> + /* Allow imbalance between LLCs within a single NUMA node */
>>>> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC
>>>> && env->sd->parent
>>>> + && env->sd->parent->flags & SD_NUMA) {
>>>
>>> This does not imply multiple LLCs in a package. SD_SHARE_LLC is
>>> SDF_SHARED_CHILD and will be set from the SMT domain onwards. This
>>> condition will be true on Intel with SNC enabled despite not having
>>> multiple LLCs, and llc_nr will be the number of cores there.
>>>
>>> Perhaps multiple LLCs can be detected using:
>>>
>>> !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
>
> This should have been just
>
> (sd->child->flags ^ sd->flags) & SD_SHARE_LLC
>
> to find the LLC boundary. Not sure why I prefixed that "!". You also
> have to ensure sd itself is not a NUMA domain which is possible with L3
> as NUMA option EPYC platforms and Intel with SNC.
>
Thanks again, I made a mistake too.
Thanks
Jianyong
>>
>> Great! Thanks!
>>>> + int child_weight = env->sd->child->span_weight;
>>>> + int llc_nr = env->sd->span_weight / child_weight;
>>>> + int imb_nr, min;
>>>> +
>>>> + if (llc_nr > 1) {
>>>> + /* Let the imbalance not be greater than half of child_weight */
>>>> + min = child_weight >= 4 ? 2 : 1;
>>>> + imb_nr = max_t(int, min, child_weight >> 2);
>>>
>>> Isn't this just max_t(int, child_weight >> 2, 1)?
>>
>> I expect imb_nr to be 2 when child_weight is 4, as I observe that
>> LLC cpu counts start at 4 on multi-LLC NUMA systems. However, this
>> may leave the LLCs a bit overloaded. I'm not sure if it's a good
>> idea.
>
> My bad. I interpreted ">> 2" as "/ 2" here. Couple of brain stopped
> working moments.
>