Message-ID:
<SI2PR04MB49310190973DC859BBE05DE2E366A@SI2PR04MB4931.apcprd04.prod.outlook.com>
Date: Thu, 29 May 2025 18:32:01 +0800
From: Jianyong Wu <jianyong.wu@...look.com>
To: K Prateek Nayak <kprateek.nayak@....com>,
Jianyong Wu <wujianyong@...on.cn>, mingo@...hat.com, peterz@...radead.org,
juri.lelli@...hat.com, vincent.guittot@...aro.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA

Hello K Prateek Nayak, thanks for the reply.
On 5/29/2025 2:39 PM, K Prateek Nayak wrote:
> On 5/28/2025 12:39 PM, Jianyong Wu wrote:
>> The efficiency gains from co-locating communicating tasks within the same
>> LLC are well-established. However, in multi-LLC NUMA systems, the load
>> balancer unintentionally sabotages this optimization.
>>
>> Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
>> wakes the server within its initial LLC (e.g., LLC_0). The load balancer
>> subsequently migrates the client to a different LLC (e.g., LLC_1). When
>> the client next wakes the server, it now targets the server’s placement
>> to LLC_1 (the client’s new location). The server then migrates to LLC_1,
>> but the load balancer may reallocate the client to another
>> LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
>> perpetually chase each other across all four LLCs — a sustained
>> cross-LLC ping-pong within the NUMA node.
>
> Migration only happens if the CPU is overloaded right?
This will happen even when 2 tasks are confined to a cpuset of 16 CPUs
that share an LLC. I don't think the CPUs are overloaded in that case.
> I've only seen
> this happen when a noise like kworker comes in. What exactly is
> causing these migrations in your case and is it actually that bad
> for iperf?
I think it's the nohz idle balance that pulls these 2 iperf tasks apart.
But the root cause is that load balancing doesn't permit even a slight
imbalance among LLCs.
Exactly. It's easy to reproduce on multi-LLC NUMA systems like some AMD
servers.
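For reference, the kind of setup I mean is roughly the following (an
illustrative sketch only; the CPU range and exact commands are assumptions,
not the precise configuration used):

  # pin both ends of the pair to one NUMA node that spans several LLCs
  taskset -c 0-63 iperf3 -s -D
  taskset -c 0-63 iperf3 -c 127.0.0.1 -t 200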
>
>>
>> Our solution: Permit controlled load imbalance between LLCs on the same
>> NUMA node, prioritizing communication affinity over strict balance.
>>
>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>> seconds as tasks cycled through all four LLCs. With the patch, migrations
>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>> thrashing.
>
> Is there any improvement in iperf numbers with these changes?
>
I observed a slight improvement with this patch in my tests.
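One way to compare migration counts like the ones in the commit message is
to count the sched:sched_migrate_task tracepoint for the two tasks, roughly
(a sketch only, assuming perf with tracepoint support; pid handling is
simplified):

  perf stat -e sched:sched_migrate_task -p $(pgrep -d, iperf3) -- sleep 200

Cross-LLC moves can then be picked out by recording the same event and
comparing orig_cpu/dest_cpu in the perf script output against the LLC
topology.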
>>
>> Signed-off-by: Jianyong Wu <wujianyong@...on.cn>
>> ---
>> kernel/sched/fair.c | 16 ++++++++++++++++
>> 1 file changed, 16 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0fb9bf995a47..749210e6316b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>   }
>> #endif
>> + /* Allow imbalance between LLCs within a single NUMA node */
>> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
>> +     && env->sd->parent->flags & SD_NUMA) {
>
> This does not imply multiple LLC in package. SD_SHARE_LLC is
> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
> will be true on Intel with SNC enabled despite not having multiple LLC
> and llc_nr will be number of cores there.
>
> Perhaps multiple LLCs can be detected using:
>
> !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
Great! Thanks!
>
>> + int child_weight = env->sd->child->span_weight;
>> + int llc_nr = env->sd->span_weight / child_weight;
>> + int imb_nr, min;
>> +
>> + if (llc_nr > 1) {
>> + /* Let the imbalance not be greater than half of child_weight */
>> + min = child_weight >= 4 ? 2 : 1;
>> + imb_nr = max_t(int, min, child_weight >> 2);
>
> Isn't this just max_t(int, child_weight >> 2, 1)?
I expect imb_nr to be 2 when child_weight is 4, as the LLC size starts
at 4 CPUs in the multi-LLC NUMA systems I have seen. However, this may
overload those LLCs a bit; I'm not sure whether it's a good idea.
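To spell out the difference, here is a throwaway user-space snippet
(illustration only, not kernel code; max_t is mimicked with a plain
comparison):

#include <stdio.h>

/* imb_nr as in the posted patch */
static int imb_posted(int child_weight)
{
	int min = child_weight >= 4 ? 2 : 1;
	int shifted = child_weight >> 2;

	return shifted > min ? shifted : min;
}

/* imb_nr with the suggested max_t(int, child_weight >> 2, 1) */
static int imb_suggested(int child_weight)
{
	int shifted = child_weight >> 2;

	return shifted > 1 ? shifted : 1;
}

int main(void)
{
	int w;

	for (w = 2; w <= 16; w *= 2)
		printf("child_weight %2d: posted %d, suggested %d\n",
		       w, imb_posted(w), imb_suggested(w));
	return 0;
}

The two only differ for LLCs of 4 to 7 CPUs, where the posted version
allows an imbalance of 2 instead of 1.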
>
>> + if (imb_nr >= env->imbalance)
>> + env->imbalance = 0;
>
> At this point, we are trying to even out the number of idle CPUs on the
> destination and the busiest LLC. sched_balance_find_src_rq() will return
> NULL if it doesn't find an overloaded rq. Is waiting behind a task
> more beneficial than migrating to an idler LLC?
>
It seems that a small imbalance won't hurt so much that tasks end up
waiting to be scheduled, because we limit the imbalance to at most half,
and in most cases a quarter, of the LLC weight. Tolerating that imbalance
reduces the frequency of task migration and load balancing, which seems
better than enforcing a strict balance rule.
We have already done similar things between NUMA nodes, so maybe it's
reasonable to extend that to LLCs as well.
Thanks
Jianyong Wu
>> + }
>> + }
>> +
>> /* Number of tasks to move to restore balance */
>> env->imbalance >>= 1;
>