Message-ID:
<SI2PR04MB4931A23ABF08616FD8A133D6E370A@SI2PR04MB4931.apcprd04.prod.outlook.com>
Date: Mon, 16 Jun 2025 10:22:04 +0800
From: Jianyong Wu <jianyong.wu@...look.com>
To: K Prateek Nayak <kprateek.nayak@....com>,
Jianyong Wu <wujianyong@...on.cn>, mingo@...hat.com, peterz@...radead.org,
juri.lelli@...hat.com, vincent.guittot@...aro.org
Cc: dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/fair: allow imbalance between LLCs under NUMA
Hi Prateek,
On 5/30/2025 2:09 PM, K Prateek Nayak wrote:
> Hello Jianyong,
>
> On 5/29/2025 4:02 PM, Jianyong Wu wrote:
>>
>> This will happen even when 2 tasks are located in a cpuset of 16 CPUs
>> that share an LLC. I don't think it's overloaded in this case.
>
> But if they are located on 2 different CPUs, sched_balance_find_src_rq()
> should not return any CPU right? Probably just a timing thing with some
> system noise that causes the CPU running the server / client to be
> temporarily overloaded.
>
>>
>>> I've only seen this happen when a noise like kworker comes in. What
>>> exactly is causing these migrations in your case and is it actually
>>> that bad for iperf?
>>
>> I think it's the nohz idle balance that pulls these 2 iperf tasks
>> apart. But the root cause is that load balancing doesn't permit even
>> a slight imbalance among LLCs.
>>
>> Exactly. It's easy to reproduce on multi-LLC NUMA systems like some
>> AMD servers.
>>
>>>
>>>>
>>>> Our solution: Permit controlled load imbalance between LLCs on the same
>>>> NUMA node, prioritizing communication affinity over strict balance.
>>>>
>>>> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
>>>> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
>>>> seconds as tasks cycled through all four LLCs. With the patch,
>>>> migrations
>>>> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
>>>> thrashing.
>>>
>>> Is there any improvement in iperf numbers with these changes?
>>>
>> I observe a bit of improvement with this patch in my test.
>
> I'll also give this series a spin on my end to see if it helps.
Would you mind letting me know if you've had a chance to try it out, or
if there's any update on the progress?
Thanks,
Jianyong
>>
>>>>
>>>> Signed-off-by: Jianyong Wu <wujianyong@...on.cn>
>>>> ---
>>>> kernel/sched/fair.c | 16 ++++++++++++++++
>>>> 1 file changed, 16 insertions(+)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index 0fb9bf995a47..749210e6316b 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -11203,6 +11203,22 @@ static inline void
>>>> calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>>> }
>>>> #endif
>>>> + /* Allow imbalance between LLCs within a single NUMA node */
>>>> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC
>>>> + && env->sd->parent && env->sd->parent->flags & SD_NUMA) {
>>>
>>> This does not imply multiple LLC in package. SD_SHARE_LLC is
>>> SDF_SHARED_CHILD and will be set from SMT domain onwards. This condition
>>> will be true on Intel with SNC enabled despite not having multiple LLC
>>> and llc_nr will be number of cores there.
>>>
>>> Perhaps multiple LLCs can be detected using:
>>>
>>> !((sd->child->flags ^ sd->flags) & SD_SHARE_LLC)
>
> This should have been just
>
> (sd->child->flags ^ sd->flags) & SD_SHARE_LLC
>
> to find the LLC boundary. Not sure why I prefixed that "!". You also
> have to ensure sd itself is not a NUMA domain which is possible with L3
> as NUMA option EPYC platforms and Intel with SNC.
>
>>
>> Great! Thanks!
>>>> + int child_weight = env->sd->child->span_weight;
>>>> + int llc_nr = env->sd->span_weight / child_weight;
>>>> + int imb_nr, min;
>>>> +
>>>> + if (llc_nr > 1) {
>>>> + /* Let the imbalance not be greater than half of child_weight */
>>>> + min = child_weight >= 4 ? 2 : 1;
>>>> + imb_nr = max_t(int, min, child_weight >> 2);
>>>
>>> Isn't this just max_t(int, child_weight >> 2, 1)?
>>
>> I expect imb_nr to be 2 when child_weight is 4, as I observe that the
>> CPU count per LLC starts at 4 on multi-LLC NUMA systems.
>> However, this may overload the LLCs a bit. I'm not sure if it's a
>> good idea.
>
> My bad. I interpreted ">> 2" as "/ 2" here. A couple of
> brain-stopped-working moments.
>