Message-ID: <e4eb39e8-0391-4152-9e25-daf4b47bfc02@amd.com>
Date: Mon, 21 Jul 2025 09:58:12 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Leon Romanovsky <leon@...nel.org>
CC: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Valentin Schneider <vschneid@...hat.com>,
<linux-kernel@...r.kernel.org>, Steve Wahl <steve.wahl@....com>, "Borislav
Petkov" <bp@...en8.de>, Dietmar Eggemann <dietmar.eggemann@....com>, "Steven
Rostedt" <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>
Subject: Re: [PATCH v4] sched/fair: Use sched_domain_span() for
topology_span_sane()
Hello Leon,
On 7/20/2025 4:11 PM, Leon Romanovsky wrote:
> On Wed, Jul 09, 2025 at 04:19:17PM +0000, K Prateek Nayak wrote:
>> Leon noted a topology_span_sane() warning in their guest deployment
>> starting from v6.16-rc1 [1]. Debug that followed pointed to the
>> tl->mask() for the NODE domain being incorrectly resolved to that of the
>> highest NUMA domain.
>>
>> tl->mask() for NODE is set to the sd_numa_mask() which depends on the
>> global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
>> set to the "tl->numa_level" during tl traversal in build_sched_domains()
>> calling sd_init() but was not reset before topology_span_sane().
>>
>> Since "tl->numa_level" still reflected the old value from
>> build_sched_domains(), topology_span_sane() for the NODE domain trips
>> when the span of the last NUMA domain overlaps.
>>
>> Instead of replicating the "sched_domains_curr_level" hack, Valentin
>> suggested using the spans from the sched_domain objects constructed
>> during build_sched_domains() which can also catch overlaps when the
>> domain spans are fixed up by build_sched_domain().
>>
>> Since build_sched_domain() is skipped when tl->mask() of a child domain
>> already covers the entire cpumap, skip the domains that have an empty
>> span.
>>
>> The original warning was reproducible on the following NUMA topology
>> reported by Leon:
>>
>> $ sudo numactl -H
>> available: 5 nodes (0-4)
>> node 0 cpus: 0 1
>> node 0 size: 2927 MB
>> node 0 free: 1603 MB
>> node 1 cpus: 2 3
>> node 1 size: 3023 MB
>> node 1 free: 3008 MB
>> node 2 cpus: 4 5
>> node 2 size: 3023 MB
>> node 2 free: 3007 MB
>> node 3 cpus: 6 7
>> node 3 size: 3023 MB
>> node 3 free: 3002 MB
>> node 4 cpus: 8 9
>> node 4 size: 3022 MB
>> node 4 free: 2718 MB
>> node distances:
>> node 0 1 2 3 4
>> 0: 10 39 38 37 36
>> 1: 39 10 38 37 36
>> 2: 38 38 10 37 36
>> 3: 37 37 37 10 36
>> 4: 36 36 36 36 10
>>
>> The above topology can be mimicked using the following QEMU cmd that was
>> used to reproduce the warning and test the fix:
>>
>> sudo qemu-system-x86_64 -enable-kvm -cpu host \
>> -m 20G -smp cpus=10,sockets=10 -machine q35 \
>> -object memory-backend-ram,size=4G,id=m0 \
>> -object memory-backend-ram,size=4G,id=m1 \
>> -object memory-backend-ram,size=4G,id=m2 \
>> -object memory-backend-ram,size=4G,id=m3 \
>> -object memory-backend-ram,size=4G,id=m4 \
>> -numa node,cpus=0-1,memdev=m0,nodeid=0 \
>> -numa node,cpus=2-3,memdev=m1,nodeid=1 \
>> -numa node,cpus=4-5,memdev=m2,nodeid=2 \
>> -numa node,cpus=6-7,memdev=m3,nodeid=3 \
>> -numa node,cpus=8-9,memdev=m4,nodeid=4 \
>> -numa dist,src=0,dst=1,val=39 \
>> -numa dist,src=0,dst=2,val=38 \
>> -numa dist,src=0,dst=3,val=37 \
>> -numa dist,src=0,dst=4,val=36 \
>> -numa dist,src=1,dst=0,val=39 \
>> -numa dist,src=1,dst=2,val=38 \
>> -numa dist,src=1,dst=3,val=37 \
>> -numa dist,src=1,dst=4,val=36 \
>> -numa dist,src=2,dst=0,val=38 \
>> -numa dist,src=2,dst=1,val=38 \
>> -numa dist,src=2,dst=3,val=37 \
>> -numa dist,src=2,dst=4,val=36 \
>> -numa dist,src=3,dst=0,val=37 \
>> -numa dist,src=3,dst=1,val=37 \
>> -numa dist,src=3,dst=2,val=37 \
>> -numa dist,src=3,dst=4,val=36 \
>> -numa dist,src=4,dst=0,val=36 \
>> -numa dist,src=4,dst=1,val=36 \
>> -numa dist,src=4,dst=2,val=36 \
>> -numa dist,src=4,dst=3,val=36 \
>> ...
>>
>> Suggested-by: Valentin Schneider <vschneid@...hat.com>
>> Reported-by: Leon Romanovsky <leon@...nel.org>
>> Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1]
>> Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3
>> Reviewed-by: Steve Wahl <steve.wahl@....com>
>> Tested-by: Valentin Schneider <vschneid@...hat.com>
>> Reviewed-by: Valentin Schneider <vschneid@...hat.com>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
>> ---
>> Changes are based on tip:sched/urgent at commit fc975cfb3639
>> ("sched/deadline: Fix dl_server runtime calculation formula")
>
> Was this patch picked?
Not yet. I think Peter was planning to pick it up as v6.17 material.
P.S. The latest version, v5, can be found at
https://lore.kernel.org/lkml/20250715040824.893-1-kprateek.nayak@amd.com/
It is essentially v4 rebased on top of tip:sched/core to resolve
conflicts with a recent cleanup.
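For anyone following the quoted description without the topology at hand, here is a minimal Python sketch (not kernel code) of why a NODE mask resolved at a stale NUMA level can fail the sanity check. It rests on two assumptions: that the mask for a node at a given NUMA level is the union of the CPUs of all nodes within that level's distance, and that masks at a non-NUMA level are expected to be either equal or disjoint, as the title of the Fixes: commit suggests. The helper names level_mask() and spans_sane() are hypothetical and exist only for this illustration.

```python
# Distance table and CPU lists copied from the quoted `numactl -H` output.
DIST = [
    [10, 39, 38, 37, 36],
    [39, 10, 38, 37, 36],
    [38, 38, 10, 37, 36],
    [37, 37, 37, 10, 36],
    [36, 36, 36, 36, 10],
]
NODE_CPUS = [{0, 1}, {2, 3}, {4, 5}, {6, 7}, {8, 9}]

# Unique distances in ascending order; index 0 is the node-local level.
LEVELS = sorted({d for row in DIST for d in row})


def level_mask(level, node):
    """CPUs of every node within LEVELS[level] of `node` (assumed model)."""
    cpus = set()
    for other, dist in enumerate(DIST[node]):
        if dist <= LEVELS[level]:
            cpus |= NODE_CPUS[other]
    return frozenset(cpus)


def spans_sane(level):
    """True if all per-node masks at `level` are pairwise equal or disjoint."""
    masks = [level_mask(level, n) for n in range(len(DIST))]
    return all(a == b or not (a & b) for a in masks for b in masks)


if __name__ == "__main__":
    for lvl, dist in enumerate(LEVELS):
        print(f"level {lvl} (distance <= {dist}): sane={spans_sane(lvl)}")
```

Under these assumptions, level 0 (the node-local masks) passes the check, while level 1 (distance 36) does not: nodes 0 and 1 both pull in CPUs 8-9 from node 4, so their masks overlap without being equal.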
>
> Thanks,
> Tested-by: Leon Romanovsky <leon@...nel.org>
Thanks a ton!
--
Thanks and Regards,
Prateek