Message-ID: <xhsmhy0tfhm5e.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Wed, 25 Jun 2025 17:32:45 +0200
From: Valentin Schneider <vschneid@...hat.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Ingo Molnar
<mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, Juri Lelli
<juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Leon Romanovsky <leon@...nel.org>, linux-kernel@...r.kernel.org
Cc: Steve Wahl <steve.wahl@....com>, Dietmar Eggemann
<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Madadi Vineeth
Reddy <vineethr@...ux.ibm.com>, K
Prateek Nayak <kprateek.nayak@....com>
Subject: Re: [PATCH] sched/topology: Correct "sched_domains_curr_level" in
topology_span_sane()
Hey,
First of all, thanks for looking into this!
On 24/06/25 04:12, K Prateek Nayak wrote:
> The updated topology_span_sane() algorithm in commit ce29a7da84cd
> ("sched/topology: Refinement to topology_span_sane speedup") works on
> the "sched_domain_topology_level" hierarchy to detect overlap in
> !SDTL_OVERLAP domains using the tl->mask() as opposed to the
> sched_domain hierarchy (and the sched_domain_span()) in the previous
> approach.
>
The previous approach also used tl->mask() directly, but it happened to be
called *before* the build_sched_domain() loop, so the NODE iteration ran
with sched_domains_curr_level at its default static value of 0... That only
holds for the first SD build, though; I assume it was then broken for any
subsequent rebuild.
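To make the failure mode concrete, here is a userspace toy model, not kernel code. Every toy_* name is made up for illustration, and the mask table is indexed by CPU rather than by node purely for brevity. It only mimics the shape of the problem: a mask callback whose result depends on a file-scope "current level" that the build loop leaves pointing at the last level it built.

```c
#include <assert.h>

#define NR_CPUS   4
#define NR_LEVELS 2

/* Toy cpumask: bit N set means CPU N is in the mask. */
typedef unsigned int toy_cpumask_t;

/*
 * Hypothetical stand-in for the per-level NUMA masks.
 * CPUs 0-1 sit on node 0, CPUs 2-3 on node 1.
 * Level 0 spans the local node only, level 1 spans every node.
 */
static const toy_cpumask_t toy_numa_masks[NR_LEVELS][NR_CPUS] = {
	{ 0x3, 0x3, 0xc, 0xc },
	{ 0xf, 0xf, 0xf, 0xf },
};

/* Hypothetical stand-in for sched_domains_curr_level (defaults to 0). */
static int toy_curr_level;

/* Mask callback whose result silently depends on the global above. */
static toy_cpumask_t toy_numa_mask(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return toy_numa_masks[toy_curr_level][cpu];
}

/* Mimics a build loop that leaves the global at the last level built. */
static void toy_build_domains(void)
{
	for (int level = 0; level < NR_LEVELS; level++)
		toy_curr_level = level;
}
```

Before toy_build_domains() runs, toy_numa_mask(0) yields the node-local mask; afterwards the very same call yields the widest mask unless the caller resets toy_curr_level first, which is the situation the sanity check ends up in once it runs after the build loop.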
> For NODE domain, the cpumask returned by tl->mask() depends on the
> "sched_domains_curr_level". Unless the "sched_domains_curr_level" is
> reset during topology_span_sane(), it reflects the "numa_level" of the
> last sched_domain created which is incorrect.
>
> Reset the "sched_domains_curr_level" to the "tl->numa_level" during
> topology_span_sane(). Although setting "sched_domains_curr_level" to 0 in
> topology_span_sane() should be enough since all domains with
> numa_level > 0 currently set SDTL_OVERLAP flag, setting it to
> "tl->numa_level" makes it more explicit and future proof against changes
> in the same area.
>
> Cc: Steve Wahl <steve.wahl@....com>
> Reported-by: Leon Romanovsky <leon@...nel.org>
> Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/
> Fixes: ce29a7da84cd ("sched/topology: Refinement to topology_span_sane speedup")
Per the above, this could probably be:
Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
> ---
> This issue can be reproduced on a setup with the following NUMA topology
> shared by Leon:
>
> $ sudo numactl -H
> available: 5 nodes (0-4)
> node 0 cpus: 0 1
> node 0 size: 2927 MB
> node 0 free: 1603 MB
> node 1 cpus: 2 3
> node 1 size: 3023 MB
> node 1 free: 3008 MB
> node 2 cpus: 4 5
> node 2 size: 3023 MB
> node 2 free: 3007 MB
> node 3 cpus: 6 7
> node 3 size: 3023 MB
> node 3 free: 3002 MB
> node 4 cpus: 8 9
> node 4 size: 3022 MB
> node 4 free: 2718 MB
> node distances:
> node   0   1   2   3   4
>   0:  10  39  38  37  36
>   1:  39  10  38  37  36
>   2:  38  38  10  37  36
>   3:  37  37  37  10  36
>   4:  36  36  36  36  10
>
>
> This topology can be emulated using QEMU with the following cmdline used
> in my testing:
>
> sudo ~/dev/qemu/build/qemu-system-x86_64 -enable-kvm \
> -cpu host \
> -m 20G -smp cpus=10,sockets=10 -machine q35 \
> -object memory-backend-ram,size=4G,id=m0 \
> -object memory-backend-ram,size=4G,id=m1 \
> -object memory-backend-ram,size=4G,id=m2 \
> -object memory-backend-ram,size=4G,id=m3 \
> -object memory-backend-ram,size=4G,id=m4 \
> -numa node,cpus=0-1,memdev=m0,nodeid=0 \
> -numa node,cpus=2-3,memdev=m1,nodeid=1 \
> -numa node,cpus=4-5,memdev=m2,nodeid=2 \
> -numa node,cpus=6-7,memdev=m3,nodeid=3 \
> -numa node,cpus=8-9,memdev=m4,nodeid=4 \
> -numa dist,src=0,dst=1,val=39 \
> -numa dist,src=0,dst=2,val=38 \
> -numa dist,src=0,dst=3,val=37 \
> -numa dist,src=0,dst=4,val=36 \
> -numa dist,src=1,dst=0,val=39 \
> -numa dist,src=1,dst=2,val=38 \
> -numa dist,src=1,dst=3,val=37 \
> -numa dist,src=1,dst=4,val=36 \
> -numa dist,src=2,dst=0,val=38 \
> -numa dist,src=2,dst=1,val=38 \
> -numa dist,src=2,dst=3,val=37 \
> -numa dist,src=2,dst=4,val=36 \
> -numa dist,src=3,dst=0,val=37 \
> -numa dist,src=3,dst=1,val=37 \
> -numa dist,src=3,dst=2,val=37 \
> -numa dist,src=3,dst=4,val=36 \
> -numa dist,src=4,dst=0,val=36 \
> -numa dist,src=4,dst=1,val=36 \
> -numa dist,src=4,dst=2,val=36 \
> -numa dist,src=4,dst=3,val=36 \
> ...
>
It's a bit of a mouthful but I would keep that in the changelog itself
given that it's part of reproducing the bug.
>
> Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag
> 'x86-build-2025-05-25' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
> ---
> kernel/sched/topology.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index a2a38e1b6f18..1d634862c8df 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2426,6 +2426,15 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
> cpumask_clear(covered);
> cpumask_clear(id_seen);
>
> +#ifdef CONFIG_NUMA
> + /*
> + * Reuse the sched_domains_curr_level hack since
> + * tl->mask() below can resolve to sd_numa_mask()
> + * for the NODE domain.
> + */
> + sched_domains_curr_level = tl->numa_level;
> +#endif
> +
Urgh... Given this is now invoked after the build_sched_domain() loop, what
if we directly used sched_domain_span() instead, i.e. use
sched_domain_span(per_cpu_ptr(tl->data->sd, cpu))
instead of
tl->mask(id)
which means no indirect use of sched_domains_curr_level. Note that I'm
currently running out of brain juice so this might be a really stupid idea :-)
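Roughly what I have in mind, as a userspace toy rather than a patch. All toy_* names are hypothetical and the bitmask type is a stand-in for struct cpumask. The point is only that once each domain carries the span it was built with, the equal-or-disjoint check can read that stored span and needs no global level at all.

```c
#include <assert.h>

#define NR_CPUS 4

/* Toy cpumask: bit N set means CPU N is in the mask. */
typedef unsigned int toy_cpumask_t;

/* Hypothetical stand-in for a sched_domain that remembers its span. */
struct toy_domain {
	toy_cpumask_t span;
};

/* One domain per CPU for a single topology level. */
static struct toy_domain toy_sd[NR_CPUS];

/* "Build" one level by recording the span chosen for each CPU. */
static void toy_build_level(const toy_cpumask_t *masks)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		toy_sd[cpu].span = masks[cpu];
}

/* Read back the stored span; no global state is consulted. */
static toy_cpumask_t toy_domain_span(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return toy_sd[cpu].span;
}

/* Returns 1 if every pair of spans is either equal or disjoint. */
static int toy_spans_sane(void)
{
	for (int i = 0; i < NR_CPUS; i++) {
		for (int j = i + 1; j < NR_CPUS; j++) {
			toy_cpumask_t a = toy_domain_span(i);
			toy_cpumask_t b = toy_domain_span(j);

			/* Partial overlap: different masks sharing a CPU. */
			if (a != b && (a & b))
				return 0;
		}
	}
	return 1;
}
```

With spans { 0x3, 0x3, 0xc, 0xc } the check passes; with { 0x3, 0x7, 0xc, 0xc } CPUs 0 and 1 partially overlap and it fails. The pairwise loop is quadratic and is only there to keep the toy short; it says nothing about how the real check iterates.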
> /*
> * Non-NUMA levels cannot partially overlap - they must be either
> * completely equal or completely disjoint. Otherwise we can end up
>
> base-commit: 914873bc7df913db988284876c16257e6ab772c6
> --
> 2.34.1