lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20250624041235.1589-1-kprateek.nayak@amd.com>
Date: Tue, 24 Jun 2025 04:12:35 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Leon Romanovsky <leon@...nel.org>,
	<linux-kernel@...r.kernel.org>
CC: Steve Wahl <steve.wahl@....com>, Dietmar Eggemann
	<dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben Segall
	<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
	<vschneid@...hat.com>, Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, "K
 Prateek Nayak" <kprateek.nayak@....com>
Subject: [PATCH] sched/topology: Correct "sched_domains_curr_level" in topology_span_sane()

The updated topology_span_sane() algorithm in commit ce29a7da84cd
("sched/topology: Refinement to topology_span_sane speedup") works on
the "sched_domain_topology_level" hierarchy to detect overlap in
!SDTL_OVERLAP domains using the tl->mask() as opposed to the
sched_domain hierarchy (and the sched_domain_span()) in the previous
approach.

For NODE domain, the cpumask retunred by tl->mask() depends on the
"sched_domains_curr_level". Unless the "sched_domains_curr_level" is
reset during topology_span_sane(), it reflects the "numa_level" of the
last sched_domain created which is incorrect.

Reset the "sched_domains_curr_level" to the "tl->numa_level" during
topology_span_sane(). Although setting "topology_span_sane" to 0 in
topology_span_sane() should be enough since all domains with
numa_level > 0 currently set SDTL_OVERLAP flag, setting it to
"tl->numa_level" makes it more explicit and future proof against changes
in the same area.

Cc: Steve Wahl <steve.wahl@....com>
Reported-by: Leon Romanovsky <leon@...nel.org>
Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/
Fixes: ce29a7da84cd ("sched/topology: Refinement to topology_span_sane speedup")
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
This issue can be reproduced on a setup with the following NUMA topology
shared by Leon:

    $ sudo numactl -H
    available: 5 nodes (0-4)
    node 0 cpus: 0 1
    node 0 size: 2927 MB
    node 0 free: 1603 MB
    node 1 cpus: 2 3
    node 1 size: 3023 MB
    node 1 free: 3008 MB
    node 2 cpus: 4 5
    node 2 size: 3023 MB
    node 2 free: 3007 MB
    node 3 cpus: 6 7
    node 3 size: 3023 MB
    node 3 free: 3002 MB
    node 4 cpus: 8 9
    node 4 size: 3022 MB
    node 4 free: 2718 MB
    node distances:
    node   0   1   2   3   4 
      0:  10  39  38  37  36 
      1:  39  10  38  37  36 
      2:  38  38  10  37  36 
      3:  37  37  37  10  36 
      4:  36  36  36  36  10


This topology can be emulated using QEMU with the following cmdline used
in my testing:

     sudo ~/dev/qemu/build/qemu-system-x86_64 -enable-kvm \
     -cpu host \
     -m 20G -smp cpus=10,sockets=10 -machine q35 \
     -object memory-backend-ram,size=4G,id=m0 \
     -object memory-backend-ram,size=4G,id=m1 \
     -object memory-backend-ram,size=4G,id=m2 \
     -object memory-backend-ram,size=4G,id=m3 \
     -object memory-backend-ram,size=4G,id=m4 \
     -numa node,cpus=0-1,memdev=m0,nodeid=0 \
     -numa node,cpus=2-3,memdev=m1,nodeid=1 \
     -numa node,cpus=4-5,memdev=m2,nodeid=2 \
     -numa node,cpus=6-7,memdev=m3,nodeid=3 \
     -numa node,cpus=8-9,memdev=m4,nodeid=4 \
     -numa dist,src=0,dst=1,val=39 \
     -numa dist,src=0,dst=2,val=38 \
     -numa dist,src=0,dst=3,val=37 \
     -numa dist,src=0,dst=4,val=36 \
     -numa dist,src=1,dst=0,val=39 \
     -numa dist,src=1,dst=2,val=38 \
     -numa dist,src=1,dst=3,val=37 \
     -numa dist,src=1,dst=4,val=36 \
     -numa dist,src=2,dst=0,val=38 \
     -numa dist,src=2,dst=1,val=38 \
     -numa dist,src=2,dst=3,val=37 \
     -numa dist,src=2,dst=4,val=36 \
     -numa dist,src=3,dst=0,val=37 \
     -numa dist,src=3,dst=1,val=37 \
     -numa dist,src=3,dst=2,val=37 \
     -numa dist,src=3,dst=4,val=36 \
     -numa dist,src=4,dst=0,val=36 \
     -numa dist,src=4,dst=1,val=36 \
     -numa dist,src=4,dst=2,val=36 \
     -numa dist,src=4,dst=3,val=36 \
     ...


Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag
'x86-build-2025-05-25' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
---
 kernel/sched/topology.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index a2a38e1b6f18..1d634862c8df 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2426,6 +2426,15 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
 		cpumask_clear(covered);
 		cpumask_clear(id_seen);
 
+#ifdef CONFIG_NUMA
+		/*
+		 * Reuse the sched_domains_curr_level hack since
+		 * tl->mask() below can resolve to sd_numa_mask()
+		 * for the NODE domain.
+		 */
+		sched_domains_curr_level = tl->numa_level;
+#endif
+
 		/*
 		 * Non-NUMA levels cannot partially overlap - they must be either
 		 * completely equal or completely disjoint. Otherwise we can end up

base-commit: 914873bc7df913db988284876c16257e6ab772c6
-- 
2.34.1


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ