Message-ID: <xhsmha53jllo7.mognet@vschneid-thinkpadt14sgen2i.remote.csb>
Date: Thu, 28 Aug 2025 13:39:52 +0200
From: Valentin Schneider <vschneid@...hat.com>
To: K Prateek Nayak <kprateek.nayak@....com>, Ingo Molnar
 <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, Juri Lelli
 <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
 Leon Romanovsky <leon@...nel.org>, linux-kernel@...r.kernel.org
Cc: Steve Wahl <steve.wahl@....com>, Dietmar Eggemann
 <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, Ben
 Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, K Prateek Nayak
 <kprateek.nayak@....com>, Tim Chen <tim.c.chen@...ux.intel.com>, Vinicius
 Costa Gomes <vinicius.gomes@...el.com>
Subject: Re: [PATCH v6] sched/fair: Get rid of sched_domains_curr_level hack
 for tl->cpumask()

On 25/08/25 12:02, K Prateek Nayak wrote:
> From: Peter Zijlstra <peterz@...radead.org>
>
> Leon [1] and Vinicius [2] noted a topology_span_sane() warning during
> their testing starting from v6.16-rc1. Debugging that followed pointed to
> the tl->mask() for the NODE domain being incorrectly resolved to that of
> the highest NUMA domain.
>
> tl->mask() for NODE is set to the sd_numa_mask() which depends on the
> global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
> set to the "tl->numa_level" during tl traversal in build_sched_domains()
> calling sd_init() but was not reset before topology_span_sane().
>
> Since "sched_domains_curr_level" still held the stale value left over
> from build_sched_domains(), topology_span_sane() for the NODE domain
> trips when the span of the last NUMA domain overlaps.
>
> Instead of replicating the "sched_domains_curr_level" hack, get rid of
> it entirely and pass the whole "sched_domain_topology_level" object to
> the tl->cpumask() function to prevent such mishaps in the future.
>
> sd_numa_mask() now directly references "tl->numa_level" instead of
> relying on the global "sched_domains_curr_level" hack to index into
> sched_domains_numa_masks[].
>

Oh am I happy to see that hack go :D

> The original warning was reproducible on the following NUMA topology
> reported by Leon:
>
>     $ sudo numactl -H
>     available: 5 nodes (0-4)
>     node 0 cpus: 0 1
>     node 0 size: 2927 MB
>     node 0 free: 1603 MB
>     node 1 cpus: 2 3
>     node 1 size: 3023 MB
>     node 1 free: 3008 MB
>     node 2 cpus: 4 5
>     node 2 size: 3023 MB
>     node 2 free: 3007 MB
>     node 3 cpus: 6 7
>     node 3 size: 3023 MB
>     node 3 free: 3002 MB
>     node 4 cpus: 8 9
>     node 4 size: 3022 MB
>     node 4 free: 2718 MB
>     node distances:
>     node   0   1   2   3   4
>       0:  10  39  38  37  36
>       1:  39  10  38  37  36
>       2:  38  38  10  37  36
>       3:  37  37  37  10  36
>       4:  36  36  36  36  10
>
> The above topology can be mimicked using the following QEMU cmd that was
> used to reproduce the warning and test the fix:
>
>      sudo qemu-system-x86_64 -enable-kvm -cpu host \
>      -m 20G -smp cpus=10,sockets=10 -machine q35 \
>      -object memory-backend-ram,size=4G,id=m0 \
>      -object memory-backend-ram,size=4G,id=m1 \
>      -object memory-backend-ram,size=4G,id=m2 \
>      -object memory-backend-ram,size=4G,id=m3 \
>      -object memory-backend-ram,size=4G,id=m4 \
>      -numa node,cpus=0-1,memdev=m0,nodeid=0 \
>      -numa node,cpus=2-3,memdev=m1,nodeid=1 \
>      -numa node,cpus=4-5,memdev=m2,nodeid=2 \
>      -numa node,cpus=6-7,memdev=m3,nodeid=3 \
>      -numa node,cpus=8-9,memdev=m4,nodeid=4 \
>      -numa dist,src=0,dst=1,val=39 \
>      -numa dist,src=0,dst=2,val=38 \
>      -numa dist,src=0,dst=3,val=37 \
>      -numa dist,src=0,dst=4,val=36 \
>      -numa dist,src=1,dst=0,val=39 \
>      -numa dist,src=1,dst=2,val=38 \
>      -numa dist,src=1,dst=3,val=37 \
>      -numa dist,src=1,dst=4,val=36 \
>      -numa dist,src=2,dst=0,val=38 \
>      -numa dist,src=2,dst=1,val=38 \
>      -numa dist,src=2,dst=3,val=37 \
>      -numa dist,src=2,dst=4,val=36 \
>      -numa dist,src=3,dst=0,val=37 \
>      -numa dist,src=3,dst=1,val=37 \
>      -numa dist,src=3,dst=2,val=37 \
>      -numa dist,src=3,dst=4,val=36 \
>      -numa dist,src=4,dst=0,val=36 \
>      -numa dist,src=4,dst=1,val=36 \
>      -numa dist,src=4,dst=2,val=36 \
>      -numa dist,src=4,dst=3,val=36 \
>      ...
>
>   [ prateek: Moved common functions to include/linux/sched/topology.h,
>     reuse the common bits for s390 and ppc, commit message ]
>
> Reported-by: Leon Romanovsky <leon@...nel.org>
> Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/ [1]
> Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3
> Link: https://lore.kernel.org/lkml/a3de98387abad28592e6ab591f3ff6107fe01dc1.1755893468.git.tim.c.chen@linux.intel.com/ [2]
> Not-yet-signed-off-by: Peter Zijlstra <peterz@...radead.org>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>

Reviewed-by: Valentin Schneider <vschneid@...hat.com>
Tested-by: Valentin Schneider <vschneid@...hat.com>

