Message-ID: <aGKhlYO_SJaNm1mJ@swahl-home.5wahls.com>
Date: Mon, 30 Jun 2025 09:39:17 -0500
From: Steve Wahl <steve.wahl@....com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Valentin Schneider <vschneid@...hat.com>,
        Leon Romanovsky <leon@...nel.org>, linux-kernel@...r.kernel.org,
        Steve Wahl <steve.wahl@....com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>,
        Mel Gorman <mgorman@...e.de>,
        Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [PATCH v2] sched/fair: Use sched_domain_span() for
 topology_span_sane()

Prateek,

This looks good to me.

Reviewed-by: Steve Wahl <steve.wahl@....com>


On Mon, Jun 30, 2025 at 06:10:59AM +0000, K Prateek Nayak wrote:
> Leon noted a topology_span_sane() warning in their guest deployment
> starting from v6.16-rc1 [1]. Debug that followed pointed to the
> tl->mask() for the NODE domain being incorrectly resolved to that of the
> highest NUMA domain.
> 
> tl->mask() for NODE is set to the sd_numa_mask() which depends on the
> global "sched_domains_curr_level" hack. "sched_domains_curr_level" is
> set to the "tl->numa_level" during tl traversal in build_sched_domains()
> calling sd_init() but was not reset before topology_span_sane().
> 
> Since "sched_domains_curr_level" still reflected the old value from
> build_sched_domains(), topology_span_sane() for the NODE domain trips
> when the span of the last NUMA domain overlaps.
> 
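
The stale-global problem described above can be illustrated with a
small userspace sketch. It is not kernel code: curr_level stands in for
"sched_domains_curr_level", numa_mask() for a mask callback that reads
it, and the level_span[] values and all function names are hypothetical,
chosen only to show why a later caller sees the widest span unless the
global is set again.

```c
#include <assert.h>
#include <stdint.h>

#define NR_LEVELS 3

/* Hypothetical spans seen from CPU 0 at each level, one bit per CPU. */
static const uint64_t level_span[NR_LEVELS] = {
	0x003,	/* level 0: CPUs 0-1, the node itself */
	0x00f,	/* level 1: CPUs 0-3 */
	0x3ff,	/* level 2: CPUs 0-9, the widest level */
};

/* Global "current level"; stands in for sched_domains_curr_level. */
static int curr_level;

/* Mask callback whose result silently depends on the global above. */
static uint64_t numa_mask(void)
{
	assert(curr_level >= 0 && curr_level < NR_LEVELS);
	return level_span[curr_level];
}

/* Walk every level as a build pass would; the global keeps the last one. */
static void build_pass(void)
{
	for (int level = 0; level < NR_LEVELS; level++) {
		curr_level = level;
		(void)numa_mask();
	}
}

/* A later check that wants the level-0 span but never sets the global. */
static uint64_t check_without_reset(void)
{
	return numa_mask();
}

/* The same check with the global set explicitly before the call. */
static uint64_t check_with_reset(void)
{
	curr_level = 0;
	return numa_mask();
}
```

After build_pass(), check_without_reset() returns the level-2 span
(0x3ff) rather than the level-0 span (0x003), while check_with_reset()
returns 0x003. Reading spans from already-built objects, as the patch
does, avoids depending on the global at all.
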
> Instead of replicating the "sched_domains_curr_level" hack, Valentin
> suggested using the spans from the sched_domain objects constructed
> during build_sched_domains() which can also catch overlaps when the
> domain spans are fixed up by build_sched_domain().
> 
> The original warning was reproducible on the following NUMA topology
> reported by Leon:
> 
>     $ sudo numactl -H
>     available: 5 nodes (0-4)
>     node 0 cpus: 0 1
>     node 0 size: 2927 MB
>     node 0 free: 1603 MB
>     node 1 cpus: 2 3
>     node 1 size: 3023 MB
>     node 1 free: 3008 MB
>     node 2 cpus: 4 5
>     node 2 size: 3023 MB
>     node 2 free: 3007 MB
>     node 3 cpus: 6 7
>     node 3 size: 3023 MB
>     node 3 free: 3002 MB
>     node 4 cpus: 8 9
>     node 4 size: 3022 MB
>     node 4 free: 2718 MB
>     node distances:
>     node   0   1   2   3   4
>       0:  10  39  38  37  36
>       1:  39  10  38  37  36
>       2:  38  38  10  37  36
>       3:  37  37  37  10  36
>       4:  36  36  36  36  10
> 
> The above topology can be mimicked using the following QEMU cmd that was
> used to reproduce the warning and test the fix:
> 
>      sudo qemu-system-x86_64 -enable-kvm -cpu host \
>      -m 20G -smp cpus=10,sockets=10 -machine q35 \
>      -object memory-backend-ram,size=4G,id=m0 \
>      -object memory-backend-ram,size=4G,id=m1 \
>      -object memory-backend-ram,size=4G,id=m2 \
>      -object memory-backend-ram,size=4G,id=m3 \
>      -object memory-backend-ram,size=4G,id=m4 \
>      -numa node,cpus=0-1,memdev=m0,nodeid=0 \
>      -numa node,cpus=2-3,memdev=m1,nodeid=1 \
>      -numa node,cpus=4-5,memdev=m2,nodeid=2 \
>      -numa node,cpus=6-7,memdev=m3,nodeid=3 \
>      -numa node,cpus=8-9,memdev=m4,nodeid=4 \
>      -numa dist,src=0,dst=1,val=39 \
>      -numa dist,src=0,dst=2,val=38 \
>      -numa dist,src=0,dst=3,val=37 \
>      -numa dist,src=0,dst=4,val=36 \
>      -numa dist,src=1,dst=0,val=39 \
>      -numa dist,src=1,dst=2,val=38 \
>      -numa dist,src=1,dst=3,val=37 \
>      -numa dist,src=1,dst=4,val=36 \
>      -numa dist,src=2,dst=0,val=38 \
>      -numa dist,src=2,dst=1,val=38 \
>      -numa dist,src=2,dst=3,val=37 \
>      -numa dist,src=2,dst=4,val=36 \
>      -numa dist,src=3,dst=0,val=37 \
>      -numa dist,src=3,dst=1,val=37 \
>      -numa dist,src=3,dst=2,val=37 \
>      -numa dist,src=3,dst=4,val=36 \
>      -numa dist,src=4,dst=0,val=36 \
>      -numa dist,src=4,dst=1,val=36 \
>      -numa dist,src=4,dst=2,val=36 \
>      -numa dist,src=4,dst=3,val=36 \
>      ...
> 
> Cc: Steve Wahl <steve.wahl@....com>
> Suggested-by: Valentin Schneider <vschneid@...hat.com>
> Reported-by: Leon Romanovsky <leon@...nel.org>
> Closes: https://lore.kernel.org/lkml/20250610110701.GA256154@unreal/  [1]
> Fixes: ccf74128d66c ("sched/topology: Assert non-NUMA topology masks don't (partially) overlap") # ce29a7da84cd, f55dac1dafb3
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
> ---
> v1..v2:
> 
> o Use sched_domain_span() instead of replicating the
>   "sched_domains_curr_level" hack (Valentin)
> 
> o Included the QEMU cmd in the commit message for the record (Valentin)
> 
> v1: https://lore.kernel.org/lkml/20250624041235.1589-1-kprateek.nayak@amd.com/ 
> 
> Changes are based on tip:sched/urgent at commit 914873bc7df9 ("Merge tag
> 'x86-build-2025-05-25' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
> ---
>  kernel/sched/topology.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index a2a38e1b6f18..734fee573992 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2418,6 +2418,7 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
>  	id_seen = sched_domains_tmpmask2;
>  
>  	for_each_sd_topology(tl) {
> +		struct sd_data *sdd = &tl->data;
>  
>  		/* NUMA levels are allowed to overlap */
>  		if (tl->flags & SDTL_OVERLAP)
> @@ -2433,22 +2434,24 @@ static bool topology_span_sane(const struct cpumask *cpu_map)
>  		 * breaks the linking done for an earlier span.
>  		 */
>  		for_each_cpu(cpu, cpu_map) {
> -			const struct cpumask *tl_cpu_mask = tl->mask(cpu);
> +			struct sched_domain *sd = *per_cpu_ptr(sdd->sd, cpu);
> +			struct cpumask *sd_span = sched_domain_span(sd);
>  			int id;
>  
>  			/* lowest bit set in this mask is used as a unique id */
> -			id = cpumask_first(tl_cpu_mask);
> +			id = cpumask_first(sd_span);
>  
>  			if (cpumask_test_cpu(id, id_seen)) {
> -				/* First CPU has already been seen, ensure identical spans */
> -				if (!cpumask_equal(tl->mask(id), tl_cpu_mask))
> +				/* First CPU has already been seen, ensure identical sd spans */
> +				sd = *per_cpu_ptr(sdd->sd, id);
> +				if (!cpumask_equal(sched_domain_span(sd), sd_span))
>  					return false;
>  			} else {
>  				/* First CPU hasn't been seen before, ensure it's a completely new span */
> -				if (cpumask_intersects(tl_cpu_mask, covered))
> +				if (cpumask_intersects(sd_span, covered))
>  					return false;
>  
> -				cpumask_or(covered, covered, tl_cpu_mask);
> +				cpumask_or(covered, covered, sd_span);
>  				cpumask_set_cpu(id, id_seen);
>  			}
>  		}
> 
> base-commit: 914873bc7df913db988284876c16257e6ab772c6
> -- 
> 2.34.1
> 
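
For readers following the loop in the diff above, here is a minimal
userspace sketch of the same check for a single topology level. It is
an illustration under assumptions, not the kernel implementation: spans
are plain 64-bit masks instead of struct cpumask, span[cpu] stands in
for sched_domain_span() of that CPU's domain, and first_cpu() and
spans_sane() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index of the lowest set bit; the mask must be non-zero. */
static int first_cpu(uint64_t mask)
{
	int cpu = 0;

	assert(mask != 0);
	while (!(mask & 1)) {
		mask >>= 1;
		cpu++;
	}
	return cpu;
}

/*
 * span[cpu] is the set of CPUs in cpu's domain at one topology level.
 * Returns false as soon as a partial overlap is found, following the
 * structure of the patched loop.
 */
static bool spans_sane(const uint64_t *span, int ncpus)
{
	uint64_t covered = 0, id_seen = 0;

	assert(ncpus > 0 && ncpus <= 64);

	for (int cpu = 0; cpu < ncpus; cpu++) {
		/* Lowest CPU in the span serves as the span's unique id. */
		int id = first_cpu(span[cpu]);

		assert(id < ncpus);

		if (id_seen & (UINT64_C(1) << id)) {
			/* Id already seen: the two spans must be identical. */
			if (span[id] != span[cpu])
				return false;
		} else {
			/* New id: the span must not touch anything covered. */
			if (span[cpu] & covered)
				return false;

			covered |= span[cpu];
			id_seen |= UINT64_C(1) << id;
		}
	}
	return true;
}
```

With four CPUs, spans {0x3, 0x3, 0xc, 0xc} pass because each pair is
either identical or disjoint, while {0x3, 0x7, 0xc, 0xc} fails because
CPU 1's span shares its first CPU with CPU 0's span but is not equal
to it.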

-- 
Steve Wahl, Hewlett Packard Enterprise
