Message-ID: <4d9f64ea691b8a2f7571671156d511407aeee1c8.camel@linux.intel.com>
Date: Mon, 25 Aug 2025 13:05:35 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>, Peter Zijlstra
 <peterz@...radead.org>,  Ingo Molnar <mingo@...hat.com>
Cc: Juri Lelli <juri.lelli@...hat.com>, Dietmar Eggemann
 <dietmar.eggemann@....com>, Ben Segall <bsegall@...gle.com>, Mel Gorman
 <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, Tim Chen
 <tim.c.chen@...el.com>, Vincent Guittot <vincent.guittot@...aro.org>, Libo
 Chen <libo.chen@...cle.com>, Abel Wu <wuyun.abel@...edance.com>, Len Brown
 <len.brown@...el.com>, linux-kernel@...r.kernel.org, K Prateek Nayak
 <kprateek.nayak@....com>, "Gautham R . Shenoy" <gautham.shenoy@....com>, 
 Zhao Liu <zhao1.liu@...el.com>, Vinicius Costa Gomes
 <vinicius.gomes@...el.com>, Chen Yu <yu.chen.surf@...mail.com>
Subject: Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X,
 CWF-X in SNC-3 mode

On Mon, 2025-08-25 at 13:08 +0800, Chen, Yu C wrote:
> On 8/23/2025 4:14 AM, Tim Chen wrote:
> > 
... snip...
> > 
> > Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@...el.com>
> > Tested-by: Zhao Liu <zhao1.liu@...el.com>
> > ---
> >   arch/x86/kernel/smpboot.c      | 28 ++++++++++++++++++++++++++++
> >   include/linux/sched/topology.h |  1 +
> >   kernel/sched/topology.c        | 25 +++++++++++++++++++------
> >   3 files changed, 48 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 33e166f6ab12..c425e84c88b5 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
> >   	set_sched_topology(topology);
> >   }
> >   
> > +int sched_node_distance(int from, int to)
> > +{
> > +	int d = node_distance(from, to);
> > +
> > +	if (!x86_has_numa_in_package)
> > +		return d;
> > +
> > +	switch (boot_cpu_data.x86_vfm) {
> > +	case INTEL_GRANITERAPIDS_X:
> > +	case INTEL_ATOM_DARKMONT_X:
> > +		if (d < REMOTE_DISTANCE)
> > +			return d;
> > +
> > +		/*
> > +		 * Trim the finer distance tuning for nodes in a remote package
> > +		 * for the purpose of building sched domains.
> > +		 * Put the NUMA nodes of each remote package into a single sched group.
> > +		 * This simplifies the NUMA domains and avoids extra NUMA levels that
> > +		 * each include different NUMA nodes from the remote packages.
> > +		 *
> > +		 * GNR-X and CWF-X have a GLUELESS-MESH topology with SNC
> > +		 * turned on.
> > +		 */
> > +		d = (d / 10) * 10;
> 
> Does the '10' here mean that the per-socket distance step in the
> SLIT table is 10?
> 

Yes.

> For example, from socket0's point of view,
> the distance from socket1 to socket0 is within [20, 29), the distance
> from socket2 to socket0 is within [30, 39), and so on. If this is the
> case, maybe add a comment above for future reference.
> 

We don't expect more than 2 sockets for GNR and CWF, so the 2-hop
case like [30, 39) should not happen.
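
To illustrate with made-up SLIT values (not taken from a real GNR-X
table): with SNC-3, the sub-NUMA nodes in the local socket might report
distances like 12/14 (below REMOTE_DISTANCE, so returned untouched),
while the three nodes in the remote socket report 21/23/25. The
rounding collapses the remote values to a single distance:

	/* Standalone illustration, hypothetical distances only. */
	#include <stdio.h>

	int main(void)
	{
		int remote[] = { 21, 23, 25 };	/* made-up remote SNC node distances */
		int i;

		for (i = 0; i < 3; i++) {
			int d = remote[i];

			d = (d / 10) * 10;	/* 21, 23 and 25 all become 20 */
			printf("%d -> %d\n", remote[i], d);
		}
		return 0;
	}

so every node in the remote package ends up at the same sched distance
and falls into one sched group.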

> > +	}
> > +	return d;
> > +}
> > +
> >   void set_cpu_sibling_map(int cpu)
> >   {
> >   	bool has_smt = __max_threads_per_core > 1;
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 5263746b63e8..3b62226394af 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -59,6 +59,7 @@ static inline int cpu_numa_flags(void)
> >   #endif
> >   
> >   extern int arch_asym_cpu_priority(int cpu);
> > +extern int sched_node_distance(int from, int to);
> >   
> >   struct sched_domain_attr {
> >   	int relax_domain_level;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 9a7ac67e3d63..3f485da994a7 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1804,7 +1804,7 @@ bool find_numa_distance(int distance)
> >   	bool found = false;
> >   	int i, *distances;
> >   
> > -	if (distance == node_distance(0, 0))
> > +	if (distance == sched_node_distance(0, 0))
> >   		return true;
> > 
> 
> If I understand correctly, this patch is trying to fix the sched
> domain issue during load balancing, and the NUMA balancing logic
> should not be changed, because NUMA balancing is not based on
> sched domains?
> 
> That is to say, since find_numa_distance() is only used by
> NUMA balancing, should we keep find_numa_distance() using
> node_distance()?

The code here is walking the distance array that was initialized
using sched_node_distance(). Hence the change.

Otherwise we could keep a separate sched_distance matrix and use
only node_distance() here. I did not do that, to minimize the change.
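
As a toy standalone example (hypothetical distances, not kernel code) of
why the lookup has to apply the same trimming that filled the table:

	#include <stdbool.h>
	#include <stdio.h>

	static int trim(int d)			/* stand-in for the trimmed distance */
	{
		return d >= 20 ? (d / 10) * 10 : d;
	}

	int main(void)
	{
		int raw[] = { 10, 12, 23 };	/* hypothetical node distances */
		int table[3], i;
		bool hit_raw = false, hit_trimmed = false;

		for (i = 0; i < 3; i++)		/* table is filled with trimmed values */
			table[i] = trim(raw[i]);

		for (i = 0; i < 3; i++) {
			hit_raw |= (table[i] == 23);		/* raw 23: never found */
			hit_trimmed |= (table[i] == trim(23));	/* trimmed 20: found   */
		}
		printf("raw lookup: %d, trimmed lookup: %d\n", hit_raw, hit_trimmed);
		return 0;
	}

Looking up the raw distance 23 misses, while looking up trim(23) == 20
hits, which is why find_numa_distance() now compares against
sched_node_distance() as well.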

Tim

> 
> >   	rcu_read_lock();
> > @@ -1887,6 +1887,15 @@ static void init_numa_topology_type(int offline_node)
> >   
> >   #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
> >   
> > +/*
> > + * Architecture could simplify NUMA distance, to avoid
> > + * creating too many NUMA levels when SNC is turned on.
> > + */
> > +int __weak sched_node_distance(int from, int to)
> > +{
> > +	return node_distance(from, to);
> > +}
> > +
> >   void sched_init_numa(int offline_node)
> >   {
> >   	struct sched_domain_topology_level *tl;
> > @@ -1894,6 +1903,7 @@ void sched_init_numa(int offline_node)
> >   	int nr_levels = 0;
> >   	int i, j;
> >   	int *distances;
> > +	int max_dist = 0;
> >   	struct cpumask ***masks;
> >   
> >   	/*
> > @@ -1907,7 +1917,10 @@ void sched_init_numa(int offline_node)
> >   	bitmap_zero(distance_map, NR_DISTANCE_VALUES);
> >   	for_each_cpu_node_but(i, offline_node) {
> >   		for_each_cpu_node_but(j, offline_node) {
> > -			int distance = node_distance(i, j);
> > +			int distance = sched_node_distance(i, j);
> > +
> > +			if (node_distance(i, j) > max_dist)
> > +				max_dist = node_distance(i, j);
> >   
> >   			if (distance < LOCAL_DISTANCE || distance >= NR_DISTANCE_VALUES) {
> >   				sched_numa_warn("Invalid distance value range");
> > @@ -1979,10 +1992,10 @@ void sched_init_numa(int offline_node)
> >   			masks[i][j] = mask;
> >   
> >   			for_each_cpu_node_but(k, offline_node) {
> > -				if (sched_debug() && (node_distance(j, k) != node_distance(k, j)))
> > +				if (sched_debug() && (sched_node_distance(j, k) != sched_node_distance(k, j)))
> >   					sched_numa_warn("Node-distance not symmetric");
> >   
> > -				if (node_distance(j, k) > sched_domains_numa_distance[i])
> > +				if (sched_node_distance(j, k) > sched_domains_numa_distance[i])
> >   					continue;
> >   
> >   				cpumask_or(mask, mask, cpumask_of_node(k));
> > @@ -2022,7 +2035,7 @@ void sched_init_numa(int offline_node)
> >   	sched_domain_topology = tl;
> >   
> >   	sched_domains_numa_levels = nr_levels;
> > -	WRITE_ONCE(sched_max_numa_distance, sched_domains_numa_distance[nr_levels - 1]);
> > +	WRITE_ONCE(sched_max_numa_distance, max_dist);
> 
> The above change uses the original node_distance() rather than
> sched_node_distance() for sched_max_numa_distance, and
> sched_max_numa_distance is only used by NUMA balancing to figure out
> the NUMA topology type as well as to scale the NUMA fault statistics
> for remote nodes.
> 
> So I think we might want to keep things aligned by using node_distance()
> in find_numa_distance().
> 
> thanks,
> Chenyu
> >   
> >   	init_numa_topology_type(offline_node);
> >   }
> > @@ -2092,7 +2105,7 @@ void sched_domains_numa_masks_set(unsigned int cpu)
> >   				continue;
> >   
> >   			/* Set ourselves in the remote node's masks */
> > -			if (node_distance(j, node) <= sched_domains_numa_distance[i])
> > +			if (sched_node_distance(j, node) <= sched_domains_numa_distance[i])
> >   				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> >   		}
> >   	}

