linux-kernel - Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X in SNC-3 mode

Message-ID: <20250825075642.GQ3245006@noisy.programming.kicks-ass.net>
Date: Mon, 25 Aug 2025 09:56:42 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Tim Chen <tim.c.chen@...ux.intel.com>, Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	Tim Chen <tim.c.chen@...el.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Libo Chen <libo.chen@...cle.com>,
	Abel Wu <wuyun.abel@...edance.com>, Len Brown <len.brown@...el.com>,
	linux-kernel@...r.kernel.org,
	K Prateek Nayak <kprateek.nayak@....com>,
	"Gautham R . Shenoy" <gautham.shenoy@....com>,
	Zhao Liu <zhao1.liu@...el.com>,
	Vinicius Costa Gomes <vinicius.gomes@...el.com>,
	Chen Yu <yu.chen.surf@...mail.com>
Subject: Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X, CWF-X
 in SNC-3 mode

On Mon, Aug 25, 2025 at 01:08:39PM +0800, Chen, Yu C wrote:
> On 8/23/2025 4:14 AM, Tim Chen wrote:
> > It is possible for Granite Rapids X (GNR) and Clearwater Forest X
> > (CWF) to have up to 3 dies per package. When sub-numa cluster (SNC-3)
> > is enabled, each die will become a separate NUMA node in the package
> > with different distances between dies within the same package.
> > 
> > For example, on GNR-X, we see the following numa distances for a 2 socket
> > system with 3 dies per socket:
> > 
> >          package 1       package2
> >              ----------------
> >              |               |
> >          ---------       ---------
> >          |   0   |       |   3   |
> >          ---------       ---------
> >              |               |
> >          ---------       ---------
> >          |   1   |       |   4   |
> >          ---------       ---------
> >              |               |
> >          ---------       ---------
> >          |   2   |       |   5   |
> >          ---------       ---------
> >              |               |
> >              ----------------
> > 
> > node distances:
> > node     0    1    2    3    4    5
> >     0:   10   15   17   21   28   26
> >     1:   15   10   15   23   26   23
> >     2:   17   15   10   26   23   21
> >     3:   21   28   26   10   15   17
> >     4:   23   26   23   15   10   15
> >     5:   26   23   21   17   15   10
> > 

> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 33e166f6ab12..c425e84c88b5 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -515,6 +515,34 @@ static void __init build_sched_topology(void)
> >   	set_sched_topology(topology);
> >   }
> > +int sched_node_distance(int from, int to)
> > +{
> > +	int d = node_distance(from, to);
> > +
> > +	if (!x86_has_numa_in_package)
> > +		return d;
> > +
> > +	switch (boot_cpu_data.x86_vfm) {
> > +	case INTEL_GRANITERAPIDS_X:
> > +	case INTEL_ATOM_DARKMONT_X:
> > +		if (d < REMOTE_DISTANCE)
> > +			return d;
> > +
> > +		/*
> > +		 * Trim finer distance tuning for nodes in remote package
> > +		 * for the purpose of building sched domains.
> > +		 * Put NUMA nodes in each remote package in a single sched group.
> > +		 * Simplify NUMA domains and avoid extra NUMA levels including different
> > +		 * NUMA nodes in remote packages.
> > +		 *
> > +		 * GNR-x and CWF-X has GLUELESS-MESH topology with SNC
> > +		 * turned on.
> > +		 */
> > +		d = (d / 10) * 10;
> 
> Does the '10' here mean that, the distance of the hierarchy socket
> is 10 from SLIT table? For example, from a socket0 point of view,
> the distance of socket1 to socket0 is within [20, 29), the distance
> of socket2 to socket0 is [30,39), and so on. If this is the case,
> maybe add a comment above for future reference.

This is all because of the ACPI SLIT distance definitions I suppose, 10
for local and 20 for remote (which IMO is actively wrong, since it
mandates distances that are not relative performance).

Additionally, the table above magically has all the remote distances in
the range of [20,29], and so stripping the ones digit works.

The problem of course is that the SLIT table is fully under control of
the BIOS, and a random BIOS monkey could cause this to not be so, making
the above code not work as intended. E.g. if the remote distances end up
being in the range of [20,35] or whatever, then it all goes sideways.
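
A minimal userspace sketch of that failure mode, not kernel code:
strip_ones() mirrors the d = (d / 10) * 10 line from the quoted patch,
and nr_rounded_levels() is a hypothetical helper that counts how many
distinct values survive the rounding. The 256 bound on distances is an
assumption made only to keep the example small.

```c
#include <assert.h>

/* Same arithmetic as the quoted patch: drop the ones digit. */
static int strip_ones(int d)
{
	return (d / 10) * 10;
}

/*
 * Hypothetical helper: count the distinct values left after rounding
 * a set of remote distances. Assumes every distance is in [0, 256).
 */
static int nr_rounded_levels(const int *dist, int n)
{
	unsigned char seen[256] = { 0 };
	int i, levels = 0;

	for (i = 0; i < n; i++) {
		int d = strip_ones(dist[i]);

		assert(d >= 0 && d < 256);
		if (!seen[d]) {
			seen[d] = 1;
			levels++;
		}
	}
	return levels;
}
```

With the remote distances from the table above (21, 23, 26, 28) this
leaves a single level, 20. With a made-up set such as 21, 28, 31, 35 it
leaves two levels, 20 and 30, so the remote package would be split again.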

( There is a history of manipulating the SLIT table to influence
scheduler behaviour of the OS of choice :-/ )

Similarly, on a 4 node system, it is possible that the 2-hop distances
don't align nicely with the 10s and we're up a creek again.

This is all very fragile. A much better way would be to allocate a new
SLIT table, identify the (local) clusters and replace all remote
distances with an average.

Eg. since (21+28+26+23+26+23+26+23+21)/9 ~ 24, you end up with:

 node     0    1    2    3    4    5
     0:   10   15   17   24   24   24
     1:   15   10   15   24   24   24
     2:   17   15   10   24   24   24
     3:   24   24   24   10   15   17
     4:   24   24   24   15   10   15
     5:   24   24   24   17   15   10
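
A rough userspace sketch of that averaging, under stated assumptions:
flatten_remote() is a hypothetical name, a distance below
REMOTE_DISTANCE (20) is taken to mean "same package", and a single
average is used for all remote entries, which only matches the
2-package case shown here. More packages would presumably want one
average per pair of packages.

```c
#include <assert.h>

#define NR_NODES	6
#define REMOTE_DISTANCE	20	/* SLIT convention for a remote node */

/*
 * Hypothetical helper: copy 'in' to 'out', replacing every remote
 * distance (>= REMOTE_DISTANCE) with the rounded average of all
 * remote distances. Local distances are copied unchanged.
 */
static void flatten_remote(int in[NR_NODES][NR_NODES],
			   int out[NR_NODES][NR_NODES])
{
	int i, j, sum = 0, cnt = 0, avg = 0;

	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			if (in[i][j] >= REMOTE_DISTANCE) {
				sum += in[i][j];
				cnt++;
			}
		}
	}

	/* Round to nearest; skip the division if nothing is remote. */
	if (cnt)
		avg = (sum + cnt / 2) / cnt;

	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			assert(in[i][j] > 0);
			out[i][j] = in[i][j] >= REMOTE_DISTANCE ? avg : in[i][j];
		}
	}
}
```

Fed the GNR-X table quoted earlier, every remote entry becomes 24 and
the in-package distances (10, 15, 17) are left alone, which reproduces
the table above.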