Message-ID: <65e8f7e3f4bc039f529a2ed6cbad68e121a26306.camel@linux.intel.com>
Date: Mon, 25 Aug 2025 14:36:47 -0700
From: Tim Chen <tim.c.chen@...ux.intel.com>
To: Peter Zijlstra <peterz@...radead.org>, "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall
<bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
<vschneid@...hat.com>, Tim Chen <tim.c.chen@...el.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Libo Chen <libo.chen@...cle.com>, Abel Wu
<wuyun.abel@...edance.com>, Len Brown <len.brown@...el.com>,
linux-kernel@...r.kernel.org, K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Zhao Liu
<zhao1.liu@...el.com>, Vinicius Costa Gomes <vinicius.gomes@...el.com>,
Chen Yu <yu.chen.surf@...mail.com>, Arjan van de Ven <arjan@...ux.intel.com>
Subject: Re: [PATCH 2/2] sched: Fix sched domain build error for GNR-X,
CWF-X in SNC-3 mode
On Mon, 2025-08-25 at 09:56 +0200, Peter Zijlstra wrote:
> >
... snip ...
> > > > > > + /*
> > > > > > + * Trim finer distance tuning for nodes in remote package
> > > > > > + * for the purpose of building sched domains.
> > > > > > + * Put NUMA nodes in each remote package in a single sched group.
> > > > > > + * Simplify NUMA domains and avoid extra NUMA levels including different
> > > > > > + * NUMA nodes in remote packages.
> > > > > > + *
> > > > > > + * GNR-X and CWF-X have a GLUELESS-MESH topology with SNC
> > > > > > + * turned on.
> > > > > > + */
> > > > > > + d = (d / 10) * 10;
> > > >
> > > > Does the '10' here mean that sockets are spaced 10 apart in the
> > > > SLIT table? For example, from socket0's point of view, the
> > > > distance to socket1 is within [20, 29], the distance to socket2
> > > > is within [30, 39], and so on. If this is the case, maybe add a
> > > > comment above for future reference.
> >
> > This is all because of the ACPI SLIT distance definitions I suppose, 10
> > for local and 20 for remote (which IMO is actively wrong, since it
> > mandates distances that are not relative performance).
> >
> > Additionally, the table above magically has all the remote distances in
> > the range of [20,29] and so the strip 1s thing works.
> >
> > The problem of course is that the SLIT table is fully under control of
> > the BIOS, and a random BIOS monkey could break that assumption, making
> > the above code not work as intended. E.g. if the remote distances end
> > up in the range of [20,35] or whatever, then it all goes sideways.
> >
> > ( There is a history of manipulating the SLIT table to influence
> > scheduler behaviour of OS of choice :-/ )
> >
> > Similarly, when doing a 4 node system, it is possible that a 2-hop
> > distance doesn't align nicely with the 10s and we're up a creek again.
We don't expect 4 node systems for GNR or CWF, so hopefully we don't need to
worry about them. Otherwise we may need additional code to check for 2 hops.
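
For illustration, here is a minimal userspace sketch of the trimming and of
the [20,35] case raised above. It is not the patch code; the helper names
trim_distance() and trims_to_single_level() are made up for this example.

```c
/* Userspace illustration only; both helper names are hypothetical. */

/* Integer division drops the ones digit: 21..29 -> 20, 35 -> 30. */
static int trim_distance(int d)
{
	return (d / 10) * 10;
}

/*
 * Return 1 if every remote distance collapses to the same value after
 * trimming, i.e. the remote package ends up in a single NUMA level.
 */
static int trims_to_single_level(const int *remote, int n)
{
	int i;

	for (i = 1; i < n; i++)
		if (trim_distance(remote[i]) != trim_distance(remote[0]))
			return 0;
	return 1;
}
```

Remote distances such as {21, 28, 26, 23} all trim to 20, while a set like
{21, 35} trims to 20 and 30 and would still produce an extra NUMA level.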
> >
> > This is all very fragile. A much better way would be to allocate a new
> > SLIT table, identify the (local) clusters and replace all remote
> > instances with an average.
Are you suggesting that we keep two distance tables: one simplified for
building sched domains and another holding the true node distances?
> >
> > Eg. since (21+28+26+23+26+23+26+23+21)/9 ~ 24, you end up with:
> >
> > node   0  1  2  3  4  5
> >    0: 10 15 17 24 24 24
> >    1: 15 10 15 24 24 24
> >    2: 17 15 10 24 24 24
> >    3: 24 24 24 10 15 17
> >    4: 24 24 24 15 10 15
> >    5: 24 24 24 17 15 10
> >
> >
I will take a closer look at using an average for the nodes
in a remote package.
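
A rough sketch of what the averaging could look like, assuming a 2-package
system and that the node-to-package mapping is already known. NR_NODES, the
pkg[] array and the function name are hypothetical, not existing kernel
interfaces, and the sketch works on a private copy of the distances rather
than on the SLIT itself.

```c
#define NR_NODES 6	/* assumed: 2 packages x 3 SNC nodes */

/*
 * Replace every cross-package distance with the integer average of all
 * cross-package distances. Distances within a package are left alone.
 * With more than 2 packages a per-package-pair average would likely be
 * needed instead of this single global one.
 */
static void average_remote_distances(int dist[NR_NODES][NR_NODES],
				     const int pkg[NR_NODES])
{
	int i, j, avg, sum = 0, cnt = 0;

	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			if (pkg[i] != pkg[j]) {
				sum += dist[i][j];
				cnt++;
			}
		}
	}

	if (!cnt)
		return;		/* single package, nothing to average */

	avg = sum / cnt;

	for (i = 0; i < NR_NODES; i++)
		for (j = 0; j < NR_NODES; j++)
			if (pkg[i] != pkg[j])
				dist[i][j] = avg;
}
```

The nine remote values quoted above sum to 217, so integer division by 9
gives 24, which matches the table.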
Tim