Message-ID: <20250512082320.274-1-rakie.kim@sk.com>
Date: Mon, 12 May 2025 17:23:14 +0900
From: Rakie Kim <rakie.kim@...com>
To: Jonathan Cameron <Jonathan.Cameron@...wei.com>
Cc: Rakie Kim <rakie.kim@...com>,
	joshua.hahnjy@...il.com,
	akpm@...ux-foundation.org,
	linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	linux-cxl@...r.kernel.org,
	dan.j.williams@...el.com,
	ying.huang@...ux.alibaba.com,
	kernel_team@...ynix.com,
	honggyu.kim@...com,
	yunjeong.mun@...com,
	"Keith Busch" <kbusch@...nel.org>,
	Jerome Glisse <jglisse@...gle.com>,
	Gregory Price <gourry@...rry.net>
Subject: Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave

On Fri, 9 May 2025 12:31:31 +0100 Jonathan Cameron <Jonathan.Cameron@...wei.com> wrote:
> On Thu, 8 May 2025 11:12:35 -0400
> Gregory Price <gourry@...rry.net> wrote:
> 
> > On Thu, May 08, 2025 at 03:30:36PM +0900, Rakie Kim wrote:
> > > On Wed, 7 May 2025 12:38:18 -0400 Gregory Price <gourry@...rry.net> wrote:
> > > 
> > > The proposed design is completely optional and isolated: it retains the
> > > existing flat weight model as-is and activates the source-aware behavior only
> > > when 'multi' mode is enabled. The complexity is scoped entirely to users who
> > > opt into this mode.
> > >   
> > 
> > I get what you're going for, just expressing my experience around this
> > issue specifically.
> > 
> > The lack of enthusiasm for solving the cross-socket case, and thus the
> > reduction from a 2D array to a 1D array, was because reasoning about
> > interleave with cross-socket interconnects is not really feasible under
> > the NUMA abstraction.  Cross-socket interconnects are "invisible" but
> > have real performance implications.  Unless we have a way to:
> 
> Sort of invisible...  We don't see what their topology is, but we do have some info...
> 
> > 
> > 1) Represent the topology, AND
> > 2) Get performance data about that topology
> 
> There was some discussion on this at LSF-MM.
> 
> +CC Keith and Jerome who were once interested in this topic
> 
> It's not perfect, but ACPI HMAT does have what is probably sufficient info
> for a simple case like this (a 2-socket server + Generic Ports and the CXL
> description of the rest of the path); it's just that today we aren't exposing
> that to userspace (instead only the BW / latency from a single selected
> nearest initiator / CPU node to any memory-containing node).
> 
> That decision was much discussed back when Keith was adding HMAT support.
> At that time the question was what workload needed the dense info (2D matrix)
> and we didn't have one.  With weighted interleave I think we do.
> 
> As to the problems...
> 
> We come unstuck badly in much more complex situations because that
> information is load-free, so if we have heavy contention on one shared link
> between islands of nodes it can give a very misleading picture.
> 
>   [CXL Node 0]                         [CXL Node 2]
>        |                                    |
>    [NODE A]---\                    /----[NODE C]
>                \___Shared link____/ 
>                /                  \
>    [NODE B]---/                    \----[NODE D]
>        |                                   |
>   [CXL Node 1]                         [CXL Node 3]
> 
> From ACPI, this looks much like the following (a fully connected
> 4-socket system).
> 
>   [CXL Node 0]                         [CXL Node 2]
>        |                                    |
>    [NODE A]-----------------------------[NODE C]
>        |   \___________________________   / | 
>        |    ____________________________\/  |  
>        |   /                             \  | 
>    [NODE B]-----------------------------[NODE D]
>        |                                   |
>   [CXL Node 1]                         [CXL Node 3]
> 
> In the first case we should probably halve the BW of the shared link, or
> something like that; in the second case, use the full figures. In general we
> have no way to know which one we have, and it gets way more fun with 8+ sockets :)
> 
> SLIT is indeed useless for anything other than "what's nearest" decisions.
> 
> Anyhow, short term I'd like us to revisit what info we present from HMAT
> (and what we get from CXL topology descriptions which have pretty much everything we
> might want).
> 
> That should give userspace the info to tune weighted interleave better
> anyway, and perhaps provide the info you need here.
> 
> So just all the other problems to solve ;)
> 
> J

Jonathan, thank you very much for your thoughtful response.

As you pointed out, ACPI HMAT and CXL topology descriptions do contain
meaningful information for simple systems such as two-socket platforms.
If that information were made more accessible to userspace, I believe
existing memory policies could be tuned with much greater precision.
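
For what it's worth, even the attributes already exposed today allow a
first-order tuning pass. Below is a minimal sketch (not part of this RFC;
NODE_MAX and the scaling rule are arbitrary assumptions for illustration)
that reads the access-class-0 read bandwidth the kernel publishes per
memory node and scales it into flat interleave weights. It assumes a
recent kernel with HMAT support and firmware that actually populates the
tables.

/*
 * Minimal sketch, not part of this RFC: read the access-class-0
 * read bandwidth exposed per memory node (HMAT derived, MB/s) and
 * turn it into flat weighted-interleave weights scaled to the
 * slowest node.
 */
#include <stdio.h>

#define NODE_MAX 8	/* arbitrary assumption for illustration */

static long read_bw(int node)
{
	char path[128];
	long bw = -1;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/access0/initiators/read_bandwidth",
		 node);
	f = fopen(path, "r");
	if (!f)
		return -1;	/* node absent or no HMAT data */
	if (fscanf(f, "%ld", &bw) != 1)
		bw = -1;
	fclose(f);
	return bw;
}

int main(void)
{
	long bw[NODE_MAX], min_bw = -1;
	int node;

	for (node = 0; node < NODE_MAX; node++) {
		bw[node] = read_bw(node);
		if (bw[node] > 0 && (min_bw < 0 || bw[node] < min_bw))
			min_bw = bw[node];
	}

	/* Slowest node gets weight 1; this mirrors the existing flat,
	 * source-agnostic model rather than the per-socket proposal. */
	for (node = 0; node < NODE_MAX; node++)
		if (bw[node] > 0)
			printf("node%d: bw=%ld MB/s weight=%ld\n",
			       node, bw[node], bw[node] / min_bw);
	return 0;
}

Even this crude scaling only captures the DRAM vs. CXL asymmetry as seen
from a single socket, which is all the flat model can express anyway.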

I fully understand that such detailed topology data was not widely
exposed in the past, largely because there was little demand for it.
However, with the growing complexity of memory hierarchies in modern
systems, I believe its relevance and utility are increasing rapidly.

I also appreciate your point about the risks of misrepresentation in
more complex systems, especially where shared interconnect links can
cause bandwidth bottlenecks. That nuance is critical to consider when
designing or interpreting any policy relying on topology data.
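
To make that concrete with made-up numbers (purely illustrative, not
measured on any platform): suppose HMAT reports 20 GB/s to each remote
CXL node, while two of those nodes actually sit behind the one shared
30 GB/s cross-socket link, as in your first diagram.

#include <stdio.h>

int main(void)
{
	long reported    = 20000;	/* MB/s per remote node, as HMAT reports it */
	long shared_link = 30000;	/* MB/s, real capacity of the shared link   */
	long sharers     = 2;		/* nodes contending for that link           */

	/* Effective per-node bandwidth once both nodes are under pressure. */
	long effective = shared_link / sharers;
	if (effective > reported)
		effective = reported;

	printf("reported  %ld MB/s (naive weight basis)\n", reported);
	printf("effective %ld MB/s (what contention allows)\n", effective);
	/* The naive basis overstates remote bandwidth by a third here,
	 * which is roughly the "halve the BW of the shared link"
	 * correction suggested above. */
	return 0;
}

Weights derived from the reported figure would steer roughly a third more
traffic at the shared link than it can sustain once both nodes are busy.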

In the short term, I fully agree that revisiting what information is
presented from HMAT and the CXL topology descriptions, and how we surface
it to userspace, is a realistic and meaningful direction.
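
And once better figures are surfaced, feeding them back into the existing
flat interface stays straightforward; a hedged sketch of that step is
below. The sysfs path and the 1-255 weight range reflect my understanding
of the current weighted-interleave interface, and the node numbers and
weights are invented purely for illustration.

#include <stdio.h>

static int set_weight(int node, unsigned int weight)
{
	char path[96];
	FILE *f;

	if (weight == 0 || weight > 255)
		return -1;	/* the interface takes small positive integers */

	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", node);
	f = fopen(path, "w");
	if (!f)
		return -1;	/* needs privilege; node may not exist */
	fprintf(f, "%u\n", weight);
	return fclose(f) ? -1 : 0;
}

int main(void)
{
	/* Invented example: favour local DRAM (node 0) 4:1 over a CXL node (node 2). */
	if (set_weight(0, 4) || set_weight(2, 1))
		fprintf(stderr, "failed to apply weights\n");
	return 0;
}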

Thank you again for your insights, and I look forward to continuing the
discussion.

Rakie

> 
> > 
> > It's not useful. So NUMA is an incomplete (if not wrong) tool for this.
> > 
> > Additionally - reacting to task migration is not a real issue.  If
> > you're deploying an allocation strategy, you probably don't want your
> > task migrating away from the place where you just spent a bunch of time
> > allocating based on some existing strategy.  So the solution is: don't
> > migrate, and if you do - don't use cross-socket interleave.
> > 
> > Maybe if we solve the first half of this we can take a look at the task
> > migration piece again, but I wouldn't try to solve for migration.
> > 
> > At the same time we were discussing this, we were also discussing how to
> > do external task-mempolicy modifications - which seemed significantly
> > more useful, but ultimately more complex and without sufficient
> > interested parties / users.
> > 
> > ~Gregory
> > 
> 
> 
