Message-ID: <20250509123131.0000051b@huawei.com>
Date: Fri, 9 May 2025 12:31:31 +0100
From: Jonathan Cameron <Jonathan.Cameron@...wei.com>
To: Gregory Price <gourry@...rry.net>
CC: Rakie Kim <rakie.kim@...com>, <joshua.hahnjy@...il.com>,
	<akpm@...ux-foundation.org>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, <linux-cxl@...r.kernel.org>,
	<dan.j.williams@...el.com>, <ying.huang@...ux.alibaba.com>,
	<kernel_team@...ynix.com>, <honggyu.kim@...com>, <yunjeong.mun@...com>,
	"Keith Busch" <kbusch@...nel.org>, Jerome Glisse <jglisse@...gle.com>
Subject: Re: [RFC] Add per-socket weight support for multi-socket systems in
 weighted interleave

On Thu, 8 May 2025 11:12:35 -0400
Gregory Price <gourry@...rry.net> wrote:

> On Thu, May 08, 2025 at 03:30:36PM +0900, Rakie Kim wrote:
> > On Wed, 7 May 2025 12:38:18 -0400 Gregory Price <gourry@...rry.net> wrote:
> > 
> > The proposed design is completely optional and isolated: it retains the
> > existing flat weight model as-is and activates the source-aware behavior only
> > when 'multi' mode is enabled. The complexity is scoped entirely to users who
> > opt into this mode.
> >   
> 
> I get what you're going for, just expressing my experience around this
> issue specifically.
> 
> The lack of enthusiasm for solving the cross-socket case, and thus the
> reduction from a 2D array to a 1D array, was because reasoning about
> interleave w/ cross-socket interconnects is not really feasible with
> the NUMA abstraction.  Cross-socket interconnects are "invisible" but
> have real performance implications.  Unless we have a way to:

Sort of invisible... We don't know what their topology is, but we do have some info...

> 
> 1) Represent the topology, AND
> 2) Get performance data about that topology

There was some discussion on this at LSF-MM.

+CC Keith and Jerome, who were once interested in this topic.

It's not perfect, but ACPI HMAT does have what is probably sufficient info
for a simple case like this (a 2 socket server, plus Generic Ports and the
CXL description of the rest of the path). It's just that today we aren't
exposing that to userspace; instead we expose only the BW / latency from a
single selected nearest initiator / CPU node to each memory-containing node.

That decision was much discussed back when Keith was adding HMAT support.
At that time the question was what workload needed the dense info (the full
2D matrix), and we didn't have one.  With weighted interleave I think we do.
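
For context, the values we do expose today sit under each node's access
class in sysfs (see Documentation/admin-guide/mm/numaperf.rst). A minimal
userspace sketch of reading them (node count assumed; note it only ever
sees the nearest-initiator value, never the full initiator x target
matrix):

/* Read the HMAT-derived attributes exposed today: one value per memory
 * node, from the single nearest initiator only. */
#include <stdio.h>

static long read_attr(int node, const char *attr)
{
	char path[256];
	FILE *f;
	long val = -1;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/access0/initiators/%s",
		 node, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;		/* no HMAT data for this node */
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

int main(void)
{
	int node;

	for (node = 0; node < 4; node++)	/* 4 nodes assumed */
		printf("node%d: read_bandwidth %ld MB/s, read_latency %ld ns\n",
		       node, read_attr(node, "read_bandwidth"),
		       read_attr(node, "read_latency"));
	return 0;
}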

As to the problems...

We come unstuck badly in much more complex situations, because that
information is load-free: if we have heavy contention due to one shared
link between islands of nodes, it can give a very misleading idea.

  [CXL Node 0]                         [CXL Node 2]
       |                                    |
   [NODE A]---\                    /----[NODE C]
               \___Shared link____/
               /                  \
   [NODE B]---/                    \----[NODE D]
       |                                    |
  [CXL Node 1]                         [CXL Node 3]

From ACPI, this looks much like the following (a fully connected
4-socket system):

  [CXL Node 0]                         [CXL Node 2]
       |                                    |
   [NODE A]-----------------------------[NODE C]
       |   \___________________________   / |
       |    ____________________________\/  |
       |   /                             \  |
   [NODE B]-----------------------------[NODE D]
       |                                    |
  [CXL Node 1]                         [CXL Node 3]

In the first case we should probably halve the BW of the shared link, or
something like that; in the second case, use the full value. In general we
have no way to know which one we have, and it gets way more fun with 8+
sockets :)
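
To make the "halve the BW of the shared link" idea concrete, here's a toy
sketch. Everything topology-related in it is hand-coded and assumed (which
node pairs cross the link, its capacity, the HMAT figures); that
hand-coding is exactly the information ACPI doesn't give us:

#include <stdio.h>

#define NR_NODES	4

/* Made-up MB/s figures, as HMAT might report them for the first diagram. */
static const unsigned int hmat_bw[NR_NODES][NR_NODES] = {
	{      0, 100000,  40000,  40000 },
	{ 100000,      0,  40000,  40000 },
	{  40000,  40000,      0, 100000 },
	{  40000,  40000, 100000,      0 },
};

static const unsigned int shared_link_bw = 40000;	/* assumed capacity */

/* Nodes 0,1 form one island; nodes 2,3 the other. */
static int crosses_link(int src, int dst)
{
	return (src < 2) != (dst < 2);
}

/* Effective BW when 'users' src/dst pairs contend for the link at once:
 * the HMAT figure, capped by this pair's share of the link. */
static unsigned int effective_bw(int src, int dst, int users)
{
	unsigned int bw = hmat_bw[src][dst];
	unsigned int share = shared_link_bw / users;

	if (crosses_link(src, dst) && share < bw)
		bw = share;
	return bw;
}

int main(void)
{
	printf("A->C alone: %u MB/s; with two pairs contending: %u MB/s\n",
	       effective_bw(0, 2, 1), effective_bw(0, 2, 2));
	return 0;
}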

SLIT is indeed useless for anything other than "what's nearest" decisions.
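
(Those values are at least visible: each node's row of the matrix is in
/sys/devices/system/node/nodeN/distance, but they are dimensionless
relative distances, 10 meaning local, so there is no bandwidth or latency
to derive from them. A quick sketch dumping them, node count again
assumed:)

/* Dump the SLIT distance matrix from sysfs. */
#include <stdio.h>

int main(void)
{
	char path[128], line[256];
	int node;

	for (node = 0; node < 4; node++) {	/* 4 nodes assumed */
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", node);
		f = fopen(path, "r");
		if (!f)
			break;
		if (fgets(line, sizeof(line), f))
			printf("node%d: %s", node, line);	/* row ends in '\n' */
		fclose(f);
	}
	return 0;
}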

Anyhow, short term I'd like us to revisit what info we present from HMAT
(and what we get from the CXL topology descriptions, which have pretty much
everything we might want).

That should put the info userspace needs to tune weighted interleave better
anyway, and perhaps provide the info you need here.
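
As a sketch of what userspace could then do with that info: scale each
node's bandwidth into the 1..255 range the weighted interleave sysfs knobs
(/sys/kernel/mm/mempolicy/weighted_interleave/nodeN) accept. The bandwidth
figures here are made up, and linear scaling is just one plausible policy:

#include <stdio.h>

#define NR_NODES	4

int main(void)
{
	/* Made-up per-node bandwidths, e.g. as read back from HMAT/sysfs. */
	unsigned long bw[NR_NODES] = { 200000, 200000, 60000, 60000 };
	unsigned long max = 0;
	char path[128];
	int i;

	for (i = 0; i < NR_NODES; i++)
		if (bw[i] > max)
			max = bw[i];

	for (i = 0; i < NR_NODES; i++) {
		unsigned long w = bw[i] * 255 / max;	/* scale to 1..255 */
		FILE *f;

		if (!w)
			w = 1;
		snprintf(path, sizeof(path),
			 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d",
			 i);
		f = fopen(path, "w");
		if (!f)
			continue;	/* knob absent on older kernels */
		fprintf(f, "%lu\n", w);
		fclose(f);
	}
	return 0;
}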

So just all the other problems to solve ;)

J

> 
> It's not useful. So NUMA is an incomplete (if not wrong) tool for this.
> 
> Additionally - reacting to task migration is not a real issue.  If
> you're deploying an allocation strategy, you probably don't want your
> task migrating away from the place where you just spent a bunch of time
> allocating based on some existing strategy.  So the solution is: don't
> migrate, and if you do - don't use cross-socket interleave.
> 
> Maybe if we solve the first half of this we can take a look at the task
> migration piece again, but I wouldn't try to solve for migration.
> 
> At the same time we were discussing this, we were also discussing how to
> do external task-mempolicy modifications - which seemed significantly
> more useful, but ultimately more complex and without sufficient
> interested parties / users.
> 
> ~Gregory
> 

