Message-ID: <20250507093517.184-1-rakie.kim@sk.com>
Date: Wed, 7 May 2025 18:35:16 +0900
From: rakie.kim@...com
To: gourry@...rry.net,
joshua.hahnjy@...il.com
Cc: akpm@...ux-foundation.org,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
linux-cxl@...r.kernel.org,
dan.j.williams@...el.com,
ying.huang@...ux.alibaba.com,
kernel_team@...ynix.com,
honggyu.kim@...com,
yunjeong.mun@...com,
rakie.kim@...com
Subject: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave

Hi Gregory, Joshua,

I hope this message finds you well. I'm writing to discuss a feature I
believe would enhance the flexibility of the weighted interleave policy:
support for per-socket weighting in multi-socket systems.
---
<Background and prior design context>

While reviewing the early versions of the weighted interleave patches,
I noticed that a source-aware weighting structure was included in v1:
https://lore.kernel.org/all/20231207002759.51418-1-gregory.price@memverge.com/

However, this structure was removed in a later version:
https://lore.kernel.org/all/20231209065931.3458-1-gregory.price@memverge.com/

Unfortunately, I was unable to participate in the discussion at that
time, and I sincerely apologize for missing it.

From what I understand, there may have been valid reasons for removing
the source-relative design, including:

1. Increased complexity in mempolicy internals. Adding source awareness
   introduces challenges around dynamic nodemask changes, task policy
   sharing during fork(), mbind(), rebind(), etc.
2. A lack of concrete, motivating use cases. At that stage, it might
   have been more pragmatic to focus on a 1D flat weight array.

If there were additional reasons, I would be grateful to learn them.

That said, I would like to revisit this idea now, as I believe some
real-world NUMA configurations would benefit significantly from
reintroducing this capability.
---
<Motivation: realistic multi-socket memory topologies>

The system I am testing includes multiple CPU sockets, each with local
DRAM and directly attached CXL memory. Here's a simplified diagram:

      node0             node1
    +-------+   UPI   +-------+
    | CPU 0 |-+-----+-| CPU 1 |
    +-------+         +-------+
    | DRAM0 |         | DRAM1 |
    +---+---+         +---+---+
        |                 |
    +---+---+         +---+---+
    | CXL 0 |         | CXL 1 |
    +-------+         +-------+
      node2             node3

This type of system is becoming more common, and in my tests, I
encountered two scenarios where per-socket weighting would be highly
beneficial.
Let's assume the following NUMA bandwidth matrix (GB/s):

          0     1     2     3
    0   300   150   100    50
    1   150   300    50   100

And flat weights:

    node0 = 3
    node1 = 3
    node2 = 1
    node3 = 1
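
With these flat weights, every task sees the same split no matter which
socket it runs on; spelling out the arithmetic implied above:

    total weight   = 3 + 3 + 1 + 1 = 8
    node0 (DRAM0)  = 3/8 = 37.5% of interleaved pages
    node1 (DRAM1)  = 3/8 = 37.5%
    node2 (CXL 0)  = 1/8 = 12.5%
    node3 (CXL 1)  = 1/8 = 12.5%

So even a task pinned to node0 still sends 12.5% of its pages to node3,
which it can only reach at 50 GB/s in the matrix above.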
---
Scenario 1: Adapt weighting based on the task's execution node

Many applications can achieve reasonable performance just by using the
CXL memory on their local socket. However, most workloads do not pin
tasks to a specific CPU node, and the current implementation does not
adjust weights based on where the task is running.

If per-source-node weighting were available, the following matrix could
be used:

          0     1     2     3
    0     3     0     1     0
    1     0     3     0     1

Which means:

1. A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
2. A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
3. A large, multithreaded task spanning both sockets would get both sets
   of weights

This flexibility is currently not possible with a single flat weight
array.
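
To make the intended selection concrete, here is a small user-space
sketch (purely illustrative; it is not how mempolicy.c implements
weighted interleave) that walks one row of the 2D table above, chosen
by the node the task is currently running on:

#include <stdio.h>

#define NR_NODES 4

/* Per-source weight table from Scenario 1: one row per CPU node. */
static const int weights[2][NR_NODES] = {
        { 3, 0, 1, 0 },         /* task on node0: DRAM0=3, CXL0=1 */
        { 0, 3, 0, 1 },         /* task on node1: DRAM1=3, CXL1=1 */
};

/* Illustrative weighted round-robin over the row for src_node. */
static int next_node(int src_node, int *cur, int *left)
{
        const int *row = weights[src_node];

        for (;;) {
                if (*left > 0) {
                        (*left)--;
                        return *cur;
                }
                *cur = (*cur + 1) % NR_NODES;
                *left = row[*cur];
        }
}

int main(void)
{
        for (int src = 0; src < 2; src++) {
                int cur = 0, left = weights[src][0];

                printf("task on node%d allocates from:", src);
                for (int i = 0; i < 8; i++)
                        printf(" node%d", next_node(src, &cur, &left));
                printf("\n");
        }
        return 0;
}

A task on node0 ends up with the 3:1 DRAM0/CXL0 pattern and a task on
node1 with the DRAM1/CXL1 one, which is exactly what the matrix is meant
to express.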
---
Scenario 2: Reflect relative memory access performance

Remote memory access (e.g., from node0 to node3) incurs a real bandwidth
penalty. Ideally, weights should reflect this. For example:

Bandwidth-based matrix:

          0     1     2     3
    0     6     3     2     1
    1     3     6     1     2

Or DRAM + local CXL only:

          0     1     2     3
    0     6     0     2     1
    1     0     6     1     2

While scenario 1 is probably more common in practice, both can be
expressed within the same design if per-socket weights are supported.
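
For what it's worth, the bandwidth-based rows above are simply each row
of the bandwidth matrix reduced by its greatest common divisor. A small
user-space sketch of that reduction (my own illustration; the in-kernel
auto-tuning may well choose a different scaling):

#include <stdio.h>

#define NR_NODES 4

static int gcd(int a, int b)
{
        while (b) {
                int t = a % b;

                a = b;
                b = t;
        }
        return a;
}

int main(void)
{
        /* Bandwidth (GB/s) from each CPU node (row) to each memory node. */
        const int bw[2][NR_NODES] = {
                { 300, 150, 100,  50 },         /* seen from node0 */
                { 150, 300,  50, 100 },         /* seen from node1 */
        };

        for (int src = 0; src < 2; src++) {
                int g = bw[src][0];

                for (int n = 1; n < NR_NODES; n++)
                        g = gcd(g, bw[src][n]);

                printf("row %d weights:", src);
                for (int n = 0; n < NR_NODES; n++)
                        printf(" %d", bw[src][n] / g);
                printf("\n");
        }
        return 0;
}

Running it reproduces 6/3/2/1 for node0 and 3/6/1/2 for node1.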
---
<Proposed approach>

Instead of removing the current sysfs interface or flat weight logic, I
propose introducing an optional "multi" mode for per-socket weights.
This would allow users to opt into source-aware behavior. (The name
'multi' is only a placeholder; a better name can be chosen later.)
Draft sysfs layout:

    /sys/kernel/mm/mempolicy/weighted_interleave/
    +-- multi                (bool: enable per-socket mode)
    +-- node0                (flat weight for legacy/default mode)
    +-- node_groups/
        +-- node0_group/
        |   +-- node0        (weight of node0 when running on node0)
        |   +-- node1
        +-- node1_group/
            +-- node0
            +-- node1

- When `multi` is false (default), existing behavior applies
- When `multi` is true, the kernel uses the task's current NUMA node
  (e.g. via numa_node_id()) to select a row in a 2D weight table
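
To illustrate how the draft interface might be used, the snippet below
configures Scenario 1 for node0 from user space. The paths follow the
proposed layout above and do not exist today, and it assumes each
nodeN_group exposes one attribute per memory node (the tree above
abbreviates the list):

#include <stdio.h>

/* Write one value to a (proposed, currently hypothetical) sysfs file. */
static int sysfs_write(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%s\n", val);
        return fclose(f);
}

int main(void)
{
        const char *base = "/sys/kernel/mm/mempolicy/weighted_interleave";
        char path[256];

        /* Opt in to the proposed per-socket ("multi") mode. */
        snprintf(path, sizeof(path), "%s/multi", base);
        sysfs_write(path, "1");

        /* Scenario 1, row for node0: DRAM0 (node0) = 3, CXL0 (node2) = 1. */
        snprintf(path, sizeof(path), "%s/node_groups/node0_group/node0", base);
        sysfs_write(path, "3");
        snprintf(path, sizeof(path), "%s/node_groups/node0_group/node2", base);
        sysfs_write(path, "1");

        return 0;
}

The equivalent writes for node1_group would cover the second row of the
matrix.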
---
<Additional implementation considerations>

1. Compatibility: The proposal avoids breaking the current interface or
   behavior and remains backward-compatible.
2. Auto-tuning: Scenario 1 (local CXL + DRAM) likely works with minimal
   change. Scenario 2 (bandwidth-aware tuning) would require more
   development, and I would welcome Joshua's input on this.
3. Zero weights: Currently the minimum weight is 1. We may want to allow
   zero to fully support asymmetric exclusion.
---
<Next steps>

Before beginning an implementation, I would like to validate this
direction with both of you:

- Does this approach fit with your current design intentions?
- Do you foresee problems with complexity, policy sharing, or interface?
- Is there a better alternative to express this idea?

If there's interest, I would be happy to send an RFC patch or prototype.

Thank you for your time and consideration.
Sincerely,
Rakie