linux-kernel - Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Z8-_SXm0JGjXTegL@gourry-fedora-PF4VCD3F>
Date: Tue, 11 Mar 2025 00:42:49 -0400
From: Gregory Price <gourry@...rry.net>
To: Yunjeong Mun <yunjeong.mun@...com>
Cc: kernel_team@...ynix.com, Joshua Hahn <joshua.hahnjy@...il.com>,
	harry.yoo@...cle.com, ying.huang@...ux.alibaba.com,
	gregkh@...uxfoundation.org, rakie.kim@...com,
	akpm@...ux-foundation.org, rafael@...nel.org, lenb@...nel.org,
	dan.j.williams@...el.com, Jonathan.Cameron@...wei.com,
	dave.jiang@...el.com, horen.chuang@...ux.dev, hannes@...xchg.org,
	linux-kernel@...r.kernel.org, linux-acpi@...r.kernel.org,
	linux-mm@...ck.org, kernel-team@...a.com,
	Honggyu Kim <honggyu.kim@...com>
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for
 memoryless nodes

On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:

forenote - Hi Andrew, please hold off on the auto-configuration patch
for now, the sk group has identified a hotplug issue we need to work out
and we'll likely need to merge these two patch set together.  I really
appreciate your patience with this feature.

> Hi Gregory,
>
> In my understanding, the reason we are seeing 12 NUMA node is because
> it loops through node_states[N_POSSIBLE] and its value is 4095 (twelves ones)
> in the code [1]  below:
> 
... snip ...

Appreciated, so yes this confirms what i thought was going on.  There's
4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
socket that is intended to interleave across the host bridges.

As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
node per SRAT Memory Affinity Entry - and then also 1 NUMA node per
CFMWS that doesn't have a matching SRAT entry (with a known corner case
for a missing SRAT which doesn't apply here).

So essentialy what the system is doing is marking that it's absolutely
possible to create 1 region per device and also 1 region that
interleaves across host each pair of host bridges (I presume this is a
dual socket system?).

So, tl;dr: All these nodes are valid and this configuration is correct.

Weighted interleave presently works fine as intended, but with the
inclusion of the auto-configuration, there will be issues for your
system configuration. This means we probably need to consider
merging these as a group.

During boot, the following will occur

1) drivers/acpi/numa/srat.c marks 12 nodes as possible
   0-1) Socket nodes
   2-3) Cross-host-bridge interleave nodes
   4-11) single region nodes

2) drivers/cxl/* will probe the various devices and create
   a root decoder for each CXL Fixed Memory Window
   decoder0.0 - decoder11.0  (or maybe decoder0.0 - decoder0.11)

3) during probe auto-configuration of wieghted interleave occurs as a
   result of this code being called with hmat or cdat data:

void node_set_perf_attrs() {
...
	/* When setting CPU access coordinates, update mempolicy */
	if (access == ACCESS_COORDINATE_CPU) {
		if (mempolicy_set_node_perf(nid, coord)) {
			pr_info("failed to set mempolicy attrs for node %d\n",
				nid);
		}
	}
...
}

under the current system, since we calculate with N_POSSIBLE, all nodes
will be assigned weights (assuming HMAT or CDAT data is available for
all of them).

We actually have a few issues here

1) If all nodes are included in the weighting reduction, we're actually
   over-representing a particular set of hardware.  The interleave node
   and the individual device nodes would actually over-represent the
   bandwidth available (comparative to the CPU nodes).

2) As stated on this patch line, just switching to N_MEMORY causes
   issues with hotplug - where the bandwidth can be reported, but if
   memory hasn't been added yet then we'll end up with wrong weights
   because it wasn't included in the calculation.

3) However, not exposing the nodes because N_MEMORY isn't set yet
   a) prevents pre-configuration before memory is onlined, and
   b) hides the implications of hotplugging memory into a node from the
      user (adding memory causes a re-weight and may affect an
      interleave-all configuration).

but - i think it's reasonable that anyone using weighted-interleave is
*probably* not going to have nodes come and go.  It just seems like a
corner case that isn't reasonable to spend time supporting.

So coming back around to the hotplug patch line, I do think it's
reasonable hide nodes marked !N_MEMORY, but consider two issues:

1) In auto mode, we need to re-weight on hotplug to only include
   onlined nodes.  This is because the reduction may be sensitive
   to the available bandwidth changes.

   This behavior needs to be clearly documented.

2) We need to clearly define what the weight of a node will be when
   in manual mode and a node goes (memory -> no memory -> memory)
   a) does it retain it's old, manually set weight?
   b) does it revert to 1?

Sorry for the long email, just working through all the implications.

I think the proposed hotplug patch is a requirement for the
auto-configuration patch set.

~Gregory