Message-ID: <20250311095151.446-1-yunjeong.mun@sk.com>
Date: Tue, 11 Mar 2025 18:51:45 +0900
From: Yunjeong Mun <yunjeong.mun@...com>
To: Gregory Price <gourry@...rry.net>
Cc: kernel_team@...ynix.com,
Joshua Hahn <joshua.hahnjy@...il.com>,
harry.yoo@...cle.com,
ying.huang@...ux.alibaba.com,
gregkh@...uxfoundation.org,
rakie.kim@...com,
akpm@...ux-foundation.org,
rafael@...nel.org,
lenb@...nel.org,
dan.j.williams@...el.com,
Jonathan.Cameron@...wei.com,
dave.jiang@...el.com,
horen.chuang@...ux.dev,
hannes@...xchg.org,
linux-kernel@...r.kernel.org,
linux-acpi@...r.kernel.org,
linux-mm@...ck.org,
kernel-team@...a.com,
Honggyu Kim <honggyu.kim@...com>
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes
Hi Gregory, thanks for the kind explanation.
On Tue, 11 Mar 2025 00:42:49 -0400 Gregory Price <gourry@...rry.net> wrote:
> On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:
>
> forenote - Hi Andrew, please hold off on the auto-configuration patch
> for now; the SK group has identified a hotplug issue we need to work out,
> and we'll likely need to merge these two patch sets together. I really
> appreciate your patience with this feature.
>
> > Hi Gregory,
> >
> > In my understanding, the reason we are seeing 12 NUMA nodes is because
> > it loops through node_states[N_POSSIBLE], whose value is 4095 (twelve ones)
> > in the code [1] below:
> >
> ... snip ...
>
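Just to make the "4095 (twelve ones)" number concrete, here is a tiny
userspace check (plain C, not kernel code; the mask value is the one we
observed on our system) of how many possible nodes that mask represents,
counted the same way nodes_weight() counts bits in a nodemask:

#include <stdio.h>

int main(void)
{
        unsigned long possible = 4095;  /* node_states[N_POSSIBLE] observed on our system */
        int count = 0;

        /* count set bits, like nodes_weight() does for a nodemask */
        for (; possible; possible >>= 1)
                count += possible & 1;

        printf("%d possible nodes\n", count);  /* prints 12 */
        return 0;
}
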
> Appreciated; so yes, this confirms what I thought was going on. There are
> 4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
> socket that is intended to interleave across the host bridges.
>
> As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
> node per SRAT Memory Affinity Entry - and then also 1 NUMA node per
> CFMWS that doesn't have a matching SRAT entry (with a known corner case
> for a missing SRAT which doesn't apply here).
>
> So essentially what the system is doing is marking that it's absolutely
> possible to create 1 region per device and also 1 region that
> interleaves across each pair of host bridges (I presume this is a
> dual socket system?).
Correct, it is a dual socket system. Thank you for the detailed explanation.
It has been helpful for analyzing the code.
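To sanity-check the arithmetic, the node count works out like this (a
throwaway C snippet; the topology numbers are just the ones from your
description above, 4 host bridges with 2 devices each plus one interleave
CFMWS per socket):

#include <stdio.h>

int main(void)
{
        int sockets = 2;                  /* dual socket system */
        int host_bridges = 4;             /* 2 per socket */
        int devices_per_host_bridge = 2;  /* 8 CXL devices in total */

        int socket_nodes     = sockets;   /* SRAT entries: nodes 0-1 */
        int interleave_nodes = sockets;   /* one interleave CFMWS per socket: nodes 2-3 */
        int device_nodes     = host_bridges * devices_per_host_bridge;  /* nodes 4-11 */

        printf("possible nodes: %d\n",
               socket_nodes + interleave_nodes + device_nodes);  /* prints 12 */
        return 0;
}
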
>
> So, tl;dr: All these nodes are valid and this configuration is correct.
>
> Weighted interleave presently works fine as intended, but with the
> inclusion of the auto-configuration, there will be issues for your
> system configuration. This means we probably need to consider
> merging these as a group.
>
We believe our proposed hotplug patch should be added as a hotfix to v6.14,
because it can be addressed independently of the auto-configuration feature.
Rakie will send the v2 patch soon.
> During boot, the following will occur
>
> 1) drivers/acpi/numa/srat.c marks 12 nodes as possible
> 0-1) Socket nodes
> 2-3) Cross-host-bridge interleave nodes
> 4-11) single region nodes
>
> 2) drivers/cxl/* will probe the various devices and create
> a root decoder for each CXL Fixed Memory Window
> decoder0.0 - decoder11.0 (or maybe decoder0.0 - decoder0.11)
>
> 3) during probe, auto-configuration of weighted interleave occurs as a
> result of this code being called with HMAT or CDAT data:
>
> void node_set_perf_attrs() {
> ...
> /* When setting CPU access coordinates, update mempolicy */
> if (access == ACCESS_COORDINATE_CPU) {
> if (mempolicy_set_node_perf(nid, coord)) {
> pr_info("failed to set mempolicy attrs for node %d\n",
> nid);
> }
> }
> ...
> }
>
> Under the current system, since we calculate with N_POSSIBLE, all nodes
> will be assigned weights (assuming HMAT or CDAT data is available for
> all of them).
>
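For my own understanding I also tried to model the weighting with a small
userspace sketch. It assumes the reduction is something like "divide each
node's bandwidth by the GCD over the set of nodes being considered"; the
exact math in the auto-tuning patch may differ, and the bandwidth numbers
are made up. It does show how the result changes depending on whether
memoryless nodes are part of the calculation, which I think is the same
problem you describe in the issues below:

#include <stdio.h>

/* made-up bandwidths (GB/s) indexed by node id: two CPU nodes, two CXL nodes */
static const int bw[4] = { 400, 400, 120, 120 };

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

static void reduce(const char *label, unsigned int mask)
{
        int g = 0, nid;

        for (nid = 0; nid < 4; nid++)
                if (mask & (1u << nid))
                        g = gcd(g, bw[nid]);

        printf("%s:", label);
        for (nid = 0; nid < 4; nid++)
                if (mask & (1u << nid))
                        printf(" node%d=%d", nid, bw[nid] / g);
        printf("\n");
}

int main(void)
{
        reduce("all possible nodes    ", 0xf);  /* weights 10:10:3:3 */
        reduce("only nodes with memory", 0x3);  /* weights 1:1       */
        return 0;
}
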
> We actually have a few issues here
>
> 1) If all nodes are included in the weighting reduction, we're actually
> over-representing a particular set of hardware. The interleave node
> and the individual device nodes would over-represent the
> bandwidth available (compared to the CPU nodes).
>
> 2) As stated on this patch line, just switching to N_MEMORY causes
> issues with hotplug: the bandwidth can be reported, but if
> memory hasn't been added yet, we'll end up with wrong weights
> because the node wasn't included in the calculation.
>
> 3) However, not exposing the nodes because N_MEMORY isn't set yet
> a) prevents pre-configuration before memory is onlined, and
> b) hides from the user the implications of hotplugging memory into
> a node (adding memory causes a re-weight and may affect an
> interleave-all configuration).
>
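Regarding 2) and 3): the direction of our patch is simply to skip
memoryless nodes when the sysfs entries are created, and to let them show
up once memory is actually onlined. A userspace toy of that decision (the
masks here are made up; the real patch keys off N_MEMORY rather than these
literal values):

#include <stdio.h>

int main(void)
{
        unsigned long possible = 0xfff;  /* 12 possible nodes, as on our system */
        unsigned long memory   = 0x00f;  /* pretend only nodes 0-3 have memory  */
        int nid;

        for (nid = 0; nid < 12; nid++) {
                if (!(possible & (1UL << nid)))
                        continue;
                if (memory & (1UL << nid))
                        printf("node %2d: has memory -> create weight sysfs entry\n", nid);
                else
                        printf("node %2d: memoryless -> skip (appears only after hotplug)\n", nid);
        }
        return 0;
}
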
> But I think it's reasonable to assume that anyone using weighted
> interleave is *probably* not going to have nodes come and go. It just
> seems like a corner case that isn't worth spending time supporting.
>
> So coming back around to the hotplug patch line, I do think it's
> reasonable to hide nodes marked !N_MEMORY, but consider two issues:
>
> 1) In auto mode, we need to re-weight on hotplug to only include
> onlined nodes. This is because the reduction may be sensitive
> to changes in the available bandwidth.
>
> This behavior needs to be clearly documented.
>
> 2) We need to clearly define what the weight of a node will be when
> in manual mode and a node goes (memory -> no memory -> memory)
> a) does it retain its old, manually set weight?
> b) does it revert to 1?
>
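On 1), one shape the re-weight could take is a memory hotplug notifier.
A rough sketch of what I have in mind (illustrative only: recalc_weights()
is a placeholder for whatever the auto-tuning code ends up exposing, and
this is not a tested module):

#include <linux/module.h>
#include <linux/memory.h>
#include <linux/notifier.h>

/* placeholder: redo the bandwidth -> weight reduction over N_MEMORY nodes */
static void recalc_weights(void)
{
}

static int wi_mem_notifier(struct notifier_block *nb,
                           unsigned long action, void *arg)
{
        switch (action) {
        case MEM_ONLINE:
        case MEM_OFFLINE:
                /* a node gained or lost memory: weights need recalculation */
                recalc_weights();
                break;
        }
        return NOTIFY_OK;
}

static struct notifier_block wi_mem_nb = {
        .notifier_call = wi_mem_notifier,
};

static int __init wi_hotplug_init(void)
{
        return register_memory_notifier(&wi_mem_nb);
}

static void __exit wi_hotplug_exit(void)
{
        unregister_memory_notifier(&wi_mem_nb);
}

module_init(wi_hotplug_init);
module_exit(wi_hotplug_exit);
MODULE_LICENSE("GPL");
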
> Sorry for the long email, just working through all the implications.
>
> I think the proposed hotplug patch is a requirement for the
> auto-configuration patch set.
>
> ~Gregory
>
Best regards,
Yunjeong