linux-kernel - Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Z8cnUA9WqsscbUtm@gourry-fedora-PF4VCD3F>
Date: Tue, 4 Mar 2025 11:16:16 -0500
From: Gregory Price <gourry@...rry.net>
To: Honggyu Kim <honggyu.kim@...com>
Cc: kernel_team@...ynix.com, Joshua Hahn <joshua.hahnjy@...il.com>,
	harry.yoo@...cle.com, ying.huang@...ux.alibaba.com,
	gregkh@...uxfoundation.org, rakie.kim@...com,
	akpm@...ux-foundation.org, rafael@...nel.org, lenb@...nel.org,
	dan.j.williams@...el.com, Jonathan.Cameron@...wei.com,
	dave.jiang@...el.com, horen.chuang@...ux.dev, hannes@...xchg.org,
	linux-kernel@...r.kernel.org, linux-acpi@...r.kernel.org,
	linux-mm@...ck.org, kernel-team@...a.com, yunjeong.mun@...com
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for
 memoryless nodes

On Tue, Mar 04, 2025 at 10:03:22PM +0900, Honggyu Kim wrote:
> Hi Gregory,
> 
> > This patch may have been a bit overzealous of us, I forgot to ask
> > whether N_MEMORY is set for nodes created but not onlined at boot. So
> > this is a good observation.
> 
> I didn't want to make more noise but we found many issues again after
> getting a new machine and started using it with multiple CXL memory.
> 

I spent yesterday looking into how nodes are created and marked N_MEMORY
and I think now that this patch is just not correct.

N_MEMORY for a given nid is toggled:
  1) during mm_init if any page is associated with that node (DRAM)
  2) memory_hotplug when a memory block is onlined/offlined  (CXL)

This means a CXL node which is deferred to the driver will come up as
memoryless at boot (mm_init) but has N_MEMORY toggled on when the first
hotplug memory block is onlined.  However, its access_coordinate data is
reported during cxl driver probe - well prior to memory hotplug.

This means we must expose a node entry for every possible node, always,
because we can't predict what nodes will have hotplug memory.

We COULD try to react to hotplug memory blocks, but this increase in
complexity just doesn't seem worth the hassle - the hotplug callback has
timing restrictions (callback must occur AFTER N_MEMORY is toggled).

It seems better to include all nodes with reported data in the reduction.

This has two downsides:
  1) stale data may be used if hotplug occurs and the new device does
     not have CDAT/HMAT/access_coordinate data.
  2) any device without CDAT/HMAT/access_coordinate data will not be
     included in the reduction by default.

I think we can work around #2 by detecting this (on reduction, if data
is missing but N_MEMORY is set, fire a warning).  We can't do much about
#1 unless we field physical device hot-un-plug callbacks - and that
seems like a bit much.

~Gregory