Message-ID: <Z8cnUA9WqsscbUtm@gourry-fedora-PF4VCD3F>
Date: Tue, 4 Mar 2025 11:16:16 -0500
From: Gregory Price <gourry@...rry.net>
To: Honggyu Kim <honggyu.kim@...com>
Cc: kernel_team@...ynix.com, Joshua Hahn <joshua.hahnjy@...il.com>,
harry.yoo@...cle.com, ying.huang@...ux.alibaba.com,
gregkh@...uxfoundation.org, rakie.kim@...com,
akpm@...ux-foundation.org, rafael@...nel.org, lenb@...nel.org,
dan.j.williams@...el.com, Jonathan.Cameron@...wei.com,
dave.jiang@...el.com, horen.chuang@...ux.dev, hannes@...xchg.org,
linux-kernel@...r.kernel.org, linux-acpi@...r.kernel.org,
linux-mm@...ck.org, kernel-team@...a.com, yunjeong.mun@...com
Subject: Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for
memoryless nodes
On Tue, Mar 04, 2025 at 10:03:22PM +0900, Honggyu Kim wrote:
> Hi Gregory,
>
> > This patch may have been a bit overzealous of us, I forgot to ask
> > whether N_MEMORY is set for nodes created but not onlined at boot. So
> > this is a good observation.
>
> I didn't want to make more noise but we found many issues again after
> getting a new machine and started using it with multiple CXL memory.
>
I spent yesterday looking into how nodes are created and marked N_MEMORY
and I think now that this patch is just not correct.

N_MEMORY for a given nid is toggled:
  1) during mm_init if any page is associated with that node (DRAM)
  2) memory_hotplug when a memory block is onlined/offlined (CXL)

This means a CXL node which is deferred to the driver will come up as
memoryless at boot (mm_init) but has N_MEMORY toggled on when the first
hotplug memory block is onlined. However, its access_coordinate data is
reported during cxl driver probe - well prior to memory hotplug.

This means we must expose a node entry for every possible node, always,
because we can't predict what nodes will have hotplug memory.
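
To make the ordering concrete, here is a toy user-space model of the
sequence described above. It is only a sketch, not kernel code: the
struct, the toy_* helpers and TOY_MAX_NODES are all made up for
illustration, and the two flags merely stand in for N_MEMORY and for
"access_coordinate data has been reported".

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NODES 4 /* hypothetical stand-in for the possible-node set */

/* Toy per-node state; these are not the kernel's data structures. */
struct toy_node {
	bool n_memory;  /* models the N_MEMORY node state */
	bool has_coord; /* models "access_coordinate data was reported" */
};

static struct toy_node nodes[TOY_MAX_NODES];

/* Models boot: only nodes that already own pages (DRAM) get N_MEMORY. */
static void toy_boot(const bool has_pages[TOY_MAX_NODES])
{
	for (size_t nid = 0; nid < TOY_MAX_NODES; nid++) {
		nodes[nid].n_memory = has_pages[nid];
		nodes[nid].has_coord = false;
	}
}

/* Models driver probe: coordinates arrive, N_MEMORY is untouched. */
static void toy_driver_probe(size_t nid)
{
	nodes[nid].has_coord = true;
}

/* Models onlining the first hotplug memory block on a node. */
static void toy_online_block(size_t nid)
{
	nodes[nid].n_memory = true;
}

/* How many nodes would get an entry if we filtered on N_MEMORY right now. */
static size_t toy_count_n_memory(void)
{
	size_t count = 0;

	for (size_t nid = 0; nid < TOY_MAX_NODES; nid++)
		if (nodes[nid].n_memory)
			count++;
	return count;
}
```

In this model, calling toy_boot() with only node 0 populated and then
toy_driver_probe(1) leaves node 1 with data but without the N_MEMORY
flag, so a filter applied at that point would skip it. The flag only
shows up after toy_online_block(1).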
We COULD try to react to hotplug memory blocks, but this increase in
complexity just doesn't seem worth the hassle - the hotplug callback has
timing restrictions (callback must occur AFTER N_MEMORY is toggled).

It seems better to include all nodes with reported data in the reduction.

This has two downsides:
  1) stale data may be used if hotplug occurs and the new device does
     not have CDAT/HMAT/access_coordinate data.
  2) any device without CDAT/HMAT/access_coordinate data will not be
     included in the reduction by default.

I think we can work around #2 by detecting this (on reduction, if data
is missing but N_MEMORY is set, fire a warning). We can't do much about
#1 unless we field physical device hot-un-plug callbacks - and that
seems like a bit much.
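
A rough sketch of the check for #2, again as a self-contained user-space
model rather than the actual patch. The red_node struct, red_collect()
and the bandwidth field are hypothetical; summing bandwidth is just a
placeholder for whatever the real reduction computes, and fprintf()
stands in for a kernel warning.

```c
#include <stdbool.h>
#include <stdio.h>

#define RED_MAX_NODES 4 /* hypothetical stand-in for the possible-node set */

/* Toy per-node state; not the kernel's data structures. */
struct red_node {
	bool n_memory;          /* models the N_MEMORY node state */
	bool has_coord;         /* models "access_coordinate data was reported" */
	unsigned int bandwidth; /* arbitrary units, valid only if has_coord */
};

/*
 * Walk every node. Nodes with reported data go into the reduction whether
 * or not N_MEMORY is set yet. Nodes that have memory but no data are left
 * out and produce a warning. Returns the number of warnings and stores the
 * summed bandwidth of the included nodes in *total_bw.
 */
static int red_collect(const struct red_node *n, int count,
		       unsigned int *total_bw)
{
	int warnings = 0;

	*total_bw = 0;
	for (int nid = 0; nid < count; nid++) {
		if (n[nid].has_coord) {
			*total_bw += n[nid].bandwidth;
			continue;
		}
		if (n[nid].n_memory) {
			fprintf(stderr,
				"node %d: N_MEMORY set but no access_coordinate data\n",
				nid);
			warnings++;
		}
	}
	return warnings;
}
```

Note this does nothing for #1: a node whose has_coord flag was set by a
previous device keeps contributing its old bandwidth value.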
~Gregory