Message-ID: <20260206162338.000035c8@huawei.com>
Date: Fri, 6 Feb 2026 16:23:38 +0000
From: Jonathan Cameron <jonathan.cameron@...wei.com>
To: Andrew Morton <akpm@...ux-foundation.org>
CC: Gregory Price <gourry@...rry.net>, Cui Chao <cuichao1753@...tium.com.cn>,
<dan.j.williams@...el.com>, Mike Rapoport <rppt@...nel.org>, Wang Yinfeng
<wangyinfeng@...tium.com.cn>, <linux-cxl@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
<qemu-devel@...gnu.org>, "David Hildenbrand (Arm)" <david@...nel.org>
Subject: Re: [PATCH v2 1/1] mm: numa_memblks: Identify the accurate NUMA ID
of CFMW
On Fri, 6 Feb 2026 07:57:09 -0800
Andrew Morton <akpm@...ux-foundation.org> wrote:
> On Fri, 6 Feb 2026 15:09:41 +0000 Jonathan Cameron <jonathan.cameron@...wei.com> wrote:
>
> > > Andrew if Jonathan is good with it then with changelog updates this can
> > > go in, otherwise I don't think this warrants a backport or anything.
> >
> > Wait and see if anyone hits it on a real machine (or even a non-creative QEMU
> > setup!). So for now, no need to backport.
>
> Thanks, all.
>
> Below is the current state of this patch. Is the changelog suitable?
Hi Andrew
Not quite..
>
>
> From: Cui Chao <cuichao1753@...tium.com.cn>
> Subject: mm: numa_memblks: identify the accurate NUMA ID of CFMW
> Date: Tue, 6 Jan 2026 11:10:42 +0800
>
> In some physical memory layout designs, the address space of CFMW (CXL
> Fixed Memory Window) resides between multiple segments of system memory
> belonging to the same NUMA node. In numa_cleanup_meminfo, these multiple
> segments of system memory are merged into a larger numa_memblk. When
> identifying which NUMA node the CFMW belongs to, it may be incorrectly
> assigned to the NUMA node of the merged system memory.
>
> When a CXL RAM region is created in userspace, the memory capacity of
> the newly created region is not added to the CFMW-dedicated NUMA node.
> Instead, it is accumulated into an existing NUMA node (e.g., NUMA0
> containing RAM). This makes it impossible to clearly distinguish
> between the two types of memory, which may affect memory-tiering
> applications.
>
> Example memory layout:
>
> Physical address space:
> 0x00000000 - 0x1FFFFFFF System RAM (node0)
> 0x20000000 - 0x2FFFFFFF CXL CFMW (node2)
> 0x40000000 - 0x5FFFFFFF System RAM (node0)
> 0x60000000 - 0x7FFFFFFF System RAM (node1)
>
> After numa_cleanup_meminfo, the two node0 segments are merged into one:
> 0x00000000 - 0x5FFFFFFF System RAM (node0) // CFMW is inside the range
> 0x60000000 - 0x7FFFFFFF System RAM (node1)
>
> So the CFMW (0x20000000-0x2FFFFFFF) will be incorrectly assigned to node0.
>
> To address this scenario, determine the correct NUMA node by checking
> whether the address is covered by both numa_meminfo and
> numa_reserved_meminfo, and prefer the numa_reserved_meminfo match when
> both contain it.
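As an aside, it might be worth spelling out in the log what the visible
difference is on the example layout below (illustrative only, not output
from a real run):

	/*
	 * Before: 0x20000000 falls inside the merged 0x00000000-0x5FFFFFFF
	 * node0 block in numa_meminfo, so the reserved table is never
	 * consulted.
	 */
	phys_to_target_node(0x20000000);	/* returns node0 */

	/*
	 * After: the address also matches the CFMW entry kept in
	 * numa_reserved_meminfo, and that match now takes precedence.
	 */
	phys_to_target_node(0x20000000);	/* returns node2 */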
>
>
> 1. Issue Impact and Backport Recommendation:
>
> This patch fixes an issue on hardware platforms (not QEMU emulation)
I think this bit turned out to be a bit misleading. Cui Chao
clarified in:
https://lore.kernel.org/all/a90bc6f2-105c-4ffc-99d9-4fa5eaa79c45@phytium.com.cn/
"This issue was discovered on the QEMU platform. I need to apologize for
my earlier imprecise statement (claiming it was hardware instead of
QEMU). My core point at the time was to emphasize that this is a problem
in the general code path when facing this scenario, not a QEMU-specific
emulation issue, and therefore it could theoretically affect real
hardware as well. I apologize for any confusion this may have caused."
So, whilst this could happen on a real hardware platform, for now we aren't
aware of a suitable configuration actually happening. I'm not sure we can
even create it in QEMU without some tweaks.
Other than relaxing this to perhaps say that a hardware platform 'might'
have a configuration like this, the description here looks good to me.
Thanks!
Jonathan
> where, during the dynamic creation of a CXL RAM region, the memory
> capacity is not assigned to the correct CFMW-dedicated NUMA node. This
> issue leads to:
>
> Failure of the memory tiering mechanism: The system is designed to
> treat System RAM as fast memory and CXL memory as slow memory. For
> performance optimization, hot pages may be migrated to fast memory
> while cold pages are migrated to slow memory. The system uses NUMA
> IDs as an index to identify different tiers of memory. If the NUMA
> ID for CXL memory is calculated incorrectly and its capacity is
> aggregated into the NUMA node containing System RAM (i.e., the node
> for fast memory), the CXL memory cannot be correctly identified. It
> may be misjudged as fast memory, thereby affecting performance
> optimization strategies.
>
> Inability to distinguish between System RAM and CXL memory even for
> simple manual binding: Tools like numactl and other NUMA policy
> utilities cannot differentiate between System RAM and CXL memory,
> making it impossible to perform reasonable memory binding.
>
> Inaccurate system reporting: Tools like numactl -H would display
> memory capacities that do not match the actual physical hardware
> layout, impacting operations and monitoring.
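One concrete way this shows up (illustrative only, workload name made up):
with the CFMW capacity on its own node, something like

	numactl --cpunodebind=0 --membind=2 ./my_workload

can pin a job's allocations to the CXL node (node2 in the example above),
and numactl -H then reports that capacity under node2 rather than folded
into node0. Without the fix there is no capacity on node2 to bind to.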
>
> This issue affects all users utilizing the CXL RAM functionality who
> rely on memory tiering or NUMA-aware scheduling. Such configurations
> are becoming increasingly common in data centers, cloud computing, and
> high-performance computing scenarios.
>
> Therefore, I recommend backporting this patch to all stable kernel
> series that support dynamic CXL region creation.
>
> 2. Why a Kernel Update is Recommended Over a Firmware Update:
>
> In the scenario of dynamic CXL region creation, the association between
> the memory's HPA range and its corresponding NUMA node is established
> when the kernel driver performs the commit operation. This is a
> runtime, OS-managed operation where the platform firmware cannot
> intervene to provide a fix.
>
> Given factors such as hardware platform architecture and memory
> resources, such a physical address layout can indeed occur.
> This patch does not introduce risk; it simply correctly handles the
> NUMA node assignment for CXL RAM regions within such a physical address
> layout.
>
> Thus, I believe a kernel fix is necessary.
>
> Link: https://lkml.kernel.org/r/20260106031042.1606729-2-cuichao1753@phytium.com.cn
> Fixes: 779dd20cfb56 ("cxl/region: Add region creation support")
> Signed-off-by: Cui Chao <cuichao1753@...tium.com.cn>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@...wei.com>
> Cc: Mike Rapoport <rppt@...nel.org>
> Cc: Wang Yinfeng <wangyinfeng@...tium.com.cn>
> Cc: Dan Williams <dan.j.williams@...el.com>
> Cc: Gregory Price <gourry@...rry.net>
> Cc: Jonathan Cameron <Jonathan.Cameron@...wei.com>
> Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
> ---
>
> mm/numa_memblks.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> --- a/mm/numa_memblks.c~mm-numa_memblks-identify-the-accurate-numa-id-of-cfmw
> +++ a/mm/numa_memblks.c
> @@ -570,15 +570,16 @@ static int meminfo_to_nid(struct numa_me
> int phys_to_target_node(u64 start)
> {
> int nid = meminfo_to_nid(&numa_meminfo, start);
> + int reserved_nid = meminfo_to_nid(&numa_reserved_meminfo, start);
>
> /*
> * Prefer online nodes, but if reserved memory might be
> * hot-added continue the search with reserved ranges.
> */
> - if (nid != NUMA_NO_NODE)
> + if (nid != NUMA_NO_NODE && reserved_nid == NUMA_NO_NODE)
> return nid;
>
> - return meminfo_to_nid(&numa_reserved_meminfo, start);
> + return reserved_nid;
> }
> EXPORT_SYMBOL_GPL(phys_to_target_node);
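For anyone wanting to trace where this ends up mattering: when a ram
region is committed, the dax side of the CXL stack picks the node for the
hot-added capacity via this helper, roughly along the lines of the sketch
below (region_start standing in for the region's HPA base; written from
memory, so treat it as illustrative rather than the exact code):

	int nid = phys_to_target_node(region_start);

	/* fall back if no target node is known for this address */
	if (nid == NUMA_NO_NODE)
		nid = memory_add_physaddr_to_nid(region_start);

	/* nid is then the node dax/kmem onlines the capacity into, and
	 * the node the memory tiering code keys off. */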
>
> _
>
>