Message-ID: <8293a3bb-9a82-48d3-a011-bbab4e15a5b8@fujitsu.com>
Date: Thu, 21 Aug 2025 02:30:41 +0000
From: "Zhijian Li (Fujitsu)" <lizhijian@...itsu.com>
To: Alison Schofield <alison.schofield@...el.com>
CC: "dan.j.williams@...el.com" <dan.j.williams@...el.com>, Smita Koralahalli
<Smita.KoralahalliChannabasappa@....com>, "linux-cxl@...r.kernel.org"
<linux-cxl@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "nvdimm@...ts.linux.dev"
<nvdimm@...ts.linux.dev>, "linux-fsdevel@...r.kernel.org"
<linux-fsdevel@...r.kernel.org>, "linux-pm@...r.kernel.org"
<linux-pm@...r.kernel.org>, Davidlohr Bueso <dave@...olabs.net>, Jonathan
Cameron <jonathan.cameron@...wei.com>, Dave Jiang <dave.jiang@...el.com>,
Vishal Verma <vishal.l.verma@...el.com>, Ira Weiny <ira.weiny@...el.com>,
Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>, "Rafael J .
Wysocki" <rafael@...nel.org>, Len Brown <len.brown@...el.com>, Pavel Machek
<pavel@...nel.org>, Li Ming <ming.li@...omail.com>, Jeff Johnson
<jeff.johnson@....qualcomm.com>, Ying Huang <huang.ying.caritas@...il.com>,
"Xingtao Yao (Fujitsu)" <yaoxt.fnst@...itsu.com>, Peter Zijlstra
<peterz@...radead.org>, Greg KH <gregkh@...uxfoundation.org>, Nathan Fontenot
<nathan.fontenot@....com>, Terry Bowman <terry.bowman@....com>, Robert
Richter <rrichter@....com>, Benjamin Cheatham <benjamin.cheatham@....com>,
PradeepVineshReddy Kodamati <PradeepVineshReddy.Kodamati@....com>, "Yasunori
Gotou (Fujitsu)" <y-goto@...itsu.com>
Subject: Re: [PATCH v5 3/7] cxl/acpi: Add background worker to coordinate with
cxl_mem probe completion
On 21/08/2025 07:14, Alison Schofield wrote:
> On Tue, Aug 05, 2025 at 03:58:41AM +0000, Zhijian Li (Fujitsu) wrote:
>> Hi Dan and Smita,
>>
>>
>> On 24/07/2025 00:13, dan.j.williams@...el.com wrote:
>>> dan.j.williams@ wrote:
>>> [..]
>>>> If the goal is: "I want to give device-dax a point at which it can make
>>>> a go / no-go decision about whether the CXL subsystem has properly
>>>> assembled all CXL regions implied by Soft Reserved intersecting with
>>>> CXL Windows." Then that is something like the below, only lightly tested
>>>> and likely regresses the non-CXL case.
>>>>
>>>> -- 8< --
>>>> From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
>>>> From: Dan Williams <dan.j.williams@...el.com>
>>>> Date: Tue, 22 Jul 2025 16:11:08 -0700
>>>> Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
>>>
>>> Likely needs this incremental change to prevent DEV_DAX_HMEM from being
>>> built-in when CXL is not. This still leaves the awkward scenario of CXL
>>> enabled, DEV_DAX_CXL disabled, and DEV_DAX_HMEM built-in. I believe that
>>> safely fails in devdax only / fallback mode, but something to
>>> investigate when respinning on top of this.
>>>
>>
>> Thank you for your RFC; I find your proposal remarkably compelling, as it adeptly addresses the issues I am currently facing.
>>
>>
>> To begin with, I still encountered several issues with your patch (considering the patch at the RFC stage, I think it is already quite commendable):
>
> Hi Zhijian,
>
> Like you, I tried this RFC out. It resolved the issue of soft reserved
> resources preventing teardown and replacement of a region in place.
>
> I looked at the issues you found, and have some questions and comments
> included below.
>
>>
>> 1. Some resources described by SRAT are wrongly identified as System RAM (kmem), such as the following: 200000000-5bffffff.
>>
>> ```
>> 200000000-5bffffff : dax6.0
>> 200000000-5bffffff : System RAM (kmem)
>> 5c0001128-5c00011b7 : port1
>> 5d0000000-64ffffff : CXL Window 0
>> 5d0000000-64ffffff : region0
>> 5d0000000-64ffffff : dax0.0
>> 5d0000000-64ffffff : System RAM (kmem)
>> 680000000-e7ffffff : PCI Bus 0000:00
>>
>> [root@...a-server ~]# dmesg | grep -i -e soft -e hotplug
>> [ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan+ root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 softlockup_panic=1 printk.devkmsg=on oops=panic sysrq_always_enabled panic_on_warn ignore_loglevel kasan.fault=panic
>> [ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
>> [ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064ffffff] soft reserved
>> [ 0.072114] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bffffff] hotplug
>> ```
>
> Is that range also labelled as soft reserved?
> I ask, because I'm trying to draw a parallel between our test platforms.

No, it's not a soft reserved range. This can easily be simulated in QEMU with the `maxmem=192G` option (see the full QEMU command line below).
In my environment, `0x200000000-0x5bfffffff` is something like [DRAM_END + 1, DRAM_END + maxmem - TOTAL_INSTALLED_DRAM_SIZE],
where DRAM_END is the end of the installed DRAM in Node 3.
This range is reserved for DRAM hot-add. In my case, it is registered as one of the 'HMEM devices' by the HMAT code (drivers/acpi/numa/hmat.c) calling hmem_register_resource():

static void hmat_register_target_devices(struct memory_target *target)
{
	struct resource *res;

	/*
	 * Do not bother creating devices if no driver is available to
	 * consume them.
	 */
	if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
		return;

	for (res = target->memregions.child; res; res = res->sibling) {
		int target_nid = pxm_to_node(target->memory_pxm);

		hmem_register_resource(target_nid, res);
	}
}

$ dmesg | grep -i -e soft -e hotplug -e Node
[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan-00026-g1473b9914846-dirty root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 conc
[ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
[ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft reserved
[ 0.066332] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.067665] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
[ 0.068995] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x17fffffff]
[ 0.070359] ACPI: SRAT: Node 2 PXM 2 [mem 0x180000000-0x1bfffffff]
[ 0.071723] ACPI: SRAT: Node 3 PXM 3 [mem 0x1c0000000-0x1ffffffff]
[ 0.073085] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
[ 0.075689] NUMA: Node 0 [mem 0x00001000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00001000-0x7fffffff]
[ 0.077849] NODE_DATA(0) allocated [mem 0x7ffb3e00-0x7ffdefff]
[ 0.079149] NODE_DATA(1) allocated [mem 0x17ffd1e00-0x17fffcfff]
[ 0.086077] Movable zone start for each node
[ 0.087054] Early memory node ranges
[ 0.087890] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.089264] node 0: [mem 0x0000000000100000-0x000000007ffdefff]
[ 0.090631] node 1: [mem 0x0000000100000000-0x000000017fffffff]
[ 0.092003] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffdefff]
[ 0.093532] Initmem setup node 1 [mem 0x0000000100000000-0x000000017fffffff]
[ 0.095164] Initmem setup node 2 as memoryless
[ 0.096281] Initmem setup node 3 as memoryless
[ 0.097397] Initmem setup node 4 as memoryless
[ 0.098444] On node 0, zone DMA: 1 pages in unavailable ranges
[ 0.099866] On node 0, zone DMA: 97 pages in unavailable ranges
[ 0.104342] On node 1, zone Normal: 33 pages in unavailable ranges
[ 0.126883] CPU topo: Allowing 4 present CPUs plus 0 hotplug CPUs
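
As a rough cross-check of the hot-add window described above, here is a small userspace C sketch (not kernel code). It only does arithmetic on values copied from the SRAT lines and the QEMU command line in this mail; the helper names `range_size()` and `kib_to_gib()` are made up for illustration.

```c
#include <stdint.h>

#define GIB (1ULL << 30)

/* Bytes spanned by an inclusive [start, end] range, as dmesg prints them. */
uint64_t range_size(uint64_t start, uint64_t end)
{
	return end - start + 1;
}

/* Convert a KiB count (e.g. QEMU's maxmem=19922944k) to whole GiB. */
uint64_t kib_to_gib(uint64_t kib)
{
	return kib >> 20;
}

/* Values copied from this mail, not queried from a live system. */
const uint64_t node3_dram_end = 0x1ffffffffULL;	/* end of SRAT Node 3 DRAM */
const uint64_t hotplug_start  = 0x200000000ULL;	/* SRAT "hotplug" range start */
const uint64_t hotplug_end    = 0x5bfffffffULL;	/* SRAT "hotplug" range end */
```

With these inputs the window starts at DRAM_END + 1 and spans 15 GiB, while maxmem minus the installed DRAM is 19 GiB - 6 GiB = 13 GiB, so the formula is only approximate, in line with the "something like" wording.
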
=================================
Please note that this is a modified QEMU.
/home/lizhijian/qemu/build-hmem/qemu-system-x86_64 -machine q35,accel=kvm,cxl=on,hmat=on \
-name guest-rdma-server -nographic -boot c \
-m size=6G,slots=2,maxmem=19922944k \
-hda /home/lizhijian/images/Fedora-rdma-server.qcow2 \
-object memory-backend-memfd,share=on,size=2G,id=m0 \
-object memory-backend-memfd,share=on,size=2G,id=m1 \
-numa node,nodeid=0,cpus=0-1,memdev=m0 \
-numa node,nodeid=1,cpus=2-3,memdev=m1 \
-smp 4,sockets=2,cores=2 \
-device pcie-root-port,id=pci-root,slot=8,bus=pcie.0,chassis=0 \
-device pxb-cxl,id=pxb-cxl-host-bridge,bus=pcie.0,bus_nr=0x35,hdm_for_passthrough=true \
-device cxl-rp,id=cxl-rp-hb-rp0,bus=pxb-cxl-host-bridge,chassis=0,slot=0,port=0 \
-device cxl-type3,bus=cxl-rp-hb-rp0,volatile-memdev=cxl-vmem0,id=cxl-vmem0,program-hdm-decoder=true \
-object memory-backend-file,id=cxl-vmem0,share=on,mem-path=/home/lizhijian/images/cxltest0.raw,size=2048M \
-M cxl-fmw.0.targets.0=pxb-cxl-host-bridge,cxl-fmw.0.size=2G,cxl-fmw.0.interleave-granularity=8k \
-nic bridge,br=virbr0,model=e1000,mac=52:54:00:c9:76:74 \
-bios /home/lizhijian/seabios/out/bios.bin \
-object memory-backend-memfd,share=on,size=1G,id=m2 \
-object memory-backend-memfd,share=on,size=1G,id=m3 \
-numa node,memdev=m2,nodeid=2 \
-numa node,memdev=m3,nodeid=3 \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=0,dst=2,val=21 \
-numa dist,src=0,dst=3,val=21 \
-numa dist,src=1,dst=0,val=21 \
-numa dist,src=1,dst=1,val=10 \
-numa dist,src=1,dst=2,val=21 \
-numa dist,src=1,dst=3,val=21 \
-numa dist,src=2,dst=0,val=21 \
-numa dist,src=2,dst=1,val=21 \
-numa dist,src=2,dst=2,val=10 \
-numa dist,src=2,dst=3,val=21 \
-numa dist,src=3,dst=0,val=21 \
-numa dist,src=3,dst=1,val=21 \
-numa dist,src=3,dst=2,val=21 \
-numa dist,src=3,dst=3,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=110 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=240 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
-numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
-numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
-numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
-numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M \
-numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-latency,latency=240 \
-numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
-numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-latency,latency=110 \
-numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
-numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
-numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
-numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
-numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M
> I see -
>
> [] BIOS-e820: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
> .
> .
> [] reserve setup_data: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
> .
> .
> [] ACPI: SRAT: Node 6 PXM 14 [mem 0x24080000000-0x4407fffffff] hotplug
>
> /proc/iomem - as expected
> 24080000000-5f77fffffff : CXL Window 0
> 24080000000-4407fffffff : region0
> 24080000000-4407fffffff : dax0.0
> 24080000000-4407fffffff : System RAM (kmem)
>
>
> I'm also seeing this message:
> [] resource: Unaddressable device [mem 0x24080000000-0x4407fffffff] conflicts with [mem 0x24080000000-0x4407fffffff]
>
>>
>> 2. Triggers dev_warn and dev_err:
>>
>> ```
>> [root@...a-server ~]# journalctl -p err -p warning --dmesg
>> ...snip...
>> Jul 29 13:17:36 rdma-server kernel: cxl root0: Extended linear cache calculation failed rc:-2
>> Jul 29 13:17:36 rdma-server kernel: hmem hmem.1: probe with driver hmem failed with error -12
>> Jul 29 13:17:36 rdma-server kernel: hmem hmem.2: probe with driver hmem failed with error -12
>> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: mapping0: 0x100000000-0x17ffffff could not reserve region
>> Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: probe with driver kmem failed with error -16
>
> I see the kmem dax messages also. It seems the kmem probe is going after
> every range (except hotplug) in the SRAT, and failing.
Yes, that's true, because the current RFC removed the code that filters out non-soft-reserved resources. As a result, it tries to register dax/kmem for all of them, even though some of them have already been marked busy in iomem_resource.
>> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> - IORES_DESC_SOFT_RESERVED);
>> - if (rc != REGION_INTERSECTS)
>> - return 0;
Here is another example from my real *CXL HOST*:
Aug 19 17:59:05 kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measuremen>
Aug 19 17:59:09 kernel: power_meter ACPI000D:00: Ignoring unsafe software power cap!
Aug 19 17:59:09 kernel: kmem dax2.0: mapping0: 0x0-0x8fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax2.0: probe with driver kmem failed with error -16
Aug 19 17:59:09 kernel: kmem dax3.0: mapping0: 0x100000000-0x86fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax3.0: probe with driver kmem failed with error -16
Aug 19 17:59:09 kernel: kmem dax4.0: mapping0: 0x870000000-0x106fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax4.0: probe with driver kmem failed with error -16
Aug 19 17:59:19 kernel: nvme nvme0: using unchecked data buffer
Aug 19 18:36:27 kernel: block nvme1n1: No UUID available providing old NGUID
lizhijian@:~$ sudo grep -w -e 106fffffff -e 870000000 -e 8fffffff -e 100000000 /proc/iomem
6fffb000-8fffffff : Reserved
100000000-10000ffff : Reserved
106ccc0000-106fffffff : Reserved
I guess this issue can be resolved by re-introducing soft_reserved_region_intersects(...).
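
A quick userspace check (a sketch, not kernel code) shows that each failing dax range overlaps one of the Reserved entries from the grep above, which would be consistent with the -16 (EBUSY) probe failures. `struct span` and `spans_overlap()` are illustrative names; the values are copied from the logs in this mail.

```c
#include <stdbool.h>
#include <stdint.h>

struct span {
	uint64_t start;
	uint64_t end;	/* inclusive, as in /proc/iomem */
};

/* True when two inclusive ranges share at least one address. */
bool spans_overlap(struct span a, struct span b)
{
	return a.start <= b.end && b.start <= a.end;
}

/* Ranges the kmem probes failed to reserve (from the journal above). */
const struct span failed_dax[] = {
	{ 0x0ULL,         0x8fffffffULL },	/* dax2.0 */
	{ 0x100000000ULL, 0x86fffffffULL },	/* dax3.0 */
	{ 0x870000000ULL, 0x106fffffffULL },	/* dax4.0 */
};

/* "Reserved" entries from the grep of /proc/iomem above. */
const struct span reserved[] = {
	{ 0x6fffb000ULL,   0x8fffffffULL },
	{ 0x100000000ULL,  0x10000ffffULL },
	{ 0x106ccc0000ULL, 0x106fffffffULL },
};
```
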
>
>> ```
>>
>> 3. When CXL_REGION is disabled, there is a failure to fallback to dax_hmem, in which case only CXL Window X is visible.
>
> Haven't tested !CXL_REGION yet.
>
>>
>> On failure:
>>
>> ```
>> 100000000-27ffffff : System RAM
>> 5c0001128-5c00011b7 : port1
>> 5c0011128-5c00111b7 : port2
>> 5d0000000-6cffffff : CXL Window 0
>> 6d0000000-7cffffff : CXL Window 1
>> 7000000000-700000ffff : PCI Bus 0000:0c
>> 7000000000-700000ffff : 0000:0c:00.0
>> 7000001080-70000010d7 : mem1
>> ```
>>
>> On success:
>>
>> ```
>> 5d0000000-7cffffff : dax0.0
>> 5d0000000-7cffffff : System RAM (kmem)
>> 5d0000000-6cffffff : CXL Window 0
>> 6d0000000-7cffffff : CXL Window 1
>> ```
>>
>> In terms of issues 1 and 2, this arises because hmem_register_device() attempts to register resources of all "HMEM devices," whereas we only need to register the IORES_DESC_SOFT_RESERVED resources. I believe resolving the current TODO will address this.
>>
>> ```
>> - rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
>> - IORES_DESC_SOFT_RESERVED);
>> - if (rc != REGION_INTERSECTS)
>> - return 0;
>> + /* TODO: insert "Soft Reserved" into iomem here */
>> ```
>
> Above makes sense.
I think the add_soft_reserved() subroutine in your previous patchset [1] can cover this TODO.
>
> I'll probably wait for an update from Smita to test again, but if you
> or Smita have anything you want me to try out on my hardware in the
> meantime, let me know.
>
Here is my local fixup based on Dan's RFC; it resolves issues 1 and 2.
-- 8< --
commit e7ccd7a01e168e185971da66f4aa13eb451caeaf
Author: Li Zhijian <lizhijian@...itsu.com>
Date:   Fri Aug 20 11:07:15 2025 +0800

    Fix probe-order TODO

    Signed-off-by: Li Zhijian <lizhijian@...itsu.com>

diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 754115da86cc..965ffc622136 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -93,6 +93,26 @@ static void process_defer_work(struct work_struct *_work)
 	walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
 }
 
+static int add_soft_reserved(resource_size_t start, resource_size_t len,
+			     unsigned long flags)
+{
+	struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);
+	int rc;
+
+	if (!res)
+		return -ENOMEM;
+
+	*res = DEFINE_RES_NAMED_DESC(start, len, "Soft Reserved",
+				     flags | IORESOURCE_MEM,
+				     IORES_DESC_SOFT_RESERVED);
+
+	rc = insert_resource(&iomem_resource, res);
+	if (rc)
+		kfree(res);
+
+	return rc;
+}
+
 static int hmem_register_device(struct device *host, int target_nid,
 				const struct resource *res)
 {
@@ -102,6 +122,10 @@ static int hmem_register_device(struct device *host, int target_nid,
 	long id;
 	int rc;
 
+	if (soft_reserve_res_intersects(res->start, resource_size(res),
+			IORESOURCE_MEM, IORES_DESC_NONE) == REGION_DISJOINT)
+		return 0;
+
 	if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
 	    region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
 			      IORES_DESC_CXL) != REGION_DISJOINT) {
@@ -119,7 +143,17 @@ static int hmem_register_device(struct device *host, int target_nid,
 		}
 	}
 
-	/* TODO: insert "Soft Reserved" into iomem here */
+	/*
+	 * This is a verified Soft Reserved region that CXL is not claiming (or
+	 * is being overridden). Add it to the main iomem tree so it can be
+	 * properly reserved by the DAX driver.
+	 */
+	rc = add_soft_reserved(res->start, res->end - res->start + 1, 0);
+	if (rc) {
+		dev_warn(host, "failed to insert soft-reserved resource %pr into iomem: %d\n",
+			 res, rc);
+		return rc;
+	}
 
 	id = memregion_alloc(GFP_KERNEL);
 	if (id < 0) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 349f0d9aad22..eca5956c444b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1069,6 +1069,8 @@ enum {
 int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
 		      unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t offset, size_t size, unsigned long flags,
+				unsigned long desc);
 /* Support for virtually mapped pages */
 struct page *vmalloc_to_page(const void *addr);
 unsigned long vmalloc_to_pfn(const void *addr);
diff --git a/kernel/resource.c b/kernel/resource.c
index b8eac6af2fad..a34b76cf690a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -461,6 +461,22 @@ int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
 					 arg, func);
 }
 EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
+
+static int __region_intersects(struct resource *parent, resource_size_t start,
+			       size_t size, unsigned long flags,
+			       unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t start, size_t size, unsigned long flags,
+				unsigned long desc)
+{
+	int ret;
+
+	read_lock(&resource_lock);
+	ret = __region_intersects(&soft_reserve_resource, start, size, flags, desc);
+	read_unlock(&resource_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(soft_reserve_res_intersects);
 #endif
 
 /*
[1] https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofield@intel.com/
> -- Alison
>
>
>>
>> Regarding issue 3 (which exists in the current situation), this could be because it cannot ensure that dax_hmem_probe() executes prior to cxl_acpi_probe() when CXL_REGION is disabled.
>>
>> I am pleased that you have pushed the patch to the cxl/for-6.18/cxl-probe-order branch, and I'm looking forward to its integration into the upstream during the v6.18 merge window.
>> Besides the current TODO, you also mentioned that this RFC PATCH must be further subdivided into several patches, so there remains significant work to be done.
>> If my understanding is correct, you would be personally continuing to push forward this patch, right?
>>
>>
>> Smita,
>>
>> Do you have any additional thoughts on this proposal from your side?
>>
>>
>> Thanks
>> Zhijian
>>
> snip
>