Message-ID: <6930b447c48d6_198110029@dwillia2-mobl4.notmuch>
Date: Wed, 3 Dec 2025 14:05:59 -0800
From: <dan.j.williams@...el.com>
To: Tomasz Wolski <tomasz.wolski@...itsu.com>, <alison.schofield@...el.com>
CC: <Smita.KoralahalliChannabasappa@....com>, <ardb@...nel.org>,
<benjamin.cheatham@....com>, <bp@...en8.de>, <dan.j.williams@...el.com>,
<dave.jiang@...el.com>, <dave@...olabs.net>, <gregkh@...uxfoundation.org>,
<huang.ying.caritas@...il.com>, <ira.weiny@...el.com>, <jack@...e.cz>,
<jeff.johnson@....qualcomm.com>, <jonathan.cameron@...wei.com>,
<len.brown@...el.com>, <linux-cxl@...r.kernel.org>,
<linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<linux-pm@...r.kernel.org>, <lizhijian@...itsu.com>, <ming.li@...omail.com>,
<nathan.fontenot@....com>, <nvdimm@...ts.linux.dev>, <pavel@...nel.org>,
<peterz@...radead.org>, <rafael@...nel.org>, <rrichter@....com>,
<terry.bowman@....com>, <vishal.l.verma@...el.com>, <willy@...radead.org>,
<yaoxt.fnst@...itsu.com>, <yazen.ghannam@....com>
Subject: Re: [PATCH v4 0/9] dax/hmem, cxl: Coordinate Soft Reserved handling
with CXL and HMEM
Tomasz Wolski wrote:
[..]
>
> Hello Smita, Alison
>
> I did some testing and came across issues with probe order so I applied the
> three patches mentioned by Smita + fix for the NULL dereference.
> I noticed issues in scenario 3.1 and 4 below but maybe they are related to
> the test setup:
BTW, thanks for all these tests, it helps!
> [1] QEMU: 1 CFMWS + Host-bridge + 1 CXL device
> Soft reserve in not seen in the iomem:
>
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
>
> kernel: [ 0.000000][ T0] BIOS-e820: [mem 0x0000000a90000000-0x0000000b8fffffff] soft reserved
>
> == region teardown
> a90000000-b8fffffff : CXL Window 0
> // no dax devices
>
> == region recreate
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
>
> == booted with no PCI attached
> a90000000-b8fffffff : Soft Reserved
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : dax1.0
> a90000000-b8fffffff : System RAM (kmem)
So this is the expected behavior with the proposal: if the device and
memory are present at boot, but the driver is disabled or fails to
assemble, then the solution falls back to "region-less" dax.
> == ..and hot plug via QEMU terminal => is the following iomem tree expected?
> a90000000-b8fffffff : Soft Reserved
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax1.0
> a90000000-b8fffffff : System RAM (kmem)
Unless I am missing something, this looks like a bug in the test because
if you are truly hot-adding the device after boot, then the BIOS would
have had no reason/chance to create that Soft Reserved entry. The
assumption is that the presence of Soft Reserved always implies that it
was mapping physical hardware that was present at boot.
> kernel: [ 129.820136][ T65] cxl_acpi ACPI0017:00: decoder0.0: created region0
> ..
> kernel: [ 129.827126][ T65] cxl_region region0: [mem 0xa90000000-0xb8fffffff flags 0x200] has System RAM: [mem 0xa90000000-0xb8fffffff flags 0x83000200]
>
> [1.1] QEMU: 1 CFMWS + Host-bridge + 1 CXL device
> Region is smaller than SR - hmem claims the space
The expectation is that a configuration like this is out of scope.
Unless CXL fully covers a Soft Reserved entry it must assume that the
platform is doing something custom / special and disable the CXL
subsystem in its entirety.

The simplifying hope is that there is always a 1:1 correlation between
CXL Region and ACPI SRAT/HMAT range entries such that a Soft Reserved
resource is never misaligned to a CXL region.

Hmm, this might highlight a gap in the implementation. I think we need
to make sure that drivers/acpi/numa/hmat.c::alloc_memory_target()
injects boundaries into soft_reserve_resource. I.e. I think a BIOS might
create a merged EFI memory map entry that spans multiple SRAT/HMAT
ranges in the same proximity domain.

It would be lovely to require BIOS to bound their descriptions on CXL
region boundaries.
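As a rough illustration of what "injecting boundaries" would mean, here
is a userspace sketch in Python. It is not kernel code and not the
actual hmat.c logic. The helper name split_at_boundaries() and the
inclusive (start, end) tuple layout are made up for the example, and the
two SRAT/HMAT ranges in the usage lines are assumed, not read from any
real table. It splits one merged Soft Reserved range wherever a memory
target starts or ends inside it:

```python
def split_at_boundaries(merged, targets):
    """Split an inclusive (start, end) range at memory-target edges.

    merged:  one merged Soft Reserved range, e.g. from the EFI memory map.
    targets: inclusive (start, end) ranges, standing in for SRAT/HMAT
             memory targets (hypothetical input for this sketch).
    """
    start, end = merged
    # Work with half-open cut points so adjacent pieces share an edge.
    cuts = {start, end + 1}
    for t_start, t_end in targets:
        for edge in (t_start, t_end + 1):
            # Only edges strictly inside the merged range create a split.
            if start < edge <= end:
                cuts.add(edge)
    edges = sorted(cuts)
    # Convert back to inclusive ranges.
    return [(lo, hi - 1) for lo, hi in zip(edges, edges[1:])]


# Merged entry as in the 2-device QEMU case, with two assumed targets.
pieces = split_at_boundaries(
    (0xa90000000, 0xc8fffffff),
    [(0xa90000000, 0xb8fffffff), (0xb90000000, 0xc8fffffff)],
)
for lo, hi in pieces:
    print(f"{lo:x}-{hi:x} : Soft Reserved")
```

With no targets the merged range comes back unchanged, which matches
the case where the platform really does describe a single range.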
> a90000000-bcfffffff : Soft Reserved
> a90000000-bcfffffff : CXL Window 0
> a90000000-bcfffffff : dax1.0
> a90000000-bcfffffff : System RAM (kmem)
>
> [2] QEMU: 1 CFMWS + Host-bridge + 2 CXL devices
>
> kernel: [ 0.000000][ T0] BIOS-e820: [mem 0x0000000a90000000-0x0000000c8fffffff] soft reserved
>
> a90000000-c8fffffff : CXL Window 0
> a90000000-b8fffffff : region1
> a90000000-b8fffffff : dax1.0
> a90000000-b8fffffff : System RAM (kmem)
> b90000000-c8fffffff : region0
> b90000000-c8fffffff : dax0.0
> b90000000-c8fffffff : System RAM (kmem)
Wait, you have a CXL region that partially overlaps a Soft Reserved
range? That does not look like a configuration the subsystem could ever
support, and it should fall back to disabling CXL.
> == region1 teardown
> a90000000-c8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
>
> == recreate region1 - created in correct address range
>
> a90000000-c8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
> b90000000-c8fffffff : region1
> b90000000-c8fffffff : dax1.0
> b90000000-c8fffffff : System RAM (kmem)
>
> [2.1] QEMU: 1 CFMWS + Host-bridge + 2 CXL devices
> Region is smaller than SR - hmem claims the whole space
>
> kernel: [ 0.000000][ T0] BIOS-e820: [mem 0x0000000a90000000-0x0000000ccfffffff] soft reserved
>
> a90000000-ccfffffff : Soft Reserved
> a90000000-ccfffffff : CXL Window 0
> a90000000-ccfffffff : dax1.0
> a90000000-ccfffffff : System RAM (kmem)
>
> [3] QEMU: 2 CFMWS + Host-bridge + 2 CXL devices
>
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
> b90000000-c8fffffff : CXL Window 1
> b90000000-c8fffffff : region1
> b90000000-c8fffffff : dax1.0
> b90000000-c8fffffff : System RAM (kmem)
>
> == Tearing down region 1
>
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
> b90000000-c8fffffff : CXL Window 1
>
> == Recreate region 1
> a90000000-b8fffffff : CXL Window 0
> a90000000-b8fffffff : region0
> a90000000-b8fffffff : dax0.0
> a90000000-b8fffffff : System RAM (kmem)
> b90000000-c8fffffff : CXL Window 1
> b90000000-c8fffffff : region1
> b90000000-c8fffffff : dax1.0
> b90000000-c8fffffff : System RAM (kmem)
>
> [3.1] QEMU: 2 CFMWS + Host-bridge + 2 CXL devices
> Region does not span whole CXL Window - hmem should claim the whole space, but kmem failed with EBUSY
>
> a90000000-ccfffffff : Soft Reserved
> a90000000-bcfffffff : CXL Window 0
> bd0000000-ccfffffff : CXL Window 1
Again, we do not expect that a real world BIOS would ever present this.
It might be the case that there is a single EFI entry that covers
a90000000-ccfffffff, but the expectation is that SRAT would have
separate entries for a90000000-bcfffffff and bd0000000-ccfffffff so that
everything lines up.

For simplicity I want the fallback to be all or nothing because either
there is full confidence that the CXL Subsystem understands the
configuration, or there is zero confidence. Leave no room for complex
"partial assembly" configurations to debug.
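To spell out the all-or-nothing rule, a minimal Python sketch follows.
It is one possible reading of the policy, not the patch set's code: the
function name cxl_may_claim_soft_reserved() is hypothetical, ranges are
inclusive (start, end) tuples, and "understands the configuration" is
simplified here to "every Soft Reserved range is matched exactly by one
assembled CXL region".

```python
def cxl_may_claim_soft_reserved(soft_reserved, regions):
    """Return True only if every Soft Reserved range has an exact match.

    soft_reserved: inclusive (start, end) ranges, assumed to be already
                   split on SRAT/HMAT boundaries.
    regions:       inclusive (start, end) ranges of assembled CXL regions.

    A single unmatched or misaligned range fails the whole check; there
    is deliberately no "partially claimed" result.
    """
    assembled = set(regions)
    return all(sr in assembled for sr in soft_reserved)


# Ranges from the 2-window case above.
window0 = (0xa90000000, 0xbcfffffff)
window1 = (0xbd0000000, 0xccfffffff)

# One region failed to assemble, so nothing is handed to CXL.
owner = "cxl" if cxl_may_claim_soft_reserved([window0, window1], [window0]) else "hmem"
print(owner)
```

Returning a single boolean for the whole set, rather than a per-range
answer, is what keeps "partial assembly" out of the picture.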
[..]
>
> [4] Physical machine: 2 CFMWS + Host-bridge + 2 CXL devices
>
> kernel: BIOS-e820: [mem 0x0000002070000000-0x000000a06fffffff] soft reserved
>
> 2070000000-606fffffff : CXL Window 0
> 2070000000-606fffffff : region0
> 2070000000-606fffffff : dax0.0
> 2070000000-606fffffff : System RAM (kmem)
> 6070000000-a06fffffff : CXL Window 1
> 6070000000-a06fffffff : region1
> 6070000000-a06fffffff : dax1.0
> 6070000000-a06fffffff : System RAM (kmem)
Ok, so a real world machine that creates a merged
0x0000002070000000-0x000000a06fffffff range. Can you confirm that the
SRAT has separate entries for those ranges? Otherwise, we need to rethink
how to keep this fallback algorithm simple and predictable.
> kernel: BIOS-e820: [mem 0x0000002070000000-0x000000a06fffffff] soft reserved
>
> == region 1 teardown and unplug (the unplug was done via ubind/remove in /sys/bus/pci/devices)
Note that you need to explicitly destroy the region for the physical
removal case. Otherwise, decoders stay committed throughout the
hierarchy. Simple unbind / PCI device removal does not manage CXL
decoders.
>
> 2070000000-606fffffff : CXL Window 0
> 2070000000-606fffffff : region0
> 2070000000-606fffffff : dax0.0
> 2070000000-606fffffff : System RAM (kmem)
> 6070000000-a06fffffff : CXL Window 1
>
> == plug - after PCI rescan cannot create hmem
> 6070000000-a06fffffff : CXL Window 1
> 6070000000-a06fffffff : region1
>
> kernel: cxl_region region1: config state: 0
> kernel: cxl_acpi ACPI0017:00: decoder0.1: created region1
> kernel: cxl_pci 0000:04:00.0: mem1:decoder10.0: __construct_region region1 res: [mem 0x6070000000-0xa06fffffff flags 0x200] iw: 1 ig: 4096
> kernel: cxl_mem mem1: decoder:decoder10.0 parent:0000:04:00.0 port:endpoint10 range:0x6070000000-0xa06fffffff pos:0
> kernel: cxl region1: region sort successful
> kernel: cxl region1: mem1:endpoint10 decoder10.0 add: mem1:decoder10.0 @ 0 next: none nr_eps: 1 nr_targets: 1
> kernel: cxl region1: pci0000:00:port2 decoder2.1 add: mem1:decoder10.0 @ 0 next: mem1 nr_eps: 1 nr_targets: 1
> kernel: cxl region1: pci0000:00:port2 cxl_port_setup_targets expected iw: 1 ig: 4096 [mem 0x6070000000-0xa06fffffff flags 0x200]
> kernel: cxl region1: pci0000:00:port2 cxl_port_setup_targets got iw: 1 ig: 256 state: disabled 0x6070000000:0xa06fffffff
Did the device get reset in the process? This looks like decoders
bounced in an inconsistent fashion from unplug to replug and
autodiscovery.