Message-Id: <20260204232754.9531-1-tomasz.wolski@fujitsu.com>
Date: Thu, 5 Feb 2026 00:27:54 +0100
From: Tomasz Wolski <tomasz.wolski@...itsu.com>
To: dan.j.williams@...el.com
Cc: Smita.KoralahalliChannabasappa@....com,
alison.schofield@...el.com,
ardb@...nel.org,
benjamin.cheatham@....com,
bp@...en8.de,
dave.jiang@...el.com,
dave@...olabs.net,
gregkh@...uxfoundation.org,
huang.ying.caritas@...il.com,
ira.weiny@...el.com,
jack@...e.cz,
jeff.johnson@....qualcomm.com,
jonathan.cameron@...wei.com,
len.brown@...el.com,
linux-cxl@...r.kernel.org,
linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org,
linux-pm@...r.kernel.org,
lizhijian@...itsu.com,
ming.li@...omail.com,
nathan.fontenot@....com,
nvdimm@...ts.linux.dev,
pavel@...nel.org,
peterz@...radead.org,
rafael@...nel.org,
rrichter@....com,
skoralah@....com,
terry.bowman@....com,
tomasz.wolski@...itsu.com,
vishal.l.verma@...el.com,
willy@...radead.org,
yaoxt.fnst@...itsu.com,
yazen.ghannam@....com
Subject: Re: [PATCH v5 6/7] dax/hmem, cxl: Defer and resolve ownership of Soft Reserved memory ranges
>> > I was thinking through what Alison asked about what to do later in boot
>> > when other regions are being dynamically created. It made me wonder if
>> > this safety can be achieved more easily by just making sure that the
>> > alloc_dax_region() call fails.
>>
>> Agreed with all the points above, including making alloc_dax_region()
>> fail as the safety mechanism. This also cleanly avoids the no Soft
>> Reserved case Alison pointed out, where dax_cxl_mode can remain stuck in
>> DEFER and return -EPROBE_DEFER.
>>
>> What I’m still trying to understand is the case of “other regions being
>> dynamically created.” Once HMEM has claimed the relevant HPA range, any
>> later userspace attempts to create regions (via cxl create-region)
>> should naturally fail due to the existing HPA allocation. This already
>> shows up as an HPA allocation failure today:
>>
>> #cxl create-region -d decoder0.0 -m mem2 -w 1 -g256
>> cxl region: create_region: region0: set_size failed: Numerical result
>> out of range
>> cxl region: cmd_create_region: created 0 regions
>>
>> And in the dmesg:
>> [ 466.819353] alloc_hpa: cxl region0: HPA allocation error (-34) for
>> size:0x0000002000000000 in CXL Window 0 [mem 0x850000000-0x284fffffff
>> flags 0x200]
>>
>> Also, at this point, with the probe-ordering fixes and the use of
>> wait_for_device_probe(), region probing should have fully completed.
>>
>> Am I missing any other scenario where regions could still be created
>> dynamically beyond this?
>
>The concern is what to do about regions and memory devices that are
>completely innocent. So, for example imagine deviceA is handled by BIOS
>and deviceB is ignored by BIOS. If deviceB was ignored by BIOS then it
>would be rude to tear down any regions that might be established for
>deviceB. So if alloc_dax_region() exclusion and HPA space reservation
>prevent future collisions while not disturbing innocent devices, then I
>think userspace can pick up the pieces from there.
I'm trying to follow the idea of "deviceB being ignored by BIOS".
Do you mean hot-plug devices, or a user creating regions manually?
Could you please describe such a scenario?
>> > Something like (untested / incomplete, needs cleanup handling!)
>> >
>> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
>> > index fde29e0ad68b..fd18343e0538 100644
>> > --- a/drivers/dax/bus.c
>> > +++ b/drivers/dax/bus.c
>> > @@ -10,6 +10,7 @@
>> > #include "dax-private.h"
>> > #include "bus.h"
>> >
>> > +static struct resource dax_regions = DEFINE_RES_MEM_NAMED(0, -1, "DAX Regions");
>> > static DEFINE_MUTEX(dax_bus_lock);
>> >
>> > /*
>> > @@ -661,11 +662,7 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
>> > dax_region->dev = parent;
>> > dax_region->target_node = target_node;
>> > ida_init(&dax_region->ida);
>> > - dax_region->res = (struct resource) {
>> > - .start = range->start,
>> > - .end = range->end,
>> > - .flags = IORESOURCE_MEM | flags,
>> > - };
>> > +	dax_region->res = *__request_region(&dax_regions, range->start,
>> > +					     range_len(range), dev_name(parent),
>> > +					     flags);
>> >
>> > if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups)) {
>> > kfree(dax_region);
>> >
>> > ...which will result in enforcing only one of dax_hmem or dax_cxl being
>> > able to register a dax_region.
>> >
>> > Yes, this would leave a mess of disabled cxl_dax_region devices lying
>> > around, but it would leave more breadcrumbs for debug, and reduce the
>> > number of races you need to worry about.
>> >
>> > In other words, I thought total teardown would be simpler, but as the
>> > feedback keeps coming in, I think that brings a different set of
>> > complexity. So just inject failures for dax_cxl to trip over and then we
>> > can go further later to effect total teardown if that proves to not be
>> > enough.
>>
>> One concern with the approach of not tearing down CXL regions is the
>> state it leaves behind in /proc/iomem. Soft Reserved ranges are
>> REGISTERed to HMEM while CXL regions remain present. The resulting
>> nesting (dax under region, region under window and window under SR)
>> visually suggests a coherent CXL hierarchy, even though ownership has
>> effectively moved to HMEM. When users then attempt to tear regions down
>> and recreate them from userspace, they hit the same HPA allocation
>> failures described above.
>
>So this gets back to a question of do we really need "Soft Reserved" to
>show up in /proc/iomem? It is an ABI change to stop publishing it
>altogether, so at a minimum we need to be prepared to keep publishing it
>if it causes someone's working setup to regress.
>
>The current state of the for-7.0/cxl-init branch drops publishing "Soft
>Reserved". I am cautiously optimistic no one notices as long as DAX
>devices keep appearing, but at the first sign of regression we need a
>plan B.
>
>> If we decide not to tear down regions in the REGISTER case, should we
>> gate decoder resets during user-initiated region teardown? Today,
>> decoders are reset when regions are torn down dynamically, and
>> subsequent attempts to recreate regions can trigger a large amount of
>> mailbox traffic. Much of it shows up as repeated “Reading event logs/
>> Clearing …” messages, which end up interleaved with the HPA allocation
>> failure and can be confusing.
>
>One of the nice side effects of installing the "Soft Reserved" entries
>late, when HMEM takes over, is that they are easier to remove.
>
>So the flow would be, if you know what you are doing, is to disable the
>HMEM device which uninstalls the "Soft Reserved" entries, before trying
>to decommit the region and reclaim the HPA space.
>
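To make sure I follow, the "if you know what you are doing" flow would
look roughly like this? (Device/driver names and sysfs paths below are
my guesses, not verified against the current for-7.0/cxl-init branch;
only the cxl-cli commands are standard.)

```shell
# 1. Unbind the hmem DAX platform device so the late-installed
#    "Soft Reserved" entries are removed from /proc/iomem
#    (device and driver names are illustrative)
echo hmem0.0 > /sys/bus/platform/drivers/hmem_platform/unbind

# 2. With the Soft Reserved claim gone, decommit the region and
#    reclaim the HPA space
cxl disable-region region0
cxl destroy-region region0
```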