Message-ID: <697aad9546542_30951007c@dwillia2-mobl4.notmuch>
Date: Wed, 28 Jan 2026 16:45:09 -0800
From: <dan.j.williams@...el.com>
To: "Koralahalli Channabasappa, Smita" <skoralah@....com>, Alison Schofield
	<alison.schofield@...el.com>
CC: Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>,
	<linux-cxl@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<nvdimm@...ts.linux.dev>, <linux-fsdevel@...r.kernel.org>,
	<linux-pm@...r.kernel.org>, Ard Biesheuvel <ardb@...nel.org>, Vishal Verma
	<vishal.l.verma@...el.com>, Ira Weiny <ira.weiny@...el.com>, Dan Williams
	<dan.j.williams@...el.com>, Jonathan Cameron <jonathan.cameron@...wei.com>,
	Yazen Ghannam <yazen.ghannam@....com>, Dave Jiang <dave.jiang@...el.com>,
	Davidlohr Bueso <dave@...olabs.net>, Matthew Wilcox <willy@...radead.org>,
	Jan Kara <jack@...e.cz>, "Rafael J . Wysocki" <rafael@...nel.org>, Len Brown
	<len.brown@...el.com>, Pavel Machek <pavel@...nel.org>, Li Ming
	<ming.li@...omail.com>, Jeff Johnson <jeff.johnson@....qualcomm.com>, "Ying
 Huang" <huang.ying.caritas@...il.com>, Yao Xingtao <yaoxt.fnst@...itsu.com>,
	Peter Zijlstra <peterz@...radead.org>, Greg Kroah-Hartman
	<gregkh@...uxfoundation.org>, Nathan Fontenot <nathan.fontenot@....com>,
	Terry Bowman <terry.bowman@....com>, Robert Richter <rrichter@....com>,
	Benjamin Cheatham <benjamin.cheatham@....com>, Zhijian Li
	<lizhijian@...itsu.com>, Borislav Petkov <bp@...en8.de>, Tomasz Wolski
	<tomasz.wolski@...itsu.com>
Subject: Re: [PATCH v5 6/7] dax/hmem, cxl: Defer and resolve ownership of Soft
 Reserved memory ranges

Koralahalli Channabasappa, Smita wrote:
> Hi Alison,
> 
> On 1/26/2026 2:33 PM, Alison Schofield wrote:
> > On Mon, Jan 26, 2026 at 01:05:47PM -0800, Koralahalli Channabasappa, Smita wrote:
> >> Hi Alison,
> >>
> >> On 1/22/2026 10:35 PM, Alison Schofield wrote:
> >>> On Thu, Jan 22, 2026 at 04:55:42AM +0000, Smita Koralahalli wrote:
> >>>> The current probe time ownership check for Soft Reserved memory based
> >>>> solely on CXL window intersection is insufficient. dax_hmem probing is not
> >>>> always guaranteed to run after CXL enumeration and region assembly, which
> >>>> can lead to incorrect ownership decisions before the CXL stack has
> >>>> finished publishing windows and assembling committed regions.
> >>>>
> >>>> Introduce deferred ownership handling for Soft Reserved ranges that
> >>>> intersect CXL windows at probe time by scheduling deferred work from
> >>>> dax_hmem and waiting for the CXL stack to complete enumeration and region
> >>>> assembly before deciding ownership.
> >>>>
> >>>> Evaluate ownership of Soft Reserved ranges based on CXL region
> >>>> containment.
> >>>>
> >>>>      - If all Soft Reserved ranges are fully contained within committed CXL
> >>>>        regions, DROP handling Soft Reserved ranges from dax_hmem and allow
> >>>>        dax_cxl to bind.
> >>>>
> >>>>      - If any Soft Reserved range is not fully claimed by committed CXL
> >>>>        region, tear down all CXL regions and REGISTER the Soft Reserved
> >>>>        ranges with dax_hmem instead.
> >>>>
> >>>> While ownership resolution is pending, gate dax_cxl probing to avoid
> >>>> binding prematurely.
> >>>
> >>> This patch is the point in the set where I begin to fail creating DAX
> >>> regions on my non soft-reserved platforms.
> >>>
> >>> Before this patch, at region probe, devm_cxl_add_dax_region(cxlr) succeeded
> >>> without delay, but now those calls result in EPROBE DEFER.
> >>>
> >>> That deferral is wanted for platforms with Soft Reserveds, but for
> >>> platforms without, those probes will never resume.
> >>>
> >>> IIUC this will impact platforms without SRs, not just my test setup.
> >>> In my testing it's visible during both QEMU and cxl-test region creation.
> >>>
> >>> Can we abandon this whole deferral scheme if there is nothing in the
> >>> new soft_reserved resource tree?
> >>>
> >>> Or maybe another way to get the dax probes UN-deferred in this case?
> >>
> >> Thanks for pointing this. I didn't think through this.
> >>
> >> I was thinking to make the deferral conditional on HMEM actually observing a
> >> CXL-overlapping range. Rough flow:
> >>
> >> One assumption I'm relying on here is that dax_hmem and "initial"
> >> hmem_register_device() walk happens before dax_cxl probes. If that
> >> assumption doesn’t hold this approach may not be sufficient.
> >>
> >> 1. Keep dax_cxl_mode default as DEFER as it is now in dax/bus.c
> >> 2. Introduce need_deferral flag initialized to false in dax/bus.c
> >> 3. During the initial dax_hmem walk, in hmem_register_device() if HMEM
> >> observes SR that intersects IORES_DESC_CXL, set a need_deferral flag and
> >> schedule the deferred work. (case DEFER)
> >> 4. In dax_cxl probe: only return -EPROBE_DEFER when dax_cxl_mode == DEFER
> >> and need_deferral is set, otherwise proceed with cxl_dax.
> >>
> >> Please call out if you see issues with this approach (especially around the
> >> ordering assumption).
> > 
> > 
> > A quick thought to share -
> > 
> > Will the 'need_deferral' flag be cleared when all deferred work is
> > done, so that case 2) below can succeed:
> 
> My thinking was that we don’t strictly need to clear need_deferral as 
> long as dax_cxl_mode is the actual gate. need_deferral would only be set 
> when HMEM observes an SR range intersecting IORES_DESC_CXL, and after 
> the deferred work runs we should always transition dax_cxl_mode from 
> DEFER to either DROP or REGISTER. At that point dax_cxl won’t return 
> EPROBE_DEFER anymore regardless of the flag value.
> 
> I also had a follow-up thought: rather than a separate need_deferral 
> flag, we could make this explicit in the mode enum. For example, keep 
> DEFER as the default, and when hmem_register_device() first observes a 
> SR and CXL intersection, transition the mode from DEFER to something 
> like NEEDS_CHANGE. Then dax_cxl would only return -EPROBE_DEFER in the 
> NEEDS_CHANGE state, and once the deferred work completes it would move 
> the mode to DROP or REGISTER.
> 
> Please correct me if I’m missing a case where dax_cxl_mode could remain 
> DEFER even after setting the flag.

Recall that DAX_CXL_MODE_DEFER governs how arriving
hmem_register_device() events are handled. Until CXL comes up those all get deferred
to the workqueue. When the workqueue runs it flushes both the
hmem_register_device() events from probing the HMAT and
cxl_region_probe() events from initial PCI discovery.

The deferral never needs to be visible to cxl_dax_region_probe() outside
of knowing that the workqueue has had a chance to flush at least once.
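
As an illustration only, here is a small userspace model of that
arrangement. It is a sketch, not kernel code: struct defer_model and the
model_*() helpers are hypothetical names invented for this example, and
the real workqueue, locking, and device model are left out. It assumes
events are simply queued until a single flush, and that the only thing
the probe side asks is whether a flush has happened at least once.

```c
/* Userspace model only; every name here is hypothetical, not kernel API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 16

struct sr_range {
	unsigned long long start, end;	/* inclusive bounds */
};

struct defer_model {
	struct sr_range pending[MAX_PENDING];	/* events seen before CXL is up */
	size_t nr_pending;
	size_t nr_handled;	/* events processed by a flush */
	bool flushed_once;	/* the only state the probe side looks at */
};

/* Queue an event while in "defer" mode; returns false if the queue is full. */
static bool model_hmem_register(struct defer_model *m, struct sr_range r)
{
	assert(r.start <= r.end);
	if (m->nr_pending == MAX_PENDING)
		return false;
	m->pending[m->nr_pending++] = r;
	return true;
}

/* Stand-in for the deferred work: drain everything queued so far. */
static void model_flush(struct defer_model *m)
{
	m->nr_handled += m->nr_pending;
	m->nr_pending = 0;
	m->flushed_once = true;
}

/* The probe side never inspects the queue, only whether a flush happened. */
static bool model_dax_probe_may_proceed(const struct defer_model *m)
{
	return m->flushed_once;
}
```

In this model a platform with no Soft Reserved ranges still unblocks the
probe side, because model_flush() sets flushed_once even when nothing
was queued.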

If we go with the alloc_dax_region() observation in my other mail it
means that the HPA space will already be claimed and
cxl_dax_region_probe() will fail. If we can get to that point of "all
HMEM registered, and all CXL regions failing to attach their
cxl_dax_region devices" that is a good stopping point. Then we can decide
if a follow-on patch is needed to clean up that state
(cxl_region_teardown_all()), or if it can just idle in that messy state
and wait for userspace to clean up if it wants.
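
To make the "already claimed" outcome concrete, here is a minimal
userspace sketch. It assumes the claim behaves like an exclusive range
reservation where the first claimer wins and any overlapping later
claim is refused. That is an assumption for illustration; struct
hpa_space and hpa_claim_range() are hypothetical names, not the kernel
resource API, and the -EBUSY return value is chosen for the example.

```c
/* Userspace model only; hypothetical names, not the kernel resource API. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_CLAIMS 8

struct hpa_claim {
	unsigned long long start, end;	/* inclusive bounds */
	const char *owner;
};

struct hpa_space {
	struct hpa_claim claims[MAX_CLAIMS];
	size_t nr;
};

/*
 * Claim [start, end] for owner. Returns 0 on success, -EBUSY if the range
 * overlaps an existing claim, -ENOMEM if the table is full.
 */
static int hpa_claim_range(struct hpa_space *s, unsigned long long start,
			   unsigned long long end, const char *owner)
{
	assert(start <= end);
	for (size_t i = 0; i < s->nr; i++) {
		/* Two inclusive ranges overlap when each starts before the other ends. */
		if (start <= s->claims[i].end && s->claims[i].start <= end)
			return -EBUSY;
	}
	if (s->nr == MAX_CLAIMS)
		return -ENOMEM;
	s->claims[s->nr++] = (struct hpa_claim){ start, end, owner };
	return 0;
}
```

If the HMEM side claims a range first, a later claim for an overlapping
range on behalf of a CXL region is refused in this model, which
corresponds to the "all HMEM registered, CXL regions failing to attach"
stopping point described above. A non-overlapping range can still be
claimed.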

We might want an error message that points people to report their failing
system configuration to the linux-cxl@ mailing list.
