linux-kernel - Re: [PATCH 0/6] cxl: Initialization reworks in support Soft Reserve Recovery and Accelerator Memory


Message-ID: <6937d59160f7d_1b2e1001@dwillia2-mobl4.notmuch>
Date: Tue, 9 Dec 2025 16:53:53 +0900
From: <dan.j.williams@...el.com>
To: Alejandro Lucero Palau <alucerop@....com>, <dan.j.williams@...el.com>,
	<dave.jiang@...el.com>
CC: <linux-cxl@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<Smita.KoralahalliChannabasappa@....com>, <alison.schofield@...el.com>,
	<terry.bowman@....com>, <alejandro.lucero-palau@....com>,
	<linux-pci@...r.kernel.org>, <Jonathan.Cameron@...wei.com>, Shiju Jose
	<shiju.jose@...wei.com>
Subject: Re: [PATCH 0/6] cxl: Initialization reworks in support Soft Reserve
 Recovery and Accelerator Memory

Alejandro Lucero Palau wrote:
[..]
> If there is no CXL properly initialized, what also implies a PCI-only 
> slot, the driver can know looking at the CXL.mem and CXL.cache status in 
> the CXL control register. That is what sfc driver does now using Terry's 
> patchset instead of only checking CXL DVSEC and trying further CXL 
> initialization using the CXL core API for Type2. Neither call to create 
> cxl dev state nor memdev is needed to figure out. Of course, those calls 
> can point to another kind of problem, but the driver can find out 
> without using them.

It can, but I am not sure why a driver would want to open-code a partial
answer to that question rather than rely on the CXL core to make the
full determination.

> >> The HW will support CXL or PCI, and if
> >> CXL mem is not enabled by the firmware, likely due to a
> >> negotiation/linking problem, the driver can keep going with CXL.io.
> > Right, I think we are in violent agreement.
> >
> >> Of course, this is from my experience with sfc driver/hardware. Note
> >> sfc driver added the check for CXL availability based on Terry's v13.
> > Note that Terry's check for CXL availability is purely a hardware
> > detection, there are still software reasons why cxl_acpi and cxl_mem
> > can prevent devm_cxl_add_memdev() success.
> >
> >> But this is useful for solving the problem of module removal which can
> >> leave the type2 driver without the base for doing any unwinding. Once a
> >> type2 uses code from those other cxl modules explicitly, the problem is
> >> avoided. You seem to have forgotten about this problem, what I think it
> >> is worth to describe.
> > What problem exactly? If it needs to be captured in these changelogs or
> > code comments, let me know.
> 
> 
> It is a surprise you not remembering this ...

I did not immediately recognize this statement: "problem of module
removal which can leave the type2 driver without the base for doing any
unwinding". This set is about init time fixes, so talking about removal
threw me for a loop.

Thanks for the additional context below.

> v17 tried to fix this problem which was pointed out in v16 by you in 
> several patches.
> 
> 
> v17:
> 
> https://lore.kernel.org/linux-cxl/6887b72724173_11968100cb@dwillia2-mobl4.notmuch/
> 
> Next my reply to another comment from you trying to clarify/enumerate 
> different problems which were getting intertwined creating confusion (at 
> least to me). Sadly none did comment further, likely none read my 
> explanation ... even if I asked for it with another email and 
> specifically in one community meeting:
> 
> https://lore.kernel.org/linux-cxl/836d06d6-a36f-4ba3-b7c9-ba8687ba2190@amd.com/

So this also is about the init race, not removal, right?

This is why I think Smita's patches are a precursor to Type-2: both
need a sync-point for when platform CXL initialization has completed.

> Next discussion about trying to solve the modules removal adding a 
> callback by the driver which you did not like:
> 
> https://lore.kernel.org/linux-cxl/6892325deccdb_55f09100fb@dwillia2-xfh.jf.intel.com.notmuch/
> 

A proposal that implements what I talk about there is something like
this:

diff --git a/drivers/cxl/core/memdev.c b/drivers/cxl/core/memdev.c
index 51a07cd85c7b..a4cb6d0f0da7 100644
--- a/drivers/cxl/core/memdev.c
+++ b/drivers/cxl/core/memdev.c
@@ -641,6 +641,16 @@ static void detach_memdev(struct work_struct *work)
 	struct cxl_memdev *cxlmd;
 
 	cxlmd = container_of(work, typeof(*cxlmd), detach_work);
+
+	/*
+	 * Default to detaching the memdev, but in the case of memdev ops the
+	 * memdev creator may want to detach the parent device as well.
+	 */
+	if (cxlmd->ops && cxlmd->ops->detach) {
+		cxlmd->ops->detach(cxlmd);
+		return;
+	}
+
 	device_release_driver(&cxlmd->dev);
 	put_device(&cxlmd->dev);
 }

Where that detach implementation is something like:

void accelerator_driver_detach(struct cxl_memdev *cxlmd)
{
	device_release_driver(cxlmd->dev.parent);
	/* the above also detaches the cxlmd via devm action */
	put_device(&cxlmd->dev);
}
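
To make the control flow concrete, here is a userspace model of that
dispatch. It is only a sketch, not kernel code: struct model_memdev,
model_detach_work(), model_accel_detach() and the two "bound" flags are
made-up stand-ins for the driver-core state, and the devm action is
modeled as a plain assignment.

```c
/* Userspace model only: none of these names exist in drivers/cxl. */
#include <assert.h>
#include <stddef.h>

struct model_memdev;

struct model_memdev_ops {
	/* Optional: lets the memdev creator take over teardown. */
	void (*detach)(struct model_memdev *md);
};

struct model_memdev {
	const struct model_memdev_ops *ops;
	int memdev_bound;	/* stands in for the memdev driver binding */
	int parent_bound;	/* stands in for the accelerator driver binding */
};

/* Accelerator flavor: unbinding the parent also unbinds the memdev. */
void model_accel_detach(struct model_memdev *md)
{
	md->parent_bound = 0;
	md->memdev_bound = 0;	/* models the devm action firing */
}

const struct model_memdev_ops model_accel_ops = {
	.detach = model_accel_detach,
};

/* Same shape as detach_memdev() with the hunk above applied. */
void model_detach_work(struct model_memdev *md)
{
	assert(md != NULL);

	if (md->ops && md->ops->detach) {
		md->ops->detach(md);
		return;
	}

	md->memdev_bound = 0;	/* default: only the memdev is detached */
}
```

A memdev without ops keeps its parent bound after model_detach_work(),
while one created with model_accel_ops ends up with both unbound.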

What I am not sure about is whether the accelerator driver needs the
ability to do anything besides shut down when the CXL hierarchy is torn
down. If it does not, there is no need for a ->detach() callback: just
make it the rule that when @ops are specified, devm_cxl_add_memdev()
failure is a permanent failure at registration time and CXL hierarchy
removal also takes down the accelerator driver.
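
As a rough userspace model of that callback-free rule (again a sketch,
not kernel code): struct rule_memdev, rule_add_memdev() and
rule_hierarchy_removed() are hypothetical names, and the treatment of
the no-@ops case as "attach may happen later" is my assumption for
contrast, not something settled here.

```c
/* Userspace model only: none of these names exist in drivers/cxl. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct rule_memdev {
	const void *ops;	/* non-NULL marks an accelerator memdev */
	bool memdev_bound;	/* stands in for the memdev driver binding */
	bool parent_bound;	/* stands in for the accelerator driver binding */
};

/*
 * Registration: with @ops a failed attach is reported to the caller
 * and is final. Without @ops the model treats it as non-fatal
 * (assumption for illustration).
 */
int rule_add_memdev(struct rule_memdev *md, bool attach_ok)
{
	assert(md != NULL);

	md->memdev_bound = attach_ok;
	if (md->ops && !attach_ok)
		return -ENODEV;
	return 0;
}

/* Hierarchy teardown: the presence of @ops alone decides the scope. */
void rule_hierarchy_removed(struct rule_memdev *md)
{
	assert(md != NULL);

	md->memdev_bound = false;
	if (md->ops)
		md->parent_bound = false;
}
```

What the accelerator driver does with the registration error, for
example carrying on with CXL.io only, stays its own decision in this
model.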