[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6700836317627_964f2294f2@dwillia2-xfh.jf.intel.com.notmuch>
Date: Fri, 4 Oct 2024 17:08:03 -0700
From: Dan Williams <dan.j.williams@...el.com>
To: Gregory Price <gourry@...rry.net>, <linux-cxl@...r.kernel.org>
CC: <linux-kernel@...r.kernel.org>, <dan.j.williams@...el.com>,
<Jonathan.Cameron@...wei.com>, <dave.jiang@...el.com>,
<alison.schofield@...el.com>, <vishal.l.verma@...el.com>,
<ira.weiny@...el.com>, <rrichter@....com>, <terry.bowman@....com>,
<dave@...olabs.net>
Subject: Re: [PATCH v2] cxl/core/port: defer endpoint probes when ACPI likely
hasn't finished
Gregory Price wrote:
> In cxl_acpi_probe, we add dports and uports to host bridges iteratively:
> - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_dport);
> - bus_for_each_dev(adev->dev.bus, NULL, root_port, add_host_bridge_uport);
>
> Simultaneously, as ports are probed, memdev endpoints can also be
> probed. This creates a race condition, where an endpoint can perceive
> its path to the root being broken in devm_cxl_enumerate_ports.
>
> The memdev/endpoint probe will see a heirarchy that may look like:
> mem1
> parent => 0000:c1:00.0
> parent => 0000:c0:01.1
> parent->parent => NULL
>
> This results in find_cxl_port() returning NULL (since the port hasn't
> been associated with the host bridge yet), and add_port_attach_ep
> fails because the grandparent's grandparent is NULL.
>
> When the latter condition is detected, the comments suggest:
> /*
> * The iteration reached the topology root without finding the
> * CXL-root 'cxl_port' on a previous iteration, fail for now to
> * be re-probed after platform driver attaches.
> */
>
> This case results in an -ENXIO; however, a re-probe never occurs. Change
> this return condition to -EPROBE_DEFER to explicitly cause a reprobe.
Ok, thanks for the additional debug. Like we chatted on the CXL Discord
I think this is potentially pointing to a bug in bus_rescan_devices()
where it checks dev->driver without holding the lock.
Can you give this fix a try to see if it also resolves the issue?
Effectively, cxl_bus_rescan() is always needed in case the cxl_acpi
driver loads waaaay after deferred probing has given up, and if this
works then EPROBE_DEFER can remain limited to cases where it is
absolutely known that no other device_attach() kick is coming to save
the day.
-- 8< --
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 1d5007e3795a..6c0cd94888a3 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -2088,11 +2088,18 @@ static void cxl_bus_remove(struct device *dev)
static struct workqueue_struct *cxl_bus_wq;
-static void cxl_bus_rescan_queue(struct work_struct *w)
+static int attach_device(struct device *dev, void *data)
{
- int rc = bus_rescan_devices(&cxl_bus_type);
+ int rc = device_attach(dev);
+
+ dev_vdbg(dev, "rescan: %s\n", rc ? "attach" : "detached");
- pr_debug("CXL bus rescan result: %d\n", rc);
+ return 0;
+}
+
+static void cxl_bus_rescan_queue(struct work_struct *w)
+{
+ bus_for_each_dev(&cxl_bus_type, NULL, NULL, attach_device);
}
void cxl_bus_rescan(void)
Powered by blists - more mailing lists