Message-Id: <20251230142736.1168-1-guojinhui.liam@bytedance.com>
Date: Tue, 30 Dec 2025 22:27:36 +0800
From: "Jinhui Guo" <guojinhui.liam@...edance.com>
To: <tj@...nel.org>
Cc: <alexander.h.duyck@...ux.intel.com>, <bhelgaas@...gle.com>,
<bvanassche@....org>, <dan.j.williams@...el.com>,
<gregkh@...uxfoundation.org>, <guojinhui.liam@...edance.com>,
<linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>,
<stable@...r.kernel.org>
Subject: Re: [PATCH] PCI: Avoid work_on_cpu() in async probe workers
On Mon, Dec 29, 2025 at 08:08:57AM -1000, Tejun Heo wrote:
> On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> > To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> > already running inside an unbounded asynchronous worker. Because a driver
> > can be probed asynchronously either by probe_type or by the kernel command
> > line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> > the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> > executing within an unbounded workqueue worker and should skip the extra
> > work_on_cpu() call.
>
> Why not just use queue_work_on() on system_dfl_wq (or any other unbound
> workqueue)? Those are soft-affine to cache domain but can overflow to other
> CPUs?
Hi Tejun,
Thank you for your time and helpful suggestions.
I had considered replacing work_on_cpu() with queue_work_on(system_dfl_wq) +
flush_work(), but that would be a refactor rather than a fix for the specific
problem we hit.
Let me restate the issue:
1. With PROBE_PREFER_ASYNCHRONOUS enabled, the driver core queues probe work
on async_wq to speed up driver probing.
2. The PCI core then calls work_on_cpu() to run the probe on a CPU in the PCI
device's NUMA node, but cpumask_any_and() deterministically picks the same
CPU for every device on that node, so all those probes run serially on one CPU.
Therefore I test current->flags & PF_WQ_WORKER to detect that pci_call_probe()
is already running inside a workqueue worker, and in that case skip queuing a
second, nested work item.
I agree with your point: using an unbound workqueue plus flush_work() would be
cleaner and would let different vendors' drivers probe in parallel instead of
fighting over the same CPU. I've prepared and tested another patch, but I'm
still unsure it's the better approach; any further suggestions would be
greatly appreciated.
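For intuition, the queue-and-flush pattern used here can be mimicked in
userspace: bundle the probe arguments and result into an on-stack struct, hand
it to a worker, and wait for completion before the stack frame goes away. The
sketch below is only an illustrative analogue, not kernel code: it uses
pthreads instead of workqueues, and every name in it (probe_work, fake_probe,
run_probe_via_worker) is invented for this example.

```c
#include <pthread.h>
#include <stddef.h>

/*
 * Userspace analogue of the on-stack work pattern: the kernel patch
 * uses struct work_struct with queue_work_node() and flush_work();
 * here a pthread plays the role of the workqueue worker.
 */
struct probe_work {
	int (*probe_fn)(void *arg);	/* stands in for local_pci_probe() */
	void *arg;
	int result;
};

/* Worker body, analogous to pci_probe_work_func(). */
static void *probe_worker(void *data)
{
	struct probe_work *pw = data;

	pw->result = pw->probe_fn(pw->arg);
	return NULL;
}

/* Dummy probe routine that reports success with a recognizable value. */
static int fake_probe(void *arg)
{
	(void)arg;
	return 42;
}

/*
 * "Queue" the work and wait for it; pthread_join() plays the role of
 * flush_work(). Returns the probe result, like pci_call_probe() does.
 */
int run_probe_via_worker(void)
{
	struct probe_work pw = { .probe_fn = fake_probe, .arg = NULL, .result = -1 };
	pthread_t tid;

	if (pthread_create(&tid, NULL, probe_worker, &pw))
		return -1;
	pthread_join(&tid, NULL);	/* wait before pw leaves scope */
	return pw.result;
}
```

The on-stack struct is safe only because the caller always flushes (joins)
before returning, which is exactly why the kernel patch pairs
INIT_WORK_ONSTACK() with flush_work() and destroy_work_on_stack().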
Test results for that patch:
nvme 0000:01:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 34904955 ns
nvme 0000:02:00.0: CPU: 134, COMM: kworker/u1025:1, probe cost: 34774235 ns
nvme 0000:03:00.0: CPU: 1, COMM: kworker/u1025:4, probe cost: 34573054 ns
Key changes in the patch:
1. Keep the current->flags & PF_WQ_WORKER test to avoid nesting workers.
2. Replace work_on_cpu() with queue_work_node() on system_dfl_wq plus
flush_work(), which keeps NUMA locality while enabling parallel probing when
PROBE_PREFER_ASYNCHRONOUS is disabled.
3. Remove all cpumask operations.
4. Drop cpu_hotplug_disable()/cpu_hotplug_enable(), since both the cpumask
manipulation and work_on_cpu() are gone.
The patch is shown below.
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 7c2d9d5962586..e66a67c48f28d 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -347,10 +347,24 @@ static bool pci_physfn_is_probed(struct pci_dev *dev)
 #endif
 }
 
+struct pci_probe_work {
+	struct work_struct work;
+	struct drv_dev_and_id ddi;
+	int result;
+};
+
+static void pci_probe_work_func(struct work_struct *work)
+{
+	struct pci_probe_work *pw = container_of(work, struct pci_probe_work, work);
+
+	pw->result = local_pci_probe(&pw->ddi);
+}
+
 static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 			  const struct pci_device_id *id)
 {
-	int error, node, cpu;
+	int error, node;
+	struct pci_probe_work pw;
 	struct drv_dev_and_id ddi = { drv, dev, id };
 
 	/*
@@ -361,38 +375,25 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	node = dev_to_node(&dev->dev);
 	dev->is_probed = 1;
 
-	cpu_hotplug_disable();
-
 	/*
 	 * Prevent nesting work_on_cpu() for the case where a Virtual Function
 	 * device is probed from work_on_cpu() of the Physical device.
 	 */
 	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
-	    pci_physfn_is_probed(dev)) {
-		cpu = nr_cpu_ids;
-	} else {
-		cpumask_var_t wq_domain_mask;
-
-		if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
-			error = -ENOMEM;
-			goto out;
-		}
-		cpumask_and(wq_domain_mask,
-			    housekeeping_cpumask(HK_TYPE_WQ),
-			    housekeeping_cpumask(HK_TYPE_DOMAIN));
-
-		cpu = cpumask_any_and(cpumask_of_node(node),
-				      wq_domain_mask);
-		free_cpumask_var(wq_domain_mask);
+	    pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
+		error = local_pci_probe(&ddi);
+		goto out;
 	}
 
-	if (cpu < nr_cpu_ids)
-		error = work_on_cpu(cpu, local_pci_probe, &ddi);
-	else
-		error = local_pci_probe(&ddi);
+	INIT_WORK_ONSTACK(&pw.work, pci_probe_work_func);
+	pw.ddi = ddi;
+	queue_work_node(node, system_dfl_wq, &pw.work);
+	flush_work(&pw.work);
+	error = pw.result;
+	destroy_work_on_stack(&pw.work);
+
 out:
 	dev->is_probed = 0;
-	cpu_hotplug_enable();
 	return error;
 }
 
Best Regards,
Jinhui