Message-Id: <20251230142736.1168-1-guojinhui.liam@bytedance.com>
Date: Tue, 30 Dec 2025 22:27:36 +0800
From: "Jinhui Guo" <guojinhui.liam@...edance.com>
To: <tj@...nel.org>
Cc: <alexander.h.duyck@...ux.intel.com>, <bhelgaas@...gle.com>,
<bvanassche@....org>, <dan.j.williams@...el.com>,
<gregkh@...uxfoundation.org>, <guojinhui.liam@...edance.com>,
<linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>,
<stable@...r.kernel.org>
Subject: Re: [PATCH] PCI: Avoid work_on_cpu() in async probe workers
On Mon, Dec 29, 2025 at 08:08:57AM -1000, Tejun Heo wrote:
> On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> > To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> > already running inside an unbounded asynchronous worker. Because a driver
> > can be probed asynchronously either by probe_type or by the kernel command
> > line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> > the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> > executing within an unbounded workqueue worker and should skip the extra
> > work_on_cpu() call.
>
> Why not just use queue_work_on() on system_dfl_wq (or any other unbound
> workqueue)? Those are soft-affine to cache domain but can overflow to other
> CPUs?
Hi Tejun,
Thank you for your time and helpful suggestions.
I had considered replacing work_on_cpu() with queue_work_on(system_dfl_wq) +
flush_work(), but that would be a refactor rather than a fix for the specific
problem we hit.
Let me restate the issue:
1. With PROBE_PREFER_ASYNCHRONOUS enabled, the driver core queues probe work
on async_wq to speed up driver probing.
2. The PCI core then calls work_on_cpu() to run the probe on a CPU in the PCI
device's NUMA node, but cpumask_any_and() deterministically picks the same
CPU for every device on that node, so all those probes run serially on one CPU.
Therefore I test current->flags & PF_WQ_WORKER to detect that pci_call_probe()
is already running inside a workqueue worker, and in that case skip queuing a
second, nested work item.
I agree with your point: using an unbound workqueue plus flush_work() would be
cleaner and would let different vendors' drivers probe in parallel instead of
fighting over the same CPU. I've prepared and tested another patch, but I'm
still unsure it's the better approach; any further suggestions would be
greatly appreciated.
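For intuition, the queue-and-flush pattern used here can be mimicked in
userspace: bundle the probe arguments and result into an on-stack struct, hand
it to a worker, and wait for completion before the stack frame goes away. The
sketch below is only an illustrative analogue, not kernel code: it uses
pthreads instead of workqueues, and every name in it (probe_work, fake_probe,
run_probe_via_worker) is invented for this example.

```c
#include <pthread.h>
#include <stddef.h>

/*
 * Userspace analogue of the on-stack work pattern: the kernel patch
 * uses struct work_struct with queue_work_node() and flush_work();
 * here a pthread plays the role of the workqueue worker.
 */
struct probe_work {
	int (*probe_fn)(void *arg);	/* stands in for local_pci_probe() */
	void *arg;
	int result;
};

/* Worker body, analogous to pci_probe_work_func(). */
static void *probe_worker(void *data)
{
	struct probe_work *pw = data;

	pw->result = pw->probe_fn(pw->arg);
	return NULL;
}

/* Dummy probe routine that reports success with a recognizable value. */
static int fake_probe(void *arg)
{
	(void)arg;
	return 42;
}

/*
 * "Queue" the work and wait for it; pthread_join() plays the role of
 * flush_work(). Returns the probe result, like pci_call_probe() does.
 */
int run_probe_via_worker(void)
{
	struct probe_work pw = { .probe_fn = fake_probe, .arg = NULL, .result = -1 };
	pthread_t tid;

	if (pthread_create(&tid, NULL, probe_worker, &pw))
		return -1;
	pthread_join(&tid, NULL);	/* wait before pw leaves scope */
	return pw.result;
}
```

The on-stack struct is safe only because the caller always flushes (joins)
before returning, which is exactly why the kernel patch pairs
INIT_WORK_ONSTACK() with flush_work() and destroy_work_on_stack().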
Test results for that patch:
nvme 0000:01:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 34904955 ns
nvme 0000:02:00.0: CPU: 134, COMM: kworker/u1025:1, probe cost: 34774235 ns
nvme 0000:03:00.0: CPU: 1, COMM: kworker/u1025:4, probe cost: 34573054 ns
Key changes in the patch:
1. Keep the current->flags & PF_WQ_WORKER test to avoid nesting workers.
2. Replace work_on_cpu() with queue_work_node() on system_dfl_wq plus
flush_work(), which keeps NUMA locality while enabling parallel probing when
PROBE_PREFER_ASYNCHRONOUS is disabled.
3. Remove all cpumask operations.
4. Drop cpu_hotplug_disable()/cpu_hotplug_enable(), since both the cpumask
manipulation and work_on_cpu() are gone.
The patch is shown below.
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 7c2d9d5962586..e66a67c48f28d 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -347,10 +347,24 @@ static bool pci_physfn_is_probed(struct pci_dev *dev)
 #endif
 }
 
+struct pci_probe_work {
+	struct work_struct work;
+	struct drv_dev_and_id ddi;
+	int result;
+};
+
+static void pci_probe_work_func(struct work_struct *work)
+{
+	struct pci_probe_work *pw = container_of(work, struct pci_probe_work, work);
+
+	pw->result = local_pci_probe(&pw->ddi);
+}
+
 static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 			  const struct pci_device_id *id)
 {
-	int error, node, cpu;
+	int error, node;
+	struct pci_probe_work pw;
 	struct drv_dev_and_id ddi = { drv, dev, id };
 
 	/*
@@ -361,38 +375,25 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	node = dev_to_node(&dev->dev);
 	dev->is_probed = 1;
 
-	cpu_hotplug_disable();
-
 	/*
 	 * Prevent nesting work_on_cpu() for the case where a Virtual Function
 	 * device is probed from work_on_cpu() of the Physical device.
 	 */
 	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
-	    pci_physfn_is_probed(dev)) {
-		cpu = nr_cpu_ids;
-	} else {
-		cpumask_var_t wq_domain_mask;
-
-		if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
-			error = -ENOMEM;
-			goto out;
-		}
-		cpumask_and(wq_domain_mask,
-			    housekeeping_cpumask(HK_TYPE_WQ),
-			    housekeeping_cpumask(HK_TYPE_DOMAIN));
-
-		cpu = cpumask_any_and(cpumask_of_node(node),
-				      wq_domain_mask);
-		free_cpumask_var(wq_domain_mask);
+	    pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
+		error = local_pci_probe(&ddi);
+		goto out;
 	}
 
-	if (cpu < nr_cpu_ids)
-		error = work_on_cpu(cpu, local_pci_probe, &ddi);
-	else
-		error = local_pci_probe(&ddi);
+	INIT_WORK_ONSTACK(&pw.work, pci_probe_work_func);
+	pw.ddi = ddi;
+	queue_work_node(node, system_dfl_wq, &pw.work);
+	flush_work(&pw.work);
+	error = pw.result;
+	destroy_work_on_stack(&pw.work);
+
 out:
 	dev->is_probed = 0;
-	cpu_hotplug_enable();
 	return error;
 }
 
Best Regards,
Jinhui