Message-ID: <20251230215241.GA130710@bhelgaas>
Date: Tue, 30 Dec 2025 15:52:41 -0600
From: Bjorn Helgaas <helgaas@...nel.org>
To: Jinhui Guo <guojinhui.liam@...edance.com>
Cc: tj@...nel.org, alexander.h.duyck@...ux.intel.com, bhelgaas@...gle.com,
	bvanassche@....org, dan.j.williams@...el.com,
	gregkh@...uxfoundation.org, linux-kernel@...r.kernel.org,
	linux-pci@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH] PCI: Avoid work_on_cpu() in async probe workers

On Tue, Dec 30, 2025 at 10:27:36PM +0800, Jinhui Guo wrote:
> On Mon, Dec 29, 2025 at 08:08:57AM -1000, Tejun Heo wrote:
> > On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> > > To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> > > already running inside an unbound asynchronous worker. Because a driver
> > > can be probed asynchronously either by probe_type or by the kernel command
> > > line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> > > the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> > > executing within an unbound workqueue worker and should skip the extra
> > > work_on_cpu() call.
> > 
> > Why not just use queue_work_on() on system_dfl_wq (or any other unbound
> > workqueue)? Those are soft-affine to cache domain but can overflow to other
> > CPUs?
> 
> Hi Tejun,
> 
> Thank you for your time and helpful suggestions.
> I had considered replacing work_on_cpu() with queue_work_on(system_dfl_wq) +
> flush_work(), but that would be a refactor rather than a fix for the specific
> problem we hit.
> 
> Let me restate the issue:
> 
> 1. With PROBE_PREFER_ASYNCHRONOUS enabled, the driver core queues work on
>    async_wq to speed up driver probe.
> 2. The PCI core then calls work_on_cpu() to pin the probe to a CPU on the PCI
>    device’s NUMA node, but it always picks the same CPU for every device on
>    that node, forcing those probes to run serially.
> 
> Therefore I test current->flags & PF_WQ_WORKER to detect that we are already
> inside an async_wq worker and skip the extra nested work_on_cpu() call.
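> 
> In code, the core of that guard is roughly the following (a trimmed
> sketch of the patch shown further below):
> 
>         if (current->flags & PF_WQ_WORKER) {
>                 /* Already on a workqueue worker (e.g. async probe):
>                  * probe directly instead of requeueing. */
>                 error = local_pci_probe(&ddi);
>                 goto out;
>         }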
> 
> I agree with your point—using queue_work_on(system_dfl_wq) + flush_work()
> would be cleaner and would let different vendors’ drivers probe in parallel
> instead of fighting over the same CPU. I’ve prepared and tested another patch,
> but I’m still unsure it’s the better approach; any further suggestions would
> be greatly appreciated.
> 
> Test results for that patch:
>   nvme 0000:01:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 34904955 ns
>   nvme 0000:02:00.0: CPU: 134, COMM: kworker/u1025:1, probe cost: 34774235 ns
>   nvme 0000:03:00.0: CPU: 1, COMM: kworker/u1025:4, probe cost: 34573054 ns
> 
> Key changes in the patch:
> 
> 1. Keep the current->flags & PF_WQ_WORKER test to avoid nested workers.
> 2. Replace work_on_cpu() with queue_work_node(system_dfl_wq) + flush_work()
>    to enable parallel probing when PROBE_PREFER_ASYNCHRONOUS is disabled.
> 3. Remove all cpumask operations.
> 4. Drop cpu_hotplug_disable() since both cpumask manipulation and work_on_cpu()
>    are gone.
> 
> The patch is shown below.

I love this patch because it makes pci_call_probe() so much simpler.

I *would* like a short higher-level description of the issue that
doesn't assume so much workqueue background.

I'm not an expert, but IIUC __driver_attach() schedules async workers
so driver probes can run in parallel; the problem is that the workers
for devices on node X are currently serialized because they all bind
to the same CPU on that node.
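
If I read the code being removed correctly, the serialization comes
from the deterministic CPU pick in pci_call_probe(), roughly:

        cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask);
        error = work_on_cpu(cpu, local_pci_probe, &ddi);

cpumask_any_and() effectively returns the first eligible CPU, so every
device on the node funnels its probe through that one CPU's worker.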

Naive question: It looks like async_schedule_dev() already schedules
an async worker on the device node, so why does pci_call_probe() need
to use queue_work_node() again?

pci_call_probe() dates to 2005 (d42c69972b85 ("[PATCH] PCI: Run PCI
driver initialization on local node")), but async_schedule_dev() looks
like it was only added in 2019 (c37e20eaf4b2 ("driver core:
Attach devices on CPU local to device node")).  Maybe the
pci_call_probe() node awareness is no longer necessary?
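
If async_schedule_dev() already lands async probes on the right node,
then (naively) pci_call_probe() might not need to requeue at all and
could collapse to roughly:

        dev->is_probed = 1;
        error = local_pci_probe(&ddi);
        dev->is_probed = 0;

but I haven't checked what the synchronous probe paths would lose.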

> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 7c2d9d5962586..e66a67c48f28d 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -347,10 +347,24 @@ static bool pci_physfn_is_probed(struct pci_dev *dev)
>  #endif
>  }
> 
> +struct pci_probe_work {
> +       struct work_struct work;
> +       struct drv_dev_and_id ddi;
> +       int result;
> +};
> +
> +static void pci_probe_work_func(struct work_struct *work)
> +{
> +       struct pci_probe_work *pw = container_of(work, struct pci_probe_work, work);
> +
> +       pw->result = local_pci_probe(&pw->ddi);
> +}
> +
>  static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>                           const struct pci_device_id *id)
>  {
>         int error, node, cpu;
> +       struct pci_probe_work pw;
>         struct drv_dev_and_id ddi = { drv, dev, id };
> 
>         /*
> @@ -361,38 +375,25 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
>         node = dev_to_node(&dev->dev);
>         dev->is_probed = 1;
> 
> -       cpu_hotplug_disable();
> -
>         /*
>          * Prevent nesting work_on_cpu() for the case where a Virtual Function
>          * device is probed from work_on_cpu() of the Physical device.
>          */
>         if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
> -           pci_physfn_is_probed(dev)) {
> -               cpu = nr_cpu_ids;
> -       } else {
> -               cpumask_var_t wq_domain_mask;
> -
> -               if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
> -                       error = -ENOMEM;
> -                       goto out;
> -               }
> -               cpumask_and(wq_domain_mask,
> -                           housekeeping_cpumask(HK_TYPE_WQ),
> -                           housekeeping_cpumask(HK_TYPE_DOMAIN));
> -
> -               cpu = cpumask_any_and(cpumask_of_node(node),
> -                                     wq_domain_mask);
> -               free_cpumask_var(wq_domain_mask);
> +           pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
> +               error = local_pci_probe(&ddi);
> +               goto out;
>         }
> 
> -       if (cpu < nr_cpu_ids)
> -               error = work_on_cpu(cpu, local_pci_probe, &ddi);
> -       else
> -               error = local_pci_probe(&ddi);
> +       INIT_WORK_ONSTACK(&pw.work, pci_probe_work_func);
> +       pw.ddi = ddi;
> +       queue_work_node(node, system_dfl_wq, &pw.work);
> +       flush_work(&pw.work);
> +       error = pw.result;
> +       destroy_work_on_stack(&pw.work);
> +
>  out:
>         dev->is_probed = 0;
> -       cpu_hotplug_enable();
>         return error;
>  }
> 
> 
> Best Regards,
> Jinhui
