Message-Id: <20251231075105.1368-1-guojinhui.liam@bytedance.com>
Date: Wed, 31 Dec 2025 15:51:05 +0800
From: "Jinhui Guo" <guojinhui.liam@...edance.com>
To: <helgaas@...nel.org>
Cc: <alexander.h.duyck@...ux.intel.com>, <bhelgaas@...gle.com>, 
	<bvanassche@....org>, <dan.j.williams@...el.com>, 
	<gregkh@...uxfoundation.org>, <guojinhui.liam@...edance.com>, 
	<linux-kernel@...r.kernel.org>, <linux-pci@...r.kernel.org>, 
	<stable@...r.kernel.org>, <tj@...nel.org>
Subject: Re: [PATCH] PCI: Avoid work_on_cpu() in async probe workers

On Tue, Dec 30, 2025 at 03:52:41PM -0600, Bjorn Helgaas wrote:
> On Tue, Dec 30, 2025 at 10:27:36PM +0800, Jinhui Guo wrote:
> > On Mon, Dec 29, 2025 at 08:08:57AM -1000, Tejun Heo wrote:
> > > On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> > > > To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> > > > already running inside an unbound asynchronous worker. Because a driver
> > > > can be probed asynchronously either by probe_type or by the kernel command
> > > > line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test
> > > > the PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> > > > executing within an unbound workqueue worker and should skip the extra
> > > > work_on_cpu() call.
> > > 
> > > Why not just use queue_work_on() on system_dfl_wq (or any other unbound
> > > workqueue)? Those are soft-affine to cache domain but can overflow to other
> > > CPUs?
> > 
> > Hi, Tejun,
> > 
> > Thank you for your time and helpful suggestions.
> > I had considered replacing work_on_cpu() with queue_work_on(system_dfl_wq) +
> > flush_work(), but that would be a refactor rather than a fix for the specific
> > problem we hit.
> > 
> > Let me restate the issue:
> > 
> > 1. With PROBE_PREFER_ASYNCHRONOUS enabled, the driver core queues work on
> >    async_wq to speed up driver probe.
> > 2. The PCI core then calls work_on_cpu() to tie the probe thread to the PCI
> >    device’s NUMA node, but it always picks the same CPU for every device on
> >    that node, forcing the PCI probes to run serially.
> > 
> > Therefore I test current->flags & PF_WQ_WORKER to detect that we are already
> > inside an async_wq worker and skip the extra nested work_on_cpu() call.
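
(For clarity, the check amounts to a fragment like the one below. This is
only an illustrative sketch, not the exact hunk from the patch;
local_pci_probe() and struct drv_dev_and_id are the existing helpers in
drivers/pci/pci-driver.c.)

	/* early in pci_call_probe(), before the node-affine path: */
	if (current->flags & PF_WQ_WORKER) {
		/*
		 * We are already running inside an unbound workqueue worker,
		 * e.g. an async probe scheduled by the driver core, so do not
		 * nest another work_on_cpu() call; probe on the current CPU.
		 */
		return local_pci_probe(&ddi);
	}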
> > 
> > I agree with your point—using queue_work_on(system_dfl_wq) + flush_work()
> > would be cleaner and would let different vendors’ drivers probe in parallel
> > instead of fighting over the same CPU. I’ve prepared and tested another patch,
> > but I’m still unsure it’s the better approach; any further suggestions would
> > be greatly appreciated.
> > 
> > Test results for that patch:
> >   nvme 0000:01:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 34904955 ns
> >   nvme 0000:02:00.0: CPU: 134, COMM: kworker/u1025:1, probe cost: 34774235 ns
> >   nvme 0000:03:00.0: CPU: 1, COMM: kworker/u1025:4, probe cost: 34573054 ns
> > 
> > Key changes in the patch:
> > 
> > 1. Keep the current->flags & PF_WQ_WORKER test to avoid nested workers.
> > 2. Replace work_on_cpu() with queue_work_node(system_dfl_wq) + flush_work()
> >    to enable parallel probing when PROBE_PREFER_ASYNCHRONOUS is disabled.
> > 3. Remove all cpumask operations.
> > 4. Drop cpu_hotplug_disable() since both cpumask manipulation and work_on_cpu()
> >    are gone.
> > 
> > The patch is shown below.
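
(The patch body does not appear in the quote above. The queue_work_node() +
flush_work() pattern from point 2 looks roughly like the sketch below; it is
illustrative only, with a made-up context struct, not the actual diff.
struct drv_dev_and_id and local_pci_probe() are the existing helpers in
drivers/pci/pci-driver.c, and system_dfl_wq is the unbound workqueue Tejun
suggested.)

	struct pci_probe_work {
		struct work_struct work;
		struct drv_dev_and_id ddi;
		long ret;
	};

	static void pci_probe_work_fn(struct work_struct *work)
	{
		struct pci_probe_work *pw =
			container_of(work, struct pci_probe_work, work);

		pw->ret = local_pci_probe(&pw->ddi);
	}

	/* in pci_call_probe(), replacing the work_on_cpu() path: */
	struct pci_probe_work pw = { .ddi = { drv, dev, id } };
	int node = dev_to_node(&dev->dev);

	INIT_WORK_ONSTACK(&pw.work, pci_probe_work_fn);
	/* let the workqueue pick any CPU on the device's node */
	queue_work_node(node, system_dfl_wq, &pw.work);
	flush_work(&pw.work);
	destroy_work_on_stack(&pw.work);
	error = pw.ret;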
> 
> I love this patch because it makes pci_call_probe() so much simpler.
> 
> I *would* like a short higher-level description of the issue that
> doesn't assume so much workqueue background.
> 
> I'm not an expert, but IIUC __driver_attach() schedules async workers
> so driver probes can run in parallel, but the problem is that the
> workers for devices on node X are currently serialized because they
> all bind to the same CPU on that node.
> 
> Naive questions: It looks like async_schedule_dev() already schedules
> an async worker on the device node, so why does pci_call_probe() need
> to use queue_work_node() again?
> 
> pci_call_probe() dates to 2005 (d42c69972b85 ("[PATCH] PCI: Run PCI
> driver initialization on local node")), but the async_schedule_dev()
> looks like it was only added in 2019 (c37e20eaf4b2 ("driver core:
> Attach devices on CPU local to device node")).  Maybe the
> pci_call_probe() node awareness is no longer necessary?

Hi, Bjorn

Thank you for your time and kind reply.

As I see it, two scenarios should be borne in mind:

1. Driver allowed to probe asynchronously
   The driver core schedules the probe in an async worker via
   async_schedule_dev(), which already runs on a CPU close to the
   device's NUMA node, so pci_call_probe() needs no extra
   queue_work_node().

2. Driver not allowed to probe asynchronously
   The driver core (__driver_attach() or __device_attach()) calls
   pci_call_probe() directly, without any async worker from
   async_schedule_dev(). NUMA-node awareness in pci_call_probe()
   is therefore still required.
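
For reference, the two scenarios above correspond roughly to this decision
in the driver core (heavily simplified from drivers/base/dd.c, not verbatim;
matching, locking, refcounting and error handling are omitted):

	static int __driver_attach(struct device *dev, void *data)
	{
		struct device_driver *drv = data;

		if (driver_allows_async_probing(drv)) {
			/*
			 * Scenario 1: the probe runs later in an async worker,
			 * which async_schedule_dev() already places on a CPU
			 * close to dev_to_node(dev).
			 */
			async_schedule_dev(__driver_attach_async_helper, dev);
			return 0;
		}

		/*
		 * Scenario 2: the probe, and thus pci_call_probe(), runs
		 * synchronously in whatever context called us, possibly on a
		 * remote node, so pci_call_probe() still handles placement.
		 */
		driver_probe_device(drv, dev);
		return 0;
	}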

Best Regards,
Jinhui
