Message-Id: <20260107175548.1792-1-guojinhui.liam@bytedance.com>
Date: Thu, 8 Jan 2026 01:55:45 +0800
From: "Jinhui Guo" <guojinhui.liam@...edance.com>
To: <dakr@...nel.org>, <alexander.h.duyck@...ux.intel.com>,
<alexanderduyck@...com>, <bhelgaas@...gle.com>, <bvanassche@....org>,
<dan.j.williams@...el.com>, <gregkh@...uxfoundation.org>,
<helgaas@...nel.org>, <rafael@...nel.org>, <tj@...nel.org>
Cc: <guojinhui.liam@...edance.com>, <linux-kernel@...r.kernel.org>,
<linux-pci@...r.kernel.org>
Subject: [PATCH 0/3] Add NUMA-node-aware synchronous probing to driver core

Hi all,
** Overview **
This patchset introduces NUMA-node-aware synchronous probing.
Drivers can initialize and allocate memory on the device’s local
node without scattering kmalloc_node() calls throughout the code.
NUMA-aware probing was added to PCI drivers in 2005 and has
benefited them ever since.
The asynchronous probe path already supports NUMA-node-aware
probing via async_schedule_dev() in the driver core. Since NUMA
affinity is orthogonal to sync/async probing, this patchset adds
NUMA-node-aware support to the synchronous probe path.
** Background **
The idea arose from a discussion with Bjorn and Danilo about a
PCI-probe issue [1]:
when PCI devices on the same NUMA node are probed asynchronously,
pci_call_probe() calls work_on_cpu(), which pins every probe worker
to the same CPU inside that node and forces the probes to run serially.
Probing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
2.4 GHz processor shows all three probes running on CPU 0:
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
Since the driver core already provides NUMA-node-aware asynchronous
probing, we can extend the same capability to the synchronous probe
path. This solves the issue and lets other drivers benefit from
NUMA-local initialization as well.
[1] https://lore.kernel.org/all/20251227113326.964-1-guojinhui.liam@bytedance.com/
** Changes **
The series makes three main changes:
1. Adds helper __device_attach_driver_scan() to eliminate duplication
between __device_attach() and __device_attach_async_helper().
2. Introduces a NUMA-node-aware execution mechanism and uses it to
enable NUMA-local synchronous probing in __device_attach(),
device_driver_attach(), and __driver_attach().
3. Removes the now-redundant NUMA code from the PCI driver.
** Test **
I added debug prints to nvme, mlx5, usbhid, and intel_rapl_msr and
ran tests on an AMD EPYC 9A64 system:
1. Without the patchset
- PCI drivers (nvme, mlx5) probe sequentially on CPU 0
- USB and platform drivers probe on whichever CPU the udev worker
happens to be running on
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, cost: 54013202 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, cost: 53968911 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:4, cost: 48077276 ns
mlx5_core 0000:41:00.0: CPU: 0, COMM: kworker/0:2, cost: 506256717 ns
mlx5_core 0000:41:00.1: CPU: 0, COMM: kworker/0:2, cost: 514289394 ns
usb 1-2.4: CPU: 163, COMM: (udev-worker), cost: 854131 ns
usb 1-2.6: CPU: 163, COMM: (udev-worker), cost: 967993 ns
intel_rapl_msr intel_rapl_msr.0: CPU: 61, COMM: (udev-worker), cost: 3717567 ns
2. With the patchset
- PCI probes are spread across CPUs inside the device’s NUMA node
- Asynchronous nvme probes are ~35 % faster; synchronous mlx5 times
are unchanged
- USB probe times are virtually identical
- Platform driver (no NUMA node) falls back to the original path
nvme 0000:01:00.0: CPU: 130, COMM: kworker/u1025:0, cost: 35074561 ns
nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:6, cost: 34612117 ns
nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:5, cost: 34802918 ns
mlx5_core 0000:41:00.0: CPU: 128, COMM: kworker/u1025:0, cost: 506214576 ns
mlx5_core 0000:41:00.1: CPU: 128, COMM: kworker/u1025:0, cost: 514273565 ns
usb 1-2.4: CPU: 51, COMM: kworker/u1031:2, cost: 933581 ns
usb 1-2.6: CPU: 51, COMM: kworker/u1031:2, cost: 957237 ns
intel_rapl_msr intel_rapl_msr.0: CPU: 225, COMM: (udev-worker), cost: 4715967 ns
3. With the patchset, unbind/bind cycles also spread PCI probes across
CPUs within the device’s NUMA node:
nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:4, cost: 37070897 ns
** Final **
Comments and suggestions are welcome.
Best Regards,
Jinhui
---
Jinhui Guo (3):
  driver core: Introduce helper function __device_attach_driver_scan()
  driver core: Add NUMA-node awareness to the synchronous probe path
  PCI: Clean up NUMA-node awareness in pci_bus_type probe

 drivers/base/dd.c        | 173 +++++++++++++++++++++++++++++++--------
 drivers/pci/pci-driver.c |  83 ++-----------------
 include/linux/pci.h      |   1 -
 3 files changed, 148 insertions(+), 109 deletions(-)
--
2.20.1