Message-Id: <20260122145208.1013-1-guojinhui.liam@bytedance.com>
Date: Thu, 22 Jan 2026 22:52:05 +0800
From: "Jinhui Guo" <guojinhui.liam@...edance.com>
To: <dakr@...nel.org>, <alexanderduyck@...com>, <bhelgaas@...gle.com>,
<bvanassche@....org>, <dan.j.williams@...el.com>,
<gregkh@...uxfoundation.org>, <helgaas@...nel.org>, <rafael@...nel.org>,
<tj@...nel.org>, <frederic@...nel.org>
Cc: <guojinhui.liam@...edance.com>, <linux-kernel@...r.kernel.org>,
<linux-pci@...r.kernel.org>
Subject: [PATCH v2 0/3] Add NUMA-node-aware synchronous probing to driver core
Hi all,
** Overview **
This patchset introduces NUMA-node-aware synchronous probing.
Drivers can initialize and allocate memory on the device’s local
node without scattering kmalloc_node() calls throughout the code.
NUMA-aware probing was added to PCI drivers in 2005 and has
benefited them ever since.
The asynchronous probe path already supports NUMA-node-aware
probing via async_schedule_dev() in the driver core. Since NUMA
affinity is orthogonal to sync/async probing, this patchset adds
NUMA-node-aware support to the synchronous probe path.
** Background **
The idea arose from a discussion with Bjorn and Danilo about a
PCI-probe issue [1]:
when PCI devices on the same NUMA node are probed asynchronously,
pci_call_probe() calls work_on_cpu(), which pins every probe worker
to the same CPU within that node and forces the probes to run serially.
Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
2.4 GHz processor (all on CPU 0):
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
Since the driver core already provides NUMA-node-aware asynchronous
probing, we can extend the same capability to the synchronous probe
path. This solves the issue and lets other drivers benefit from
NUMA-local initialization as well.
[1] https://lore.kernel.org/all/20251227113326.964-1-guojinhui.liam@bytedance.com/
** Changes **
The series makes three main changes:
1. Adds helper __device_attach_driver_scan() to eliminate duplication
between __device_attach() and __device_attach_async_helper().
2. Introduces helper __driver_probe_device_node() and uses it to enable
NUMA-local synchronous probing in __device_attach(), device_driver_attach(),
and __driver_attach().
3. Removes the now-redundant NUMA code from the PCI driver.
** Test **
I added debug prints to nvme, mlx5, usbhid, and intel_rapl_msr and
ran tests on an AMD EPYC 9A64 system (the results below are updated
for this version of the patchset):
NUMA topology of the test machine:
# lscpu | grep NUMA
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
1. Without the patchset
- PCI drivers (nvme, mlx5) probe sequentially on CPU 0
- USB and platform drivers pick random CPUs in the udev worker
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, cost: 54013202 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, cost: 53968911 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:4, cost: 48077276 ns
mlx5_core 0000:41:00.0: CPU: 0, COMM: kworker/0:2 cost: 506256717 ns
mlx5_core 0000:41:00.1: CPU: 0, COMM: kworker/0:2 cost: 514289394 ns
usb 1-2.4: CPU: 163, COMM: (udev-worker), cost 854131 ns
usb 1-2.6: CPU: 163, COMM: (udev-worker), cost 967993 ns
intel_rapl_msr intel_rapl_msr.0: CPU: 61, COMM: (udev-worker), cost: 3717567 ns
2. With the patchset
- PCI probes are spread across CPUs inside the device’s NUMA node
- Asynchronous nvme probes are ~35% faster; synchronous mlx5 times
are unchanged
- USB probe times are virtually identical
- Platform driver (no NUMA node) falls back to the original path
nvme 0000:01:00.0: CPU: 3, COMM: kworker/u1025:1, cost: 34244206 ns
nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, cost: 33883391 ns
nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, cost: 33943040 ns
mlx5_core 0000:41:00.0: CPU: 3, COMM: kworker/u1025:1, cost: 507206174 ns
mlx5_core 0000:41:00.1: CPU: 3, COMM: kworker/u1025:1, cost: 514927642 ns
usb 1-2.4: CPU: 4, COMM: kworker/u1025:8, cost: 991417 ns
usb 1-2.6: CPU: 2, COMM: kworker/u1025:5, cost: 935112 ns
intel_rapl_msr intel_rapl_msr.0: CPU: 17, COMM: (udev-worker), cost: 4849967 ns
3. With the patchset, unbind/bind cycles also spread PCI probes across
CPUs within the device’s NUMA node:
nvme 0000:02:00.0: CPU: 130, COMM: kworker/u1025:4, cost: 35086209 ns
** Final **
Comments and suggestions are welcome.
Best Regards,
Jinhui
---
v1: https://lore.kernel.org/all/20260107175548.1792-1-guojinhui.liam@bytedance.com/
Changes in v1 -> v2:
- Reword the first patch’s commit message for accuracy and add
Reviewed-by tags; no code changes.
- Refactor the second patch to reduce complexity: introduce
__driver_probe_device_node() and update the signature of
driver_probe_device() to support NUMA-node-aware synchronous
probing. (suggested by Danilo)
- The third patch resolves conflicts with three patches from
patchset [2] that have since been merged into linux-next.git.
- Update the test data in the cover letter for the new patchset.
[2] https://lore.kernel.org/all/20260101221359.22298-1-frederic@kernel.org/
Jinhui Guo (3):
driver core: Introduce helper function __device_attach_driver_scan()
driver core: Add NUMA-node awareness to the synchronous probe path
PCI: Clean up NUMA-node awareness in pci_bus_type probe
drivers/base/dd.c | 147 +++++++++++++++++++++++++++++----------
drivers/pci/pci-driver.c | 116 +++---------------------------
include/linux/pci.h | 4 --
kernel/sched/isolation.c | 2 -
4 files changed, 118 insertions(+), 151 deletions(-)
--
2.20.1