Message-ID: <fe8a89b501e44737821fe8b0ab4492e9@huawei.com>
Date: Wed, 14 Jan 2026 09:49:44 +0000
From: Kangfenglong <kangfenglong@...wei.com>
To: "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "shenyang
(M)" <shenyang39@...wei.com>, "Zengtao (B)" <prime.zeng@...ilicon.com>,
"shenjian (K)" <shenjian15@...wei.com>, "Wangyu (Eric)"
<seven.wangyu@...wei.com>
Subject: [BUG] PCI/DPC: NULL pointer dereference in pci_bus_read_config_dword
during DPC recovery racing with hotplug
Hello Linux PCI maintainers,
I'm reporting a suspected use-after-free bug in the PCIe DPC recovery
path that occurs when DPC recovery races with hotplug removal.
It results in a NULL pointer dereference in pci_bus_read_config_dword
and a kernel panic.
## 1. Bug Summary
Title: Use-after-free race between DPC recovery and hotplug removal
Root Cause: DPC recovery accesses pci_dev after hotplug freed it
Severity: High (kernel panic, system crash)
Kernel Version: 5.10.0
Architecture: ARM64 (aarch64)
## 2. Timeline Analysis
### 2.1 DPC Event (T=1411s)
[ 1411.421320][ T2808] pcieport 0000:5f:00.0: DPC: containment event
[ 1411.463810][ T2808] pcieport 0000:5f:00.0: PCIe Bus Error: severity=Uncorrected
DPC recovery starts on port 0000:5f:00.0 (thread T2808).
### 2.2 Hotplug Link Down (T=1415s)
[ 1415.621146][ T2807] pcieport 0000:5f:00.0: pciehp: Slot(3): Link Down
Hotplug detects link down on the same slot (thread T2807).
RACE BEGINS: DPC and hotplug run concurrently.
### 2.3 Device Recovery Fails (T=1415s-1479s)
[ 1415.650432][ T8030] mlx5_core 0000:60:00.1: mlx5_error_sw_reset: start
[ 1479.604410][ T8030] mlx5_health_try_recover: health recovery flow aborted
mlx5 driver recovery fails.
### 2.4 Hotplug Removal Starts (T=1479s)
[ 1479.962953][ T2808] pci 0000:60:00.1: AER: can't recover
[ 1479.962961][ T2807] pci 0000:60:00.1: Removing from iommu group 67
Hotplug begins removing device 0000:60:00.1.
### 2.5 DPC Recovery Continues (T=1481s)
[ 1481.412327][ T2808] pci 0000:60:00.0: not ready 1023ms after DPC
DPC recovery still active for sibling device 0000:60:00.0.
### 2.6 Hotplug Re-enumeration (T=1481s)
[ 1481.892474][ T2807] pciehp: Slot(3): Card present
[ 1481.899188][ T2807] pciehp: Slot(3): Link Up
Hotplug re-detects and re-enumerates the device.
### 2.7 The Crash (T=1484s)
[ 1484.742859][ T2808] Unable to handle kernel NULL pointer dereference
[ 1485.009989][ T2808] pc : pci_bus_read_config_dword+0x1b0/0x350
## 3. Call Trace
[ 1484.742859][ T2808] Unable to handle kernel NULL pointer dereference
[ 1484.814910][ T2808] Internal error: Oops: 0000000096000004 [#1] SMP
[ 1485.009989][ T2808] pc : pci_bus_read_config_dword+0x1b0/0x350
[ 1485.015808][ T2808] lr : pci_read_config_dword+0x4c/0xa0
[ 1485.120656][ T2808] pci_bus_read_config_dword+0x1b0/0x350
[ 1485.126130][ T2808] pci_read_config_dword+0x4c/0xa0
[ 1485.131084][ T2808] pci_dev_wait+0xf0/0x230
[ 1485.135341][ T2808] pci_bridge_wait_for_secondary_bus+0x204/0x4f0
[ 1485.141509][ T2808] dpc_reset_link+0x12c/0x534
[ 1485.146027][ T2808] pcie_do_recovery+0x26c/0x6e0
[ 1485.150718][ T2808] dpc_handler+0x90/0x170
[ 1485.154890][ T2808] irq_thread_fn+0x50/0x180
[ 1485.159234][ T2808] irq_thread+0x144/0x210
[ 1485.163407][ T2808] kthread+0x190/0x210
[ 1485.167318][ T2808] ret_from_fork+0x10/0x18
## 4. Root Cause
I suspect that a use-after-free race condition causes this issue:
Thread T2808 (DPC):
dpc_handler() -> pcie_do_recovery() -> dpc_reset_link() ->
pci_bridge_wait_for_secondary_bus() -> pci_dev_wait() ->
pci_read_config_dword()
Thread T2807 (hotplug):
Detects link down -> removes device -> frees pci_dev ->
re-enumerates device -> creates new pci_dev
DPC recovery assumes the pci_dev stays valid while it waits for the
secondary bus, but hotplug can free and recreate that pci_dev concurrently.
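
To make the suspected lifetime problem concrete, here is a minimal userspace sketch of the reference-counting pattern that pci_dev_get()/pci_dev_put() provide in the kernel. It is not kernel code: struct model_dev, the model_* functions and the "removed" flag are hypothetical names invented for this illustration, the counter is not atomic, and the model is single-threaded. It only shows that a reference held by the recovery path keeps the object allocated after the hotplug path drops its own reference, so the waiter sees a "removed" state instead of freed memory. Whether this matches the actual crash is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device object; NOT struct pci_dev. */
struct model_dev {
    int refs;      /* plain reference count (the kernel uses atomic kobject refs) */
    bool removed;  /* set once the "hotplug" path has unlinked the device */
};

/* Number of objects allocated and not yet freed, for observation only. */
static int model_live_objects;

/* Allocate a device; the initial reference belongs to the "bus". */
struct model_dev *model_dev_alloc(void)
{
    struct model_dev *d = calloc(1, sizeof(*d));

    if (!d)
        return NULL;
    d->refs = 1;
    model_live_objects++;
    return d;
}

/* Take an extra reference, in the spirit of pci_dev_get(). */
struct model_dev *model_dev_get(struct model_dev *d)
{
    if (d)
        d->refs++;
    return d;
}

/* Drop a reference, in the spirit of pci_dev_put(); free on the last put. */
void model_dev_put(struct model_dev *d)
{
    if (d && --d->refs == 0) {
        free(d);
        model_live_objects--;
    }
}

/* "Hotplug" removal: mark the device gone and drop the bus reference. */
void model_hotplug_remove(struct model_dev *d)
{
    d->removed = true;
    model_dev_put(d);
}

/*
 * "Recovery" poll step: the caller must hold its own reference.
 * Returns 0 while the device is present, -1 once it has been removed.
 */
int model_recovery_poll(const struct model_dev *d)
{
    return d->removed ? -1 : 0;
}
```

In this model, a recovery path that calls model_dev_get() before waiting and model_dev_put() afterwards can still safely read the object after model_hotplug_remove() has run, and can abort when model_recovery_poll() returns -1. Holding a reference only addresses the object's lifetime; it does not by itself serialize DPC recovery against hotplug.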
## 5. System Information
Hardware: HiSilicon ARM64 server
CPU: ARM64, core 11 affected
NIC: Mellanox ConnectX-6 (0000:60:00.0/1)
Upstream Port: 0000:5f:00.0
Kernel: 5.10.0
Modules: mlx5_core,
Config: CONFIG_HOTPLUG_PCI_PCIE=y, CONFIG_PCIE_DPC=y
## 6. Questions
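1. Is there any existing synchronization between DPC recovery and hotplug?
2. Should the DPC recovery path hold a reference with pci_dev_get()/pci_dev_put()?
3. Have similar races been fixed in newer kernels?
4. What is the preferred approach to fixing this?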
Thank you.
Best regards