Message-Id: <20250529075849.243096-1-zhaochenguang@kylinos.cn>
Date: Thu, 29 May 2025 15:58:49 +0800
From: Chenguang Zhao <zhaochenguang@...inos.cn>
To: Saeed Mahameed <saeedm@...dia.com>,
Leon Romanovsky <leon@...nel.org>,
Tariq Toukan <tariqt@...dia.com>,
Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>
Cc: Chenguang Zhao <zhaochenguang@...inos.cn>,
netdev@...r.kernel.org,
linux-rdma@...r.kernel.org
Subject: [PATCH net v2] net/mlx5: Flag state up only after cmdif is ready
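When the driver reloads during the recovery flow, it must not accept new
commands until the command interface is up again. Otherwise it may
dereference a NULL pointer while accessing uninitialized command
structures. Fix this by moving the dev->state = MLX5_DEVICE_STATE_UP
assignment out of mlx5_load_one_devl_locked() and into
mlx5_function_enable(), right after the command interface state is set
to MLX5_CMDIF_STATE_UP.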
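The ordering problem can be illustrated with a small user-space model.
This is only a sketch: struct model_dev, struct cmd_if, model_query() and
the two reload helpers are hypothetical stand-ins, not mlx5 code, and the
model assumes that callers are turned away while the device state is not
up. A reader that probes in the middle of the reload sees a window in the
old ordering where the state says "up" but the command structures are
still missing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver state; not the real mlx5 types. */
enum dev_state { DEV_STATE_INTERNAL_ERROR, DEV_STATE_UP };
enum query_result { QUERY_OK, QUERY_REJECTED, QUERY_WOULD_CRASH };

struct cmd_if { int placeholder; };   /* models the command structures */

struct model_dev {
	enum dev_state state;
	struct cmd_if *cmd;           /* NULL until the cmdif is initialized */
};

/* Models a reader (e.g. a speed query) that only checks the device state. */
static enum query_result model_query(const struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->state != DEV_STATE_UP)
		return QUERY_REJECTED;    /* assumed: commands fail fast here */
	if (dev->cmd == NULL)
		return QUERY_WOULD_CRASH; /* a real driver would oops here */
	return QUERY_OK;
}

/* Old ordering: state is flagged up before the cmdif is ready. */
static enum query_result reload_state_first(struct model_dev *dev,
					    struct cmd_if *cmd)
{
	enum query_result mid;

	dev->state = DEV_STATE_UP;
	mid = model_query(dev);       /* what a concurrent reader would see */
	dev->cmd = cmd;
	return mid;
}

/* New ordering: the cmdif is ready before the state is flagged up. */
static enum query_result reload_cmdif_first(struct model_dev *dev,
					    struct cmd_if *cmd)
{
	enum query_result mid;

	dev->cmd = cmd;
	mid = model_query(dev);       /* reader is rejected, not crashed */
	dev->state = DEV_STATE_UP;
	return mid;
}
```

In this model, the mid-reload probe reports QUERY_WOULD_CRASH for the
old ordering and QUERY_REJECTED for the new one. Both orderings answer
QUERY_OK once the reload has finished.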
The issue can be reproduced by running the following two scripts in
parallel:
1) Use the following script to trigger a PCI error.
for((i=1;i<1000;i++));
do
echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
echo "pci reset test $i times"
done
2) Use the following script to read the link speed.
while true; do cat /sys/class/net/eth0/speed &> /dev/null; done
task: ffff885f42820fd0 ti: ffff88603f758000 task.ti: ffff88603f758000
RIP: 0010:[] [] dma_pool_alloc+0x1ab/0x290
RSP: 0018:ffff88603f75baf0 EFLAGS: 00010046
RAX: 0000000000000246 RBX: ffff882f77d90c80 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00000000000080d0 RDI: ffff882f77d90d10
RBP: ffff88603f75bb20 R08: 0000000000019ba0 R09: ffff88017fc07c00
R10: ffffffffc0a9c384 R11: 0000000000000246 R12: ffff882f77d90d00
R13: 00000000000080d0 R14: ffff882f77d90d10 R15: ffff88340b6c5ea8
FS: 00007efce8330740(0000) GS:ffff885f4da00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000003454fc6000 CR4: 00000000003407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call trace:
mlx5_alloc_cmd_msg+0xb4/0x2a0 [mlx5_core]
mlx5_alloc_cmd_msg+0xd3/0x2a0 [mlx5_core]
cmd_exec+0xcf/0x8a0 [mlx5_core]
mlx5_cmd_exec+0x33/0x50 [mlx5_core]
mlx5_core_access_reg+0xf1/0x170 [mlx5_core]
mlx5_query_port_ptys+0x64/0x70 [mlx5_core]
mlx5e_get_link_ksettings+0x5c/0x360 [mlx5_core]
__ethtool_get_link_ksettings+0xa6/0x210
speed_show+0x78/0xb0
dev_attr_show+0x23/0x60
sysfs_read_file+0x99/0x190
vfs_read+0x9f/0x170
SyS_read+0x7f/0xe0
tracesys+0xe3/0xe8
Fixes: a80d1b68c8b7a0 ("net/mlx5: Break load_one into three stages")
Signed-off-by: Chenguang Zhao <zhaochenguang@...inos.cn>
---
v2:
- Add net subject prefix as Paolo suggested
- Add Fixes tag
- Add issue reproduction script and call trace
v1:
https://lore.kernel.org/all/20250527013723.242599-1-zhaochenguang@kylinos.cn/
---
drivers/net/ethernet/mellanox/mlx5/core/main.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 41e8660c819c..713f1f4f2b42 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1210,6 +1210,9 @@ static int mlx5_function_enable(struct mlx5_core_dev *dev, bool boot, u64 timeou
dev->caps.embedded_cpu = mlx5_read_embedded_cpu(dev);
mlx5_cmd_set_state(dev, MLX5_CMDIF_STATE_UP);
+ /* remove any previous indication of internal error */
+ dev->state = MLX5_DEVICE_STATE_UP;
+
err = mlx5_core_enable_hca(dev, 0);
if (err) {
mlx5_core_err(dev, "enable hca failed\n");
@@ -1602,8 +1605,6 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
mlx5_core_warn(dev, "interface is up, NOP\n");
goto out;
}
- /* remove any previous indication of internal error */
- dev->state = MLX5_DEVICE_STATE_UP;
if (recovery)
timeout = mlx5_tout_ms(dev, FW_PRE_INIT_ON_RECOVERY_TIMEOUT);
--
2.25.1