Message-Id: <20250529075849.243096-1-zhaochenguang@kylinos.cn>
Date: Thu, 29 May 2025 15:58:49 +0800
From: Chenguang Zhao <zhaochenguang@...inos.cn>
To: Saeed Mahameed <saeedm@...dia.com>,
	Leon Romanovsky <leon@...nel.org>,
	Tariq Toukan <tariqt@...dia.com>,
	Andrew Lunn <andrew+netdev@...n.ch>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>,
	Paolo Abeni <pabeni@...hat.com>
Cc: Chenguang Zhao <zhaochenguang@...inos.cn>,
	netdev@...r.kernel.org,
	linux-rdma@...r.kernel.org
Subject: [PATCH net v2] net/mlx5: Flag state up only after cmdif is ready

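When the driver is reloading during the recovery flow, it cannot accept
new commands until the command interface is up again. Otherwise, a
command may dereference a NULL pointer while accessing uninitialized
command structures.

Fix this by setting dev->state to MLX5_DEVICE_STATE_UP in
mlx5_function_enable(), only after the command interface state has been
set to MLX5_CMDIF_STATE_UP, instead of early in
mlx5_load_one_devl_locked().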

The issue can be reproduced with the following two scripts:

1) Use the following script to trigger a PCI error.

for((i=1;i<1000;i++));
do
echo 1 > /sys/bus/pci/devices/0000\:01\:00.0/reset
echo "pci reset test $i times"
done

2) Use the following script to read the link speed.

while true; do cat /sys/class/net/eth0/speed &> /dev/null; done

task: ffff885f42820fd0 ti: ffff88603f758000 task.ti: ffff88603f758000
RIP: 0010:[] [] dma_pool_alloc+0x1ab/0x290
RSP: 0018:ffff88603f75baf0 EFLAGS: 00010046
RAX: 0000000000000246 RBX: ffff882f77d90c80 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 00000000000080d0 RDI: ffff882f77d90d10
RBP: ffff88603f75bb20 R08: 0000000000019ba0 R09: ffff88017fc07c00
R10: ffffffffc0a9c384 R11: 0000000000000246 R12: ffff882f77d90d00
R13: 00000000000080d0 R14: ffff882f77d90d10 R15: ffff88340b6c5ea8
FS: 00007efce8330740(0000) GS:ffff885f4da00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000003454fc6000 CR4: 00000000003407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call trace:
 mlx5_alloc_cmd_msg+0xb4/0x2a0 [mlx5_core]
 mlx5_alloc_cmd_msg+0xd3/0x2a0 [mlx5_core]
 cmd_exec+0xcf/0x8a0 [mlx5_core]
 mlx5_cmd_exec+0x33/0x50 [mlx5_core]
 mlx5_core_access_reg+0xf1/0x170 [mlx5_core]
 mlx5_query_port_ptys+0x64/0x70 [mlx5_core]
 mlx5e_get_link_ksettings+0x5c/0x360 [mlx5_core]
 __ethtool_get_link_ksettings+0xa6/0x210
 speed_show+0x78/0xb0
 dev_attr_show+0x23/0x60
 sysfs_read_file+0x99/0x190
 vfs_read+0x9f/0x170
 SyS_read+0x7f/0xe0
 tracesys+0xe3/0xe8

Fixes: a80d1b68c8b7a0 ("net/mlx5: Break load_one into three stages")
Signed-off-by: Chenguang Zhao <zhaochenguang@...inos.cn>
---
v2:
 - Add net subject prefix as Paolo suggested
 - Add Fixes tag
 - Add issue reproduction script and call trace

v1:
 https://lore.kernel.org/all/20250527013723.242599-1-zhaochenguang@kylinos.cn/
---
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 41e8660c819c..713f1f4f2b42 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -1210,6 +1210,9 @@ static int mlx5_function_enable(struct mlx5_core_dev *dev, bool boot, u64 timeou
 	dev->caps.embedded_cpu = mlx5_read_embedded_cpu(dev);
 	mlx5_cmd_set_state(dev, MLX5_CMDIF_STATE_UP);
 
+	/* remove any previous indication of internal error */
+	dev->state = MLX5_DEVICE_STATE_UP;
+
 	err = mlx5_core_enable_hca(dev, 0);
 	if (err) {
 		mlx5_core_err(dev, "enable hca failed\n");
@@ -1602,8 +1605,6 @@ int mlx5_load_one_devl_locked(struct mlx5_core_dev *dev, bool recovery)
 		mlx5_core_warn(dev, "interface is up, NOP\n");
 		goto out;
 	}
-	/* remove any previous indication of internal error */
-	dev->state = MLX5_DEVICE_STATE_UP;
 
 	if (recovery)
 		timeout = mlx5_tout_ms(dev, FW_PRE_INIT_ON_RECOVERY_TIMEOUT);
-- 
2.25.1

