Message-ID: <affc62bb-a824-4433-b746-45bb8dd3654b@gmail.com>
Date: Thu, 4 Dec 2025 11:00:07 +0200
From: Tariq Toukan <ttoukan.linux@...il.com>
To: Gerd Bayer <gbayer@...ux.ibm.com>, Saeed Mahameed <saeedm@...dia.com>,
 Leon Romanovsky <leon@...nel.org>, Tariq Toukan <tariqt@...dia.com>,
 Mark Bloch <mbloch@...dia.com>, Andrew Lunn <andrew+netdev@...n.ch>,
 "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
 Shay Drory <shayd@...dia.com>, Simon Horman <horms@...nel.org>
Cc: Lukas Wunner <lukas@...ner.de>, Bjorn Helgaas <helgaas@...nel.org>,
 Niklas Schnelle <schnelle@...ux.ibm.com>, Farhan Ali <alifm@...ux.ibm.com>,
 netdev@...r.kernel.org, linux-rdma@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-s390@...r.kernel.org,
 linux-pci@...r.kernel.org
Subject: Re: [PATCH net] net/mlx5: Fix double unregister of HCA_PORTS
 component



On 02/12/2025 13:12, Gerd Bayer wrote:
> Clear hca_devcom_comp in the device's private data after unregistering it
> in LAG teardown. Otherwise a slightly lagging second pass through
> mlx5_unload_one() might try to unregister it again and trip over a
> use-after-free.
> 
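> To illustrate the race (a simplified sketch; the actual free happens
> inside mlx5_devcom_unregister_component(), and the interleaving is
> shown schematically):
> 
>   /* pass 1: poll_health() -> mlx5_unload_one() -> LAG teardown */
>   mlx5_devcom_unregister_component(dev->priv.hca_devcom_comp); /* frees it */
>   /* without this patch the pointer now dangles into freed memory,
>    * which the 0x6b slab poison in the oops below confirms */
> 
>   /* pass 2: mlx5_pci_err_detected() -> mlx5_unload_one() -> LAG teardown */
>   mlx5_devcom_unregister_component(dev->priv.hca_devcom_comp); /* UAF */
> 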
> On s390, almost all PCI-level recovery events trigger two passes through
> mlx5_unload_one() - one via the poll_health() method and one via
> mlx5_pci_err_detected() as a callback from generic PCI error recovery.
> While testing PCI error recovery paths with more kernel debug features
> enabled, this issue reproducibly led to kernel panics with the following
> call chain:
> 
>   Unable to handle kernel pointer dereference in virtual kernel address space
>   Failing address: 6b6b6b6b6b6b6000 TEID: 6b6b6b6b6b6b6803 ESOP-2 FSI
>   Fault in home space mode while using kernel ASCE.
>   AS:00000000705c4007 R3:0000000000000024
>   Oops: 0038 ilc:3 [#1]SMP
> 
>   CPU: 14 UID: 0 PID: 156 Comm: kmcheck Kdump: loaded Not tainted
>        6.18.0-20251130.rc7.git0.16131a59cab1.300.fc43.s390x+debug #1 PREEMPT
> 
>   Krnl PSW : 0404e00180000000 0000020fc86aa1dc (__lock_acquire+0x5c/0x15f0)
>              R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>   Krnl GPRS: 0000000000000000 0000020f00000001 6b6b6b6b6b6b6c33 0000000000000000
>              0000000000000000 0000000000000000 0000000000000001 0000000000000000
>              0000000000000000 0000020fca28b820 0000000000000000 0000010a1ced8100
>              0000010a1ced8100 0000020fc9775068 0000018fce14f8b8 0000018fce14f7f8
>   Krnl Code: 0000020fc86aa1cc: e3b003400004        lg      %r11,832
>              0000020fc86aa1d2: a7840211           brc     8,0000020fc86aa5f4
>             *0000020fc86aa1d6: c09000df0b25       larl    %r9,0000020fca28b820
>             >0000020fc86aa1dc: d50790002000       clc     0(8,%r9),0(%r2)
>              0000020fc86aa1e2: a7840209           brc     8,0000020fc86aa5f4
>              0000020fc86aa1e6: c0e001100401       larl    %r14,0000020fca8aa9e8
>              0000020fc86aa1ec: c01000e25a00       larl    %r1,0000020fca2f55ec
>              0000020fc86aa1f2: a7eb00e8           aghi    %r14,232
> 
>   Call Trace:
>    __lock_acquire+0x5c/0x15f0
>    lock_acquire.part.0+0xf8/0x270
>    lock_acquire+0xb0/0x1b0
>    down_write+0x5a/0x250
>    mlx5_detach_device+0x42/0x110 [mlx5_core]
>    mlx5_unload_one_devl_locked+0x50/0xc0 [mlx5_core]
>    mlx5_unload_one+0x42/0x60 [mlx5_core]
>    mlx5_pci_err_detected+0x94/0x150 [mlx5_core]
>    zpci_event_attempt_error_recovery+0xcc/0x388
> 
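> Clearing the pointer turns the second unregister attempt into a no-op:
> mlx5_devcom_unregister_component() returns early for a NULL (or error)
> pointer. From memory, the guard in lib/devcom.c looks roughly like
> this:
> 
>   void mlx5_devcom_unregister_component(struct mlx5_devcom_comp_dev *devcom)
>   {
>   	if (!IS_ERR_OR_NULL(devcom))
>   		devcom_free_comp_dev(devcom);
>   }
> 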
> Fixes: 5a977b5833b7 ("net/mlx5: Lag, move devcom registration to LAG layer")
> Signed-off-by: Gerd Bayer <gbayer@...ux.ibm.com>
> ---
> Hi Shay et al,
> 
> While checking for potential regressions from Lukas Wunner's recent work
> on pci_save/restore_state() for the recoverability of mlx5 functions, I
> consistently hit this bug. (Bjorn has queued that work up for 6.19,
> according to [0] and [1].)
> 
> Apparently, the issue is unrelated to Lukas' work and can be reproduced
> on plain master. It appears to be timing-sensitive, since it shows up
> only when I use s390's debug_defconfig, but I think it needs fixing
> anyhow, as timing can change for other reasons, too.
> 
> I've spotted two additional places in
> drivers/net/ethernet/mellanox/mlx5/core/lib/sd.c where the devcom
> reference is not cleared after calling
> mlx5_devcom_unregister_component(). I have not addressed those with a
> patch, since I'm unclear about how to test these paths; see the sketch
> below for what the change would presumably look like.
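> 
> For reference, the analogous change there would presumably mirror the
> LAG fix below (untested sketch; I haven't checked the exact struct and
> field names sd.c uses to hold its devcom handle, so treat them as
> placeholders):
> 
>   mlx5_devcom_unregister_component(sd->devcom);
>   sd->devcom = NULL;	/* drop the stale reference, as in lag.c */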
> 
> Thanks,
> Gerd
> 
> [0] https://lore.kernel.org/all/cover.1760274044.git.lukas@wunner.de/
> [1] https://lore.kernel.org/linux-pci/cover.1763483367.git.lukas@wunner.de/
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> index 3db0387bf6dcb727a65df9d0253f242554af06db..8ec04a5f434dd4f717d6d556649fcc2a584db847 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lag/lag.c
> @@ -1413,6 +1413,7 @@ static int __mlx5_lag_dev_add_mdev(struct mlx5_core_dev *dev)
>   static void mlx5_lag_unregister_hca_devcom_comp(struct mlx5_core_dev *dev)
>   {
>   	mlx5_devcom_unregister_component(dev->priv.hca_devcom_comp);
> +	dev->priv.hca_devcom_comp = NULL;
>   }
>   
>   static int mlx5_lag_register_hca_devcom_comp(struct mlx5_core_dev *dev)
> 
> ---
> base-commit: 4a26e7032d7d57c998598c08a034872d6f0d3945
> change-id: 20251202-fix_lag-6a59b39a0b3c
> 
> Best regards,

Thanks for your patch.
Acked-by: Tariq Toukan <tariqt@...dia.com>

