netdev - Re: [PATCH net] net/mlx5: Avoid deadlock between PCI error recovery and health reporter

Message-ID: <2fa90540-0560-4d5b-9f3a-27ade6c92d01@nvidia.com>
Date: Mon, 11 Aug 2025 15:04:23 +0300
From: Shay Drori <shayd@...dia.com>
To: Gerd Bayer <gbayer@...ux.ibm.com>, Saeed Mahameed <saeedm@...dia.com>,
	Leon Romanovsky <leon@...nel.org>, Tariq Toukan <tariqt@...dia.com>, "Mark
 Bloch" <mbloch@...dia.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni
	<pabeni@...hat.com>, Moshe Shemesh <moshe@...dia.com>
CC: Niklas Schnelle <schnelle@...ux.ibm.com>, Alexandra Winter
	<wintera@...ux.ibm.com>, Andrew Lunn <andrew+netdev@...n.ch>, "David S.
 Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, "Yuanyuan
 Zhong" <yzhong@...estorage.com>, Mohamed Khalfella
	<mkhalfella@...estorage.com>, <netdev@...r.kernel.org>,
	<linux-rdma@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH net] net/mlx5: Avoid deadlock between PCI error recovery
 and health reporter



On 11/08/2025 10:29, Gerd Bayer wrote:
> On Sun, 2025-08-10 at 14:51 +0300, Shay Drori wrote:
>> On 07/08/2025 16:11, Gerd Bayer wrote:
>>> During error recovery testing a pair of tasks was reported to be hung
>>> due to a dead-lock situation:
>>>
>>> - mlx5_unload_one() trying to acquire devlink lock while the PCI error
>>>     recovery code had acquired the pci_cfg_access_lock().
>>
>> could you please add traces here?
>> I looked at the code and didn't see where pci_cfg_access_lock() is
>> taken...
> 
> Sure thing. This is the original hung task message:
> 
> [10144.859042] mlx5_core 0000:00:00.1: mlx5_health_try_recover:338:(pid 5553): health recovery flow aborted, PCI reads still not working
> [10320.359160] INFO: task kmcheck:72 blocked for more than 122 seconds.
> [10320.359169]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359172] task:kmcheck         state:D stack:0     pid:72    tgid:72    ppid:2      flags:0x00000000
> [10320.359176] Call Trace:
> [10320.359178]  [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359187]  [<000000065256f356>] schedule+0x36/0xe0
> [10320.359189]  [<000000065256f572>] schedule_preempt_disabled+0x22/0x30
> [10320.359192]  [<0000000652570a94>] __mutex_lock.constprop.0+0x484/0x8a8
> [10320.359194]  [<000003ff800673a4>] mlx5_unload_one+0x34/0x58 [mlx5_core]
> [10320.359360]  [<000003ff8006745c>] mlx5_pci_err_detected+0x94/0x140 [mlx5_core]
> [10320.359400]  [<0000000652556c5a>] zpci_event_attempt_error_recovery+0xf2/0x398
> [10320.359406]  [<0000000651b9184a>] __zpci_event_error+0x23a/0x2c0
> [10320.359411]  [<00000006522b3958>] chsc_process_event_information.constprop.0+0x1c8/0x1e8
> [10320.359416]  [<00000006522baf1a>] crw_collect_info+0x272/0x338
> [10320.359418]  [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359422]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359425]  [<0000000652576642>] ret_from_fork+0xa/0x30
> [10320.359440] INFO: task kworker/u1664:6:1514 blocked for more than 122 seconds.
> [10320.359441]       Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359443] task:kworker/u1664:6 state:D stack:0     pid:1514  tgid:1514  ppid:2      flags:0x00000000
> [10320.359447] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
> [10320.359492] Call Trace:
> [10320.359521]  [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359524]  [<000000065256f356>] schedule+0x36/0xe0
> [10320.359526]  [<0000000652172e28>] pci_wait_cfg+0x80/0xe8
> [10320.359532]  [<0000000652172f94>] pci_cfg_access_lock+0x74/0x88
> [10320.359534]  [<000003ff800916b6>] mlx5_vsc_gw_lock+0x36/0x178 [mlx5_core]
> [10320.359585]  [<000003ff80098824>] mlx5_crdump_collect+0x34/0x1c8 [mlx5_core]
> [10320.359637]  [<000003ff80074b62>] mlx5_fw_fatal_reporter_dump+0x6a/0xe8 [mlx5_core]
> [10320.359680]  [<0000000652512242>] devlink_health_do_dump.part.0+0x82/0x168
> [10320.359683]  [<0000000652513212>] devlink_health_report+0x19a/0x230
> [10320.359685]  [<000003ff80075a12>] mlx5_fw_fatal_reporter_err_work+0xba/0x1b0 [mlx5_core]
> [10320.359728]  [<0000000651bbf852>] process_one_work+0x1c2/0x458
> [10320.359733]  [<0000000651bc073e>] worker_thread+0x3ce/0x528
> [10320.359735]  [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359737]  [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359739]  [<0000000652576642>] ret_from_fork+0xa/0x30
> 
> The pci_cfg_access_lock() is acquired in zpci_event_attempt_error_recovery() by way of pci_dev_lock().

Thanks a lot!
Can you please add the above to the commit message?
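
For readers following the two traces, here is a minimal user-space sketch of the wait-for cycle they suggest. It is a model, not kernel code: the enum, the struct, and both helpers are hypothetical names invented for illustration, and each task is simplified to hold at most one lock. That the kworker holds the devlink lock while it waits in pci_cfg_access_lock() is an assumption inferred from devlink_health_report() appearing in its trace; that part of the patch description is snipped in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock identifiers, used only by this model. */
enum lock_id {
	LOCK_NONE = -1,
	LOCK_PCI_CFG_ACCESS = 0,	/* stands in for pci_cfg_access_lock() */
	LOCK_DEVLINK = 1		/* stands in for the devlink lock */
};

/* Simplified task: holds at most one lock, waits for at most one. */
struct task_state {
	const char *name;
	enum lock_id holds;
	enum lock_id waits_for;
};

/* Return the index of the task holding `lock`, or -1 if nobody holds it. */
static int holder_of(const struct task_state *t, size_t n, enum lock_id lock)
{
	if (lock == LOCK_NONE)
		return -1;
	for (size_t i = 0; i < n; i++)
		if (t[i].holds == lock)
			return (int)i;
	return -1;
}

/*
 * Follow "waits for the holder of" edges starting at task `start`.
 * The task is deadlocked if the chain leads back to `start`.
 */
static bool task_is_deadlocked(const struct task_state *t, size_t n,
			       size_t start)
{
	size_t cur = start;

	assert(start < n);
	for (size_t step = 0; step < n; step++) {
		int next = holder_of(t, n, t[cur].waits_for);

		if (next < 0)
			return false;	/* chain ends: someone can make progress */
		if ((size_t)next == start)
			return true;	/* cycle back to the starting task */
		cur = (size_t)next;
	}
	return false;
}
```

With kmcheck holding LOCK_PCI_CFG_ACCESS and waiting for LOCK_DEVLINK, and the kworker holding LOCK_DEVLINK and waiting for LOCK_PCI_CFG_ACCESS, task_is_deadlocked() reports a cycle. If the kworker gives up instead of waiting (waits_for set to LOCK_NONE), the cycle is gone.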

snip
<...>


>>> ---
>>>    drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c | 7 +++----
>>>    1 file changed, 3 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
>>> index 432c98f2626d..d2d3b57a57d5 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
>>> @@ -73,16 +73,15 @@ int mlx5_vsc_gw_lock(struct mlx5_core_dev *dev)
>>>           u32 lock_val;
>>>           int ret;
>>>
>>> +       if (pci_channel_offline(dev->pdev))
>>> +               return -EACCES;
>>> +

There is still a race here.
It is possible that mlx5 has already passed the above check while
zpci_event_attempt_error_recovery() has taken the cfg access lock but has
not yet changed the pdev to the error state :(
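
To make the window concrete, here is a small single-threaded model of that interleaving. It is only a sketch under stated assumptions: struct model_dev, GW_WOULD_BLOCK, and the model_* helpers are hypothetical and do not exist in the kernel, and the blocking pci_cfg_access_lock() call is modeled as a "would block" return value so the ordering can be stepped through deterministically.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the two pieces of state involved in the race. */
struct model_dev {
	bool cfg_access_locked;	/* models pci_cfg_access_lock() being held */
	bool channel_offline;	/* models what pci_channel_offline() reports */
};

/* Hypothetical results for the modeled lock attempt. */
enum { GW_LOCKED = 0, GW_WOULD_BLOCK = 1 };

/* Models the patched ordering: check the channel first, then lock. */
static int model_vsc_gw_lock(struct model_dev *dev)
{
	if (dev->channel_offline)
		return -EACCES;
	if (dev->cfg_access_locked)
		return GW_WOULD_BLOCK;	/* real code would sleep here */
	dev->cfg_access_locked = true;
	return GW_LOCKED;
}

/* Recovery side, step 1: take the config access lock. */
static void model_recovery_take_lock(struct model_dev *dev)
{
	dev->cfg_access_locked = true;
}

/* Recovery side, step 2: only afterwards mark the channel offline. */
static void model_recovery_mark_offline(struct model_dev *dev)
{
	assert(dev->cfg_access_locked);	/* ordering assumed by this model */
	dev->channel_offline = true;
}
```

Calling model_vsc_gw_lock() between model_recovery_take_lock() and model_recovery_mark_offline() yields GW_WOULD_BLOCK: the offline check passes, yet the lock is already held. Only after the second recovery step does the early check return -EACCES.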


>>>           pci_cfg_access_lock(dev->pdev);
>>>           do {
>>>                   if (retries > VSC_MAX_RETRIES) {
>>>                           ret = -EBUSY;
>>>                           goto pci_unlock;
>>>                   }
>>> -               if (pci_channel_offline(dev->pdev)) {
>>> -                       ret = -EACCES;
>>> -                       goto pci_unlock;
>>> -               }
>>>
>>>                   /* Check if semaphore is already locked */
>>>                   ret = vsc_read(dev, VSC_SEMAPHORE_OFFSET, &lock_val);
>>> --
>>> 2.48.1
>>>