Message-ID: <358e3a7a1e735d5b1e761cf159db8ae735a9578d.camel@linux.ibm.com>
Date: Wed, 15 Oct 2025 16:10:33 +0200
From: Niklas Schnelle <schnelle@...ux.ibm.com>
To: Gerd Bayer <gbayer@...ux.ibm.com>, Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
	Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
	Alexander Gordeev <agordeev@...ux.ibm.com>, Shay Drori <shayd@...dia.com>,
	Jason Gunthorpe <jgg@...pe.ca>
Cc: Tariq Toukan <tariqt@...dia.com>, Saeed Mahameed <saeedm@...dia.com>,
	Leon Romanovsky <leon@...nel.org>, Christian Borntraeger <borntraeger@...ux.ibm.com>,
	Sven Schnelle <svens@...ux.ibm.com>, Pierre Morel <pmorel@...ux.ibm.com>,
	Matthew Rosato <mjrosato@...ux.ibm.com>, linux-s390@...r.kernel.org,
	linux-kernel@...r.kernel.org, netdev@...r.kernel.org, linux-rdma@...r.kernel.org
Subject: Re: [PATCH v2] s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump

On Wed, 2025-10-15 at 13:05 +0200, Gerd Bayer wrote:
> Do not block PCI config accesses through pci_cfg_access_lock() when
> executing the s390 variant of PCI error recovery: Acquire just
> device_lock() instead of pci_dev_lock() as powerpc's EEH and
> generic PCI AER processing do.
>
> During error recovery testing a pair of tasks was reported to be hung:
>
> [10144.859042] mlx5_core 0000:00:00.1: mlx5_health_try_recover:338:(pid 5553): health recovery flow aborted, PCI reads still not working
> [10320.359160] INFO: task kmcheck:72 blocked for more than 122 seconds.
> [10320.359169] Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359172] task:kmcheck state:D stack:0 pid:72 tgid:72 ppid:2 flags:0x00000000
> [10320.359176] Call Trace:
> [10320.359178] [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359187] [<000000065256f356>] schedule+0x36/0xe0
> [10320.359189] [<000000065256f572>] schedule_preempt_disabled+0x22/0x30
> [10320.359192] [<0000000652570a94>] __mutex_lock.constprop.0+0x484/0x8a8
> [10320.359194] [<000003ff800673a4>] mlx5_unload_one+0x34/0x58 [mlx5_core]
> [10320.359360] [<000003ff8006745c>] mlx5_pci_err_detected+0x94/0x140 [mlx5_core]
> [10320.359400] [<0000000652556c5a>] zpci_event_attempt_error_recovery+0xf2/0x398
> [10320.359406] [<0000000651b9184a>] __zpci_event_error+0x23a/0x2c0
> [10320.359411] [<00000006522b3958>] chsc_process_event_information.constprop.0+0x1c8/0x1e8
> [10320.359416] [<00000006522baf1a>] crw_collect_info+0x272/0x338
> [10320.359418] [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359422] [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359425] [<0000000652576642>] ret_from_fork+0xa/0x30
> [10320.359440] INFO: task kworker/u1664:6:1514 blocked for more than 122 seconds.
> [10320.359441] Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359443] task:kworker/u1664:6 state:D stack:0 pid:1514 tgid:1514 ppid:2 flags:0x00000000
> [10320.359447] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
> [10320.359492] Call Trace:
> [10320.359521] [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359524] [<000000065256f356>] schedule+0x36/0xe0
> [10320.359526] [<0000000652172e28>] pci_wait_cfg+0x80/0xe8
> [10320.359532] [<0000000652172f94>] pci_cfg_access_lock+0x74/0x88
> [10320.359534] [<000003ff800916b6>] mlx5_vsc_gw_lock+0x36/0x178 [mlx5_core]
> [10320.359585] [<000003ff80098824>] mlx5_crdump_collect+0x34/0x1c8 [mlx5_core]
> [10320.359637] [<000003ff80074b62>] mlx5_fw_fatal_reporter_dump+0x6a/0xe8 [mlx5_core]
> [10320.359680] [<0000000652512242>] devlink_health_do_dump.part.0+0x82/0x168
> [10320.359683] [<0000000652513212>] devlink_health_report+0x19a/0x230
> [10320.359685] [<000003ff80075a12>] mlx5_fw_fatal_reporter_err_work+0xba/0x1b0 [mlx5_core]
> [10320.359728] [<0000000651bbf852>] process_one_work+0x1c2/0x458
> [10320.359733] [<0000000651bc073e>] worker_thread+0x3ce/0x528
> [10320.359735] [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359737] [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359739] [<0000000652576642>] ret_from_fork+0xa/0x30
I'd tend to prune this a bit; at the very least I would remove the
timestamp prefix from each line.
>
> No kernel log of the exact same error with an upstream kernel is
> available - but the very same deadlock situation can be constructed there,
> too:
>
> - task: kmcheck
> mlx5_unload_one() tries to acquire devlink lock while the PCI error
> recovery code has set pdev->block_cfg_access by way of
> pci_cfg_access_lock()
> - task: kworker
> mlx5_crdump_collect() tries to set block_cfg_access through
> pci_cfg_access_lock() while devlink_health_report() had acquired
> the devlink lock.
>
> A similar deadlock situation can be reproduced by requesting a
> crdump with
> > devlink health dump show pci/<BDF> reporter fw_fatal
>
> while PCI error recovery is executed on the same <BDF> physical function
> by mlx5_core's pci_error_handlers. On s390 this can be injected with
> > zpcictl --reset-fw <BDF>
>
> Tests with this patch failed to reproduce that second deadlock situation:
> the devlink command is rejected with "kernel answers: Permission denied" -
> and we get a kernel log message of:
>
> Oct 14 13:32:39 b46lp03.lnxne.boe kernel: mlx5_core 1ed0:00:00.1: mlx5_crdump_collect:50:(pid 254382): crdump: failed to lock vsc gw err -5
Same as above, I'd remove everything before the "mlx5_core …" as that
adds no relevant information.
>
> because the config read of VSC_SEMAPHORE is rejected by the underlying
> hardware.
>
> Two prior attempts to address this issue have been discussed and
> ultimately rejected [see link], with the primary argument that s390's
> implementation of PCI error recovery is imposing restrictions that
> neither powerpc's EEH nor PCI AER handling need. Tests show that PCI
> error recovery on s390 is running to completion even without blocking
> access to PCI config space.
>
> Link: https://lore.kernel.org/all/20251007144826.2825134-1-gbayer@linux.ibm.com/
>
> Fixes: 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery")
Besides the Fixes tag, let's also add "Cc: stable@...r.kernel.org". You
can put it in place of the empty line between Link and Fixes, since
these usually form an uninterrupted block of tags.
> Signed-off-by: Gerd Bayer <gbayer@...ux.ibm.com>
> ---
> All,
>
> sorry for the immediate v2, but I had to rebase this to a current
> upstream commit since v1 didn't apply cleanly, as Niklas pointed out in
> private. The following assessment from v1 is still valid, though:
>
> Hi Niklas, Shay, Jason,
>
> by now I believe fixing this in s390/pci is the right way to go, since
> the other PCI error recovery implementations apparently don't require
> this strict blocking of accesses to the PCI config space.
>
> Hi Alexander, Vasily, Heiko,
>
> while I sent this to netdev since prior versions were discussed there,
> I assume this patch will go through the s390 tree, right?
>
> Thanks,
> Gerd
>
>
> ---
> arch/s390/pci/pci_event.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
> index b95376041501f479eee20705d45fb8c68553da71..27db1e72c623f8a289cae457e87f0a9896ed241d 100644
> --- a/arch/s390/pci/pci_event.c
> +++ b/arch/s390/pci/pci_event.c
> @@ -188,7 +188,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
> * is unbound or probed and that userspace can't access its
> * configuration space while we perform recovery.
> */
> - pci_dev_lock(pdev);
> + device_lock(&pdev->dev);
> if (pdev->error_state == pci_channel_io_perm_failure) {
> ers_res = PCI_ERS_RESULT_DISCONNECT;
> goto out_unlock;
> @@ -257,7 +257,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
> driver->err_handler->resume(pdev);
> pci_uevent_ers(pdev, PCI_ERS_RESULT_RECOVERED);
> out_unlock:
> - pci_dev_unlock(pdev);
> + device_unlock(&pdev->dev);
> zpci_report_status(zdev, "recovery", status_str);
>
> return ers_res;
Code-wise this looks good to me. I've confirmed that EEH and AER indeed
only take device_lock(), and I don't see a reason why that shouldn't be
enough for us too if it is enough for them. I think I originally picked
pci_dev_lock() because it seemed fitting, but it was probably too big a
hammer.
Reviewed-by: Niklas Schnelle <schnelle@...ux.ibm.com>