Message-ID: <358e3a7a1e735d5b1e761cf159db8ae735a9578d.camel@linux.ibm.com>
Date: Wed, 15 Oct 2025 16:10:33 +0200
From: Niklas Schnelle <schnelle@...ux.ibm.com>
To: Gerd Bayer <gbayer@...ux.ibm.com>, Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
	Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
	Alexander Gordeev <agordeev@...ux.ibm.com>, Shay Drori <shayd@...dia.com>,
	Jason Gunthorpe <jgg@...pe.ca>
Cc: Tariq Toukan <tariqt@...dia.com>, Saeed Mahameed <saeedm@...dia.com>,
	Leon Romanovsky <leon@...nel.org>, Christian Borntraeger <borntraeger@...ux.ibm.com>,
	Sven Schnelle <svens@...ux.ibm.com>, Pierre Morel <pmorel@...ux.ibm.com>,
	Matthew Rosato <mjrosato@...ux.ibm.com>, linux-s390@...r.kernel.org,
	linux-kernel@...r.kernel.org, netdev@...r.kernel.org, linux-rdma@...r.kernel.org
Subject: Re: [PATCH v2] s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump

On Wed, 2025-10-15 at 13:05 +0200, Gerd Bayer wrote:
> Do not block PCI config accesses through pci_cfg_access_lock() when
> executing the s390 variant of PCI error recovery: Acquire just
> device_lock() instead of pci_dev_lock() as powerpc's EEH and
> generic PCI AER processing do.
>
> During error recovery testing a pair of tasks was reported to be hung:
>
> [10144.859042] mlx5_core 0000:00:00.1: mlx5_health_try_recover:338:(pid 5553): health recovery flow aborted, PCI reads still not working
> [10320.359160] INFO: task kmcheck:72 blocked for more than 122 seconds.
> [10320.359169] Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359172] task:kmcheck state:D stack:0 pid:72 tgid:72 ppid:2 flags:0x00000000
> [10320.359176] Call Trace:
> [10320.359178] [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359187] [<000000065256f356>] schedule+0x36/0xe0
> [10320.359189] [<000000065256f572>] schedule_preempt_disabled+0x22/0x30
> [10320.359192] [<0000000652570a94>] __mutex_lock.constprop.0+0x484/0x8a8
> [10320.359194] [<000003ff800673a4>] mlx5_unload_one+0x34/0x58 [mlx5_core]
> [10320.359360] [<000003ff8006745c>] mlx5_pci_err_detected+0x94/0x140 [mlx5_core]
> [10320.359400] [<0000000652556c5a>] zpci_event_attempt_error_recovery+0xf2/0x398
> [10320.359406] [<0000000651b9184a>] __zpci_event_error+0x23a/0x2c0
> [10320.359411] [<00000006522b3958>] chsc_process_event_information.constprop.0+0x1c8/0x1e8
> [10320.359416] [<00000006522baf1a>] crw_collect_info+0x272/0x338
> [10320.359418] [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359422] [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359425] [<0000000652576642>] ret_from_fork+0xa/0x30
> [10320.359440] INFO: task kworker/u1664:6:1514 blocked for more than 122 seconds.
> [10320.359441] Not tainted 5.14.0-570.12.1.bringup7.el9.s390x #1
> [10320.359442] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [10320.359443] task:kworker/u1664:6 state:D stack:0 pid:1514 tgid:1514 ppid:2 flags:0x00000000
> [10320.359447] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
> [10320.359492] Call Trace:
> [10320.359521] [<000000065256f030>] __schedule+0x2a0/0x590
> [10320.359524] [<000000065256f356>] schedule+0x36/0xe0
> [10320.359526] [<0000000652172e28>] pci_wait_cfg+0x80/0xe8
> [10320.359532] [<0000000652172f94>] pci_cfg_access_lock+0x74/0x88
> [10320.359534] [<000003ff800916b6>] mlx5_vsc_gw_lock+0x36/0x178 [mlx5_core]
> [10320.359585] [<000003ff80098824>] mlx5_crdump_collect+0x34/0x1c8 [mlx5_core]
> [10320.359637] [<000003ff80074b62>] mlx5_fw_fatal_reporter_dump+0x6a/0xe8 [mlx5_core]
> [10320.359680] [<0000000652512242>] devlink_health_do_dump.part.0+0x82/0x168
> [10320.359683] [<0000000652513212>] devlink_health_report+0x19a/0x230
> [10320.359685] [<000003ff80075a12>] mlx5_fw_fatal_reporter_err_work+0xba/0x1b0 [mlx5_core]
> [10320.359728] [<0000000651bbf852>] process_one_work+0x1c2/0x458
> [10320.359733] [<0000000651bc073e>] worker_thread+0x3ce/0x528
> [10320.359735] [<0000000651bc9de0>] kthread+0x108/0x110
> [10320.359737] [<0000000651b42ea4>] __ret_from_fork+0x3c/0x58
> [10320.359739] [<0000000652576642>] ret_from_fork+0xa/0x30
I'd tend to prune this a bit; at the very least I would remove the
timestamp prefix from each line.
>
> No kernel log of the exact same error with an upstream kernel is
> available - but the very same deadlock situation can be constructed there,
> too:
>
> - task: kmcheck
> mlx5_unload_one() tries to acquire devlink lock while the PCI error
> recovery code has set pdev->block_cfg_access by way of
> pci_cfg_access_lock()
> - task: kworker
> mlx5_crdump_collect() tries to set block_cfg_access through
> pci_cfg_access_lock() while devlink_health_report() had acquired
> the devlink lock.
>
> A similar deadlock situation can be reproduced by requesting a
> crdump with
> > devlink health dump show pci/<BDF> reporter fw_fatal
>
> while PCI error recovery is executed on the same <BDF> physical function
> by mlx5_core's pci_error_handlers. On s390 this can be injected with
> > zpcictl --reset-fw <BDF>
>
> Tests with this patch failed to reproduce that second deadlock situation:
> the devlink command is rejected with "kernel answers: Permission denied" -
> and we get a kernel log message of:
>
> Oct 14 13:32:39 b46lp03.lnxne.boe kernel: mlx5_core 1ed0:00:00.1: mlx5_crdump_collect:50:(pid 254382): crdump: failed to lock vsc gw err -5
Same as above, I'd remove everything before the "mlx5_core …" as that
adds no relevant information.
>
> because the config read of VSC_SEMAPHORE is rejected by the underlying
> hardware.
>
> Two prior attempts to address this issue have been discussed and
> ultimately rejected [see link], with the primary argument that s390's
> implementation of PCI error recovery is imposing restrictions that
> neither powerpc's EEH nor PCI AER handling need. Tests show that PCI
> error recovery on s390 is running to completion even without blocking
> access to PCI config space.
>
> Link: https://lore.kernel.org/all/20251007144826.2825134-1-gbayer@linux.ibm.com/
>
> Fixes: 4cdf2f4e24ff ("s390/pci: implement minimal PCI error recovery")
Besides the Fixes tag, let's also add "Cc: stable@...r.kernel.org". You
can put it in place of the empty line between Link and Fixes, since
these usually form an uninterrupted block of tags.
> Signed-off-by: Gerd Bayer <gbayer@...ux.ibm.com>
> ---
> All,
>
> sorry for the immediate v2, but I had to rebase this to a current
> upstream commit since v1 didn't apply cleanly, as Niklas pointed out in
> private. The following assessment from v1 is still valid, though:
>
> Hi Niklas, Shay, Jason,
>
> by now I believe fixing this in s390/pci is the right way to go, since
> the other PCI error recovery implementations apparently don't require
> this strict blocking of accesses to the PCI config space.
>
> Hi Alexander, Vasily, Heiko,
>
> while I sent this to netdev since prior versions were discussed there,
> I assume this patch will go through the s390 tree, right?
>
> Thanks,
> Gerd
>
>
> ---
> arch/s390/pci/pci_event.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/s390/pci/pci_event.c b/arch/s390/pci/pci_event.c
> index b95376041501f479eee20705d45fb8c68553da71..27db1e72c623f8a289cae457e87f0a9896ed241d 100644
> --- a/arch/s390/pci/pci_event.c
> +++ b/arch/s390/pci/pci_event.c
> @@ -188,7 +188,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
> * is unbound or probed and that userspace can't access its
> * configuration space while we perform recovery.
> */
> - pci_dev_lock(pdev);
> + device_lock(&pdev->dev);
> if (pdev->error_state == pci_channel_io_perm_failure) {
> ers_res = PCI_ERS_RESULT_DISCONNECT;
> goto out_unlock;
> @@ -257,7 +257,7 @@ static pci_ers_result_t zpci_event_attempt_error_recovery(struct pci_dev *pdev)
> driver->err_handler->resume(pdev);
> pci_uevent_ers(pdev, PCI_ERS_RESULT_RECOVERED);
> out_unlock:
> - pci_dev_unlock(pdev);
> + device_unlock(&pdev->dev);
> zpci_report_status(zdev, "recovery", status_str);
>
> return ers_res;
Code-wise this looks good to me. I've confirmed that EEH and AER indeed
only take device_lock(), and I don't see a reason why that shouldn't be
enough for us too if it is enough for them. I think I originally picked
pci_dev_lock() because it seemed fitting, but it was probably too big a
hammer.
Reviewed-by: Niklas Schnelle <schnelle@...ux.ibm.com>