linux-kernel - Re: mhi resume failure on reboot with 6.13-rc2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20241218113830.efflnkxz7qu24xux@thinkpad>
Date: Wed, 18 Dec 2024 17:08:30 +0530
From: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
To: Johan Hovold <johan@...nel.org>
Cc: mhi@...ts.linux.dev, linux-arm-msm@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Loic Poulain <loic.poulain@...aro.org>
Subject: Re: mhi resume failure on reboot with 6.13-rc2

On Wed, Dec 18, 2024 at 09:40:45AM +0100, Johan Hovold wrote:
> On Mon, Dec 16, 2024 at 07:43:03PM +0530, Manivannan Sadhasivam wrote:
> > On Mon, Dec 16, 2024 at 02:20:09PM +0100, Johan Hovold wrote:
> > > On Mon, Dec 16, 2024 at 01:10:21PM +0530, Manivannan Sadhasivam wrote:
> > > > On Wed, Dec 11, 2024 at 04:03:59PM +0100, Johan Hovold wrote:
> 
> > > I just hit the issue again and can confirm that it does block
> > > reboot/shutdown forever (I've been waiting for 20 minutes now).
> > 
> > Ah, that's bad.
> > 
> > > Judging from a quick look at the code, "Wait for device to enter SBL or
> > > Mission mode" is printed by mhi_fw_load_handler(), which in turn is only
> > > called from the mhi_pm_st_worker() state machine.
> > > 
> > > I can't seem to find anything that makes sure that the next state is
> > > ever reached, so regardless of the cause of the modem fw crash
> > 
> > This code will make sure:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/bus/mhi/host/pm.c?h=v6.13-rc1#n1264
> > 
> > But then it doesn't print the error and returns -ETIMEDOUT to the caller after
> > powering down MHI. The caller (mhi_pci_recovery_work), in the case of failure,
> > unprepares MHI and starts function level recovery.
> > 
> > > (if
> > > that's what it is) the hung reboot appears to be a bug in mhi.
> 
> I've tracked down the hang to a deadlock on the parent device lock.
> 
> Driver core takes the parent device lock before calling shutdown(), and
> then mhi_pci_shutdown() waits indefinitely for the recovery thread to
> finish.
> 
> But the mhi recovery thread ends up trying to take the same parent
> device lock in pci_reset_function() when recovery fails:
> 
> [  339.351915] shutdown[1]: Rebooting.
> [  339.724498] arm-smmu 3da0000.iommu: disabling translation
> [  339.760134] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> [  339.766211] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> [  339.773158] mhi-pci-generic 0005:01:00.0: device recovery started
> 
> The recovery thread is running before shutdown() is called.
> 
> [  339.779638] mhi-pci-generic 0005:01:00.0: __mhi_power_down
> [  339.779650] mhi-pci-generic 0005:01:00.0: mhi_pci_shutdown
> [  339.785422] wwan wwan0: port wwan0qcdm0 disconnected
> [  339.791001] mhi-pci-generic 0005:01:00.0: mhi_pci_remove
> [  339.791006] mhi-pci-generic 0005:01:00.0: mhi_pci_remove - cancel work sync
> 
> shutdown() waits for the recovery thread to finish
> 
> [  339.825892] wwan wwan0: port wwan0mbim0 disconnected
> [  339.831320] wwan wwan0: port wwan0qmi0 disconnected
> [  339.904249] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
> [  340.025390] mhi mhi0: Requested to power ON
> [  340.233771] mhi mhi0: Power on setup success
> [  340.233954] mhi mhi0: Wait for device to enter SBL or Mission mode
> [  340.238272] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event timeout_ms = 8000
> [  348.400082] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event returns, ret = -110
> 
> The recovery thread fails to power up the device.
> 
> [  348.419967] mhi-pci-generic 0005:01:00.0: __mhi_power_down
> [  348.472665] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
> [  348.725069] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - returns
> [  348.742644] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - mhi unprepare after power down
> [  348.762737] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - pci reset
> [  348.780904] mhi-pci-generic 0005:01:00.0: pci_reset_function
> 
> And tries to reset the device, which triggers the deadlock when
> trying to take the already held parent (bridge) device lock.
> 

Thanks for tracking the deadlock. I think we should use pci_try_reset_function()
instead of pci_reset_function() in mhi_pci_recovery_work().

If the pci_dev_lock() is already taken, it will return with -EAGAIN and we do
not need to worry in that case since the host is going to be powered off anyway
(and so the device).

- Mani

-- 
மணிவண்ணன் சதாசிவம்