linux-kernel - Re: [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device synchronously from mhi_pci_runtime

Message-ID: <Z5ENq9EMPlNvxNOF@hovoldconsulting.com>
Date: Wed, 22 Jan 2025 16:24:27 +0100
From: Johan Hovold <johan@...nel.org>
To: manivannan.sadhasivam@...aro.org
Cc: mhi@...ts.linux.dev, Loic Poulain <loic.poulain@...aro.org>,
	linux-arm-msm@...r.kernel.org, linux-kernel@...r.kernel.org,
	stable@...r.kernel.org
Subject: Re: [PATCH 2/2] bus: mhi: host: pci_generic: Recover the device
 synchronously from mhi_pci_runtime_resume()

On Wed, Jan 08, 2025 at 07:09:28PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> From: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
> 
> Currently, in mhi_pci_runtime_resume(), if the resume fails, recovery_work
> is started asynchronously and success is returned. But this doesn't align
> with what PM core expects as documented in
> Documentation/power/runtime_pm.rst:
> 
> "Once the subsystem-level resume callback (or the driver resume callback,
> if invoked directly) has completed successfully, the PM core regards the
> device as fully operational, which means that the device _must_ be able to
> complete I/O operations as needed.  The runtime PM status of the device is
> then 'active'."
> 
> So the PM core ends up marking the runtime PM status of the device as
> 'active', even though the device is not able to handle the I/O operations.
> This same condition more or less applies to system resume as well.
> 
> So to avoid this ambiguity, try to recover the device synchronously from
> mhi_pci_runtime_resume() and return the actual error code in the case of
> recovery failure.
> 
> For doing so, move the recovery code to __mhi_pci_recovery_work() helper
> and call that from both mhi_pci_recovery_work() and
> mhi_pci_runtime_resume(). Former still ignores the return value, while the
> latter passes it to PM core.
> 
> Cc: stable@...r.kernel.org # 5.13
> Reported-by: Johan Hovold <johan@...nel.org>
> Closes: https://lore.kernel.org/mhi/Z2PbEPYpqFfrLSJi@hovoldconsulting.com
> Fixes: d3800c1dce24 ("bus: mhi: pci_generic: Add support for runtime PM")
> Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
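
The refactoring described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the driver code: `struct fake_mhi_dev`, `recover_device()`, `recovery_work()`, `runtime_resume()` and the `-EIO` failure code are all hypothetical stand-ins. Only the shape (one helper that returns a status, an asynchronous caller that ignores it, and a resume caller that propagates it) comes from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver state; not the real MHI structures. */
struct fake_mhi_dev {
	int recovery_should_fail;	/* knob used to simulate a failed recovery */
	int powered_up;			/* 1 once the modelled device is usable */
};

/*
 * Models the role of the __mhi_pci_recovery_work() helper: perform the
 * recovery and report whether it worked. -EIO is an assumed error code.
 */
static int recover_device(struct fake_mhi_dev *dev)
{
	assert(dev != NULL);

	if (dev->recovery_should_fail) {
		dev->powered_up = 0;
		return -EIO;
	}

	dev->powered_up = 1;
	return 0;
}

/* Models the asynchronous work path: the result is deliberately ignored. */
static void recovery_work(struct fake_mhi_dev *dev)
{
	(void)recover_device(dev);
}

/*
 * Models the runtime resume path: if the normal resume failed, recover
 * synchronously and hand the recovery result back to the caller (the PM
 * core in the real driver), instead of scheduling work and returning 0.
 */
static int runtime_resume(struct fake_mhi_dev *dev, int resume_err)
{
	if (!resume_err)
		return 0;

	return recover_device(dev);
}
```

With this shape, a caller of `runtime_resume()` sees success only when the modelled device is actually usable, which is the property the quoted runtime PM documentation asks for.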

Reasoning above makes sense, and I do indeed see resume taking five
seconds longer with this patch as Loic suggested it would.

Unfortunately, something else is broken as the recovery code now
deadlocks again when the modem fails to resume (with both patches
applied):

[  729.833701] PM: suspend entry (deep)
[  729.841377] Filesystems sync: 0.000 seconds
[  729.867672] Freezing user space processes
[  729.869494] Freezing user space processes completed (elapsed 0.001 seconds)
[  729.869499] OOM killer disabled.
[  729.869501] Freezing remaining freezable tasks
[  729.870882] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[  730.184254] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume
[  730.190643] mhi mhi0: Resuming from non M3 state (SYS ERROR)
[  730.196587] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
[  730.203412] mhi-pci-generic 0005:01:00.0: device recovery started

I've reproduced this three times in three different paths (runtime
resume before suspend; runtime resume during suspend; and during system
resume).

I didn't try to figure out what causes the deadlock this time (and lockdep
does not trigger), but you should be able to reproduce this by
instrumenting a resume failure.

Johan