linux-kernel - Re: mhi resume failure on reboot with 6.13-rc2

Message-ID: <Z2PbEPYpqFfrLSJi@hovoldconsulting.com>
Date: Thu, 19 Dec 2024 09:36:32 +0100
From: Johan Hovold <johan@...nel.org>
To: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
Cc: mhi@...ts.linux.dev, linux-arm-msm@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Loic Poulain <loic.poulain@...aro.org>
Subject: Re: mhi resume failure on reboot with 6.13-rc2

On Thu, Dec 19, 2024 at 12:05:55AM +0530, Manivannan Sadhasivam wrote:
> On Wed, Dec 18, 2024 at 03:26:38PM +0100, Johan Hovold wrote:
> > On Wed, Dec 18, 2024 at 07:39:10PM +0530, Manivannan Sadhasivam wrote:
> > > On Wed, Dec 18, 2024 at 02:55:02PM +0100, Johan Hovold wrote:

> > > > But that's not going to happen as that reset is what is currently
> > > > causing the deadlock and which would simply be skipped if you switch to
> > > > pci_try_reset_function().
> > > > 
> > > 
> > > mhi_pci_runtime_resume() will queue the recovery_work() and return. So I was
> > > hoping that by the time pci_try_reset_function() is called, the lock would be
> > > available.
> > 
> > We can't rely on luck with timings, and this is the very reason for the
> > deadlock I'm currently seeing (i.e. the recovery thread is still running
> > when another thread grabs the lock and waits for the recovery thread to
> > finish).
> > 
> > Perhaps the recovery work should be done synchronously in the resume
> > handler to avoid such issues.
> 
> Synchronously? How can that help when the recovery_work() cannot acquire the
> lock?

During system suspend, the PM core waits for any ongoing runtime resume
operations to complete before taking the device lock and suspending the
device.

Unfortunately, that's currently not the case during shutdown(), where
the order of those operations is reversed, so that would indeed need to
be addressed first.
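
The ordering problem described above can be modelled in a few lines of
plain C. This is only an illustrative sketch, not kernel code: the enum,
the function name and the resume_needs_lock flag are hypothetical, and
the model assumes that "reversed" means the device lock is taken before
waiting for the runtime resume work to finish.

```c
#include <stdbool.h>

/* Hypothetical names for the two steps discussed in the thread. */
enum model_step {
	MODEL_WAIT_FOR_RESUME,	/* wait for ongoing runtime resume work */
	MODEL_TAKE_DEVICE_LOCK,	/* take the device lock */
};

/*
 * Returns true if the two steps can complete in the given order, false
 * if the caller would end up waiting (while holding the lock) for work
 * that itself needs the lock.
 */
bool model_sequence_completes(enum model_step first, bool resume_needs_lock)
{
	enum model_step order[2];
	bool lock_held = false;

	order[0] = first;
	order[1] = (first == MODEL_WAIT_FOR_RESUME) ?
			MODEL_TAKE_DEVICE_LOCK : MODEL_WAIT_FOR_RESUME;

	for (int i = 0; i < 2; i++) {
		if (order[i] == MODEL_TAKE_DEVICE_LOCK)
			lock_held = true;
		else if (lock_held && resume_needs_lock)
			return false;	/* waiter holds what the work needs */
	}

	return true;
}
```

In this model, waiting first and locking second always completes, while
locking first only completes if the pending work never needs the lock.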

But what the driver is currently doing looks highly questionable as it
returns success when it failed to resume the device (after scheduling
the asynchronous recovery work).
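
To make the difference concrete, here is a small user-space sketch that
contrasts the two behaviours: reporting success after merely queuing
recovery versus recovering synchronously and propagating the result.
Everything in it (struct model_dev and the model_*() helpers) is
hypothetical and only stands in for the driver's resume path; it is not
the actual MHI code and not a proposed patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device state; not a kernel structure. */
struct model_dev {
	bool hw_resume_ok;	/* does the plain resume attempt succeed? */
	bool recovery_ok;	/* would recovery bring the device back? */
	bool recovery_queued;	/* recovery scheduled but not yet run */
	bool usable;		/* device is operational right now */
};

static int model_hw_resume(struct model_dev *d)
{
	return d->hw_resume_ok ? 0 : -EIO;
}

static int model_recover(struct model_dev *d)
{
	d->usable = d->recovery_ok;
	return d->usable ? 0 : -EIO;
}

/* Questioned pattern: queue recovery, then report success regardless. */
int model_resume_async(struct model_dev *d)
{
	if (model_hw_resume(d) == 0) {
		d->usable = true;
		return 0;
	}

	d->recovery_queued = true;
	return 0;	/* caller is told the device resumed */
}

/* Alternative: recover before returning and propagate any failure. */
int model_resume_sync(struct model_dev *d)
{
	if (model_hw_resume(d) == 0) {
		d->usable = true;
		return 0;
	}

	return model_recover(d);
}

/* Usage example exercising both variants on a device that cannot resume. */
void model_demo(void)
{
	struct model_dev a = { .hw_resume_ok = false, .recovery_ok = false };
	struct model_dev b = { .hw_resume_ok = false, .recovery_ok = false };

	assert(model_resume_async(&a) == 0);	/* success reported ... */
	assert(!a.usable && a.recovery_queued);	/* ... but device is down */

	assert(model_resume_sync(&b) == -EIO);	/* failure is visible */
	assert(!b.usable);
}
```

With the asynchronous variant the caller cannot tell that the device is
still unusable, which is the concern raised above.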

Johan