linux-kernel - Re: [PATCH 0/2] PM: runtime: Fix potential I/O hang

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAJZ5v0jwKaCDjzgo-AP62DDd=Nww9KbYp6tUxuQQ-Hp11PXMig@mail.gmail.com>
Date: Wed, 26 Nov 2025 18:34:35 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Bart Van Assche <bvanassche@....org>, Yang Yang <yang.yang@...o.com>
Cc: Jens Axboe <axboe@...nel.dk>, Pavel Machek <pavel@...nel.org>, Len Brown <lenb@...nel.org>, 
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>, Danilo Krummrich <dakr@...nel.org>, 
	linux-block@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-pm@...r.kernel.org
Subject: Re: [PATCH 0/2] PM: runtime: Fix potential I/O hang

On Wed, Nov 26, 2025 at 6:21 PM Rafael J. Wysocki <rafael@...nel.org> wrote:
>
> On Wed, Nov 26, 2025 at 5:59 PM Rafael J. Wysocki <rafael@...nel.org> wrote:
> >
> > On Wed, Nov 26, 2025 at 4:48 PM Bart Van Assche <bvanassche@....org> wrote:
> > >
> > > On 11/26/25 3:31 AM, Rafael J. Wysocki wrote:
> > > > Please address the issue differently.
> > >
> > > It seems unfortunate to me that __pm_runtime_barrier() can cause pm_request_resume() to hang.
> >
> > I wouldn't call it a hang.
> >
> > __pm_runtime_barrier() removes the work item queued by
> > pm_request_resume(), but at the time when it is called, which is
> > device_suspend_late(), the work item queued by pm_request_resume()
> > cannot make progress anyway.  It will only be able to make progress
> > when the PM workqueue is unfrozen at the end of the system resume
> > transition.
> >
> > > Would it be safe to remove the
> > > cancel_work_sync() call from __pm_runtime_barrier() since
> > > pm_runtime_work() calls functions that check disable_depth
> > > when processing RPM_REQ_SUSPEND and RPM_REQ_AUTOSUSPEND? Would
> > > this be sufficient to fix the reported deadlock?
> >
> > If you want the resume work item to survive the system suspend/resume
> > cycle, __pm_runtime_disable() may be changed to make that happen, but
> > this still will not allow the work to make progress until the system
> > resume ends.
> >
> > I'm not sure if this would help to address the issue at hand though.
>
> I actually have a better idea: Why don't we resume all devices that
> have runtime resume work items pending at the time when
> device_suspend() is called?
>
> Arguably, somebody wanted them to runtime-resume, so they should be
> resumed before being prepared for system suspend and that will
> eliminate the issue at hand (because devices cannot suspend during
> system suspend/resume).

Wait, there is a pm_runtime_barrier() call in device_suspend() that
does just that and additionally it calls __pm_runtime_barrier(), so
all of the pending runtime PM work items should be cancelled by it.

So it looks like the device in question is runtime-suspended at that
point and only later blk_pm_resume_queue() is called to resume it.
I'm wondering where it is called from.  And maybe pm_runtime_resume()
should be called for it from its ->suspend() callback?