Message-ID: <CAJZ5v0hm=jfSyBXF0qMYnpATJf56JTxQ-+4JBy3YMjS0cMUMHg@mail.gmail.com>
Date: Mon, 1 Dec 2025 19:47:46 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: YangYang <yang.yang@...o.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Bart Van Assche <bvanassche@....org>, Jens Axboe <axboe@...nel.dk>, 
	Pavel Machek <pavel@...nel.org>, Len Brown <lenb@...nel.org>, 
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>, Danilo Krummrich <dakr@...nel.org>, 
	linux-block@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-pm@...r.kernel.org
Subject: Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume
 and runtime disable

On Mon, Dec 1, 2025 at 10:46 AM YangYang <yang.yang@...o.com> wrote:
>
> On 2025/11/27 20:34, Rafael J. Wysocki wrote:
> > On Wed, Nov 26, 2025 at 11:47 PM Bart Van Assche <bvanassche@....org> wrote:
> >>
> >> On 11/26/25 1:30 PM, Rafael J. Wysocki wrote:
> >>> On Wed, Nov 26, 2025 at 10:11 PM Bart Van Assche <bvanassche@....org> wrote:
> >>>>
> >>>> On 11/26/25 12:17 PM, Rafael J. Wysocki wrote:
> >>>>> --- a/block/blk-core.c
> >>>>> +++ b/block/blk-core.c
> >>>>> @@ -309,6 +309,8 @@ int blk_queue_enter(struct request_queue
> >>>>>                 if (flags & BLK_MQ_REQ_NOWAIT)
> >>>>>                         return -EAGAIN;
> >>>>>
> >>>>> +             /* if necessary, resume .dev (assume success). */
> >>>>> +             blk_pm_resume_queue(pm, q);
> >>>>>                 /*
> >>>>>                  * read pair of barrier in blk_freeze_queue_start(), we need to
> >>>>>                  * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and
> >>>>
> >>>> blk_queue_enter() may be called from the suspend path so I don't think
> >>>> that the above change will work.
> >>>
> >>> Why would the existing code work then?
> >>
> >> The existing code works reliably on a very large number of devices.
> >
> > Well, except that it doesn't work during system suspend and
> > hibernation when the PM workqueue is frozen.  I think that we agree
> > here.
> >
> > This needs to be addressed because it may very well cause system
> > suspend to deadlock.
> >
> > There are two possible ways to address it I can think of:
> >
> > 1. Changing blk_pm_resume_queue() and its users to carry out a
> > synchronous resume of q->dev instead of calling pm_request_resume()
> > and (effectively) waiting for the queued-up runtime resume of q->dev
> > to take effect.
> >
> > This would be my preferred option, but at this point I'm not sure if
> > it's viable.
> >
>
> After __pm_runtime_disable() is called from device_suspend_late(),
> dev->power.disable_depth is set, preventing rpm_resume() from making
> progress until the system resume completes, regardless of whether
> rpm_resume() is invoked synchronously or asynchronously.

This isn't factually correct.  rpm_resume() will make progress when
runtime PM is disabled, but it will not resume the target device.
That's what disabling runtime PM means.

Of course, when runtime PM is disabled for the given device,
rpm_resume() will return an error code that can be checked.  However,
if pm_request_resume() is called first and runtime PM is then disabled
for the device before the work item queued by pm_request_resume()
runs, the failure will be silent from the caller's perspective.
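
To illustrate the difference, a minimal sketch (not a proposed patch;
the exact error value is from memory of rpm_resume() and may differ
between kernel versions):

	/*
	 * Asynchronous: queues a resume request on the PM workqueue and
	 * returns immediately.  If runtime PM is disabled (or the
	 * workqueue is frozen) before the work item runs, the device is
	 * never resumed and the caller is not notified.
	 */
	pm_request_resume(dev);

	/*
	 * Synchronous: carries out the resume in the calling context.
	 * If runtime PM has been disabled in the meantime, it fails
	 * with a negative error code (typically -EACCES) that the
	 * caller can act on.
	 */
	error = pm_runtime_resume(dev);
	if (error < 0)
		return error;	/* dev was not resumed */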

> Performing a synchronous resume of q->dev seems to have a similar
> effect to removing the following code block from
> __pm_runtime_barrier(), which is invoked by __pm_runtime_disable():
>
> 	if (dev->power.request_pending) {
> 		dev->power.request = RPM_REQ_NONE;
> 		spin_unlock_irq(&dev->power.lock);
>
> 		cancel_work_sync(&dev->power.work);
>
> 		spin_lock_irq(&dev->power.lock);
> 		dev->power.request_pending = false;
> 	}

It is different.

First of all, synchronous runtime resume is not affected by the
freezing of the runtime PM workqueue.  Next, see the remark above
regarding returning an error code.  Finally, so long as
__pm_runtime_resume() acquires power.lock before
__pm_runtime_disable(), the synchronous resume will be waited for by
the latter.
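
As a rough sketch of option 1 above, assuming the current shape of
blk_pm_resume_queue() in block/blk-pm.h (illustrative only, not a
tested patch; the return-value semantics on failure would need to be
worked out):

	static inline int blk_pm_resume_queue(const bool pm,
					      struct request_queue *q)
	{
		if (!q->dev || !blk_queue_pm_only(q))
			return 1;	/* nothing to do */
		if (pm && q->rpm_status != RPM_SUSPENDED)
			return 1;	/* request allowed */
		/*
		 * Resume synchronously instead of calling
		 * pm_request_resume(): this does not depend on the
		 * (possibly frozen) PM workqueue and reports failure
		 * to the caller.
		 */
		return pm_runtime_resume(q->dev) < 0 ? 0 : 1;
	}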

Generally speaking, if blk_queue_enter() or __bio_queue_enter() may
run in parallel with device_suspend_late() for q->dev, the driver of
that device is defective, because it is responsible for preventing
this situation from happening.  The most straightforward way to
achieve that is to provide a .suspend() callback for q->dev that will
runtime-resume it (and, of course, q->dev will need to be prepared for
system suspend as appropriate after that).
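
For illustration, such a callback could look roughly like this (the
"foo" names are hypothetical and the quiescing part is entirely
device-specific):

	static int foo_suspend(struct device *dev)
	{
		int error;

		/*
		 * Runtime-resume the device while runtime PM is still
		 * enabled (it is only disabled later, in
		 * device_suspend_late()), so blk_queue_enter() never
		 * needs to resume it during the suspend transition.
		 */
		error = pm_runtime_resume(dev);
		if (error < 0)
			return error;

		/* ... then quiesce the device for system suspend ... */
		return 0;
	}

	static const struct dev_pm_ops foo_pm_ops = {
		.suspend = foo_suspend,
	};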

If blk_queue_enter() or __bio_queue_enter() is allowed to race with
disabling runtime PM for q->dev, failure to resume q->dev is always
possible and there are no changes that can be made to
pm_runtime_disable() to prevent that from happening.  If
__pm_runtime_disable() wins the race, it will increment
power.disable_depth and rpm_resume() will bail out when it sees that
no matter what.
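
For reference, the check in question in rpm_resume() looks roughly
like this (paraphrased from drivers/base/power/runtime.c; the exact
details vary between kernel versions):

	if (dev->power.disable_depth > 0) {
		if (dev->power.runtime_status == RPM_ACTIVE &&
		    dev->power.last_status == RPM_ACTIVE)
			retval = 1;	/* already active, nothing to do */
		else
			retval = -EACCES;	/* disabled, refuse to resume */
	}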

You should not conflate "runtime PM doesn't work when it is disabled"
with "asynchronous runtime PM doesn't work after freezing the PM
workqueue".  They are both true, but they are not the same.
