linux-kernel - Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume and runtime disable

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAJZ5v0iSgrLzsjh+bvF2=rxxhYcBetJ6V-joWaQud4ahkm1GkQ@mail.gmail.com>
Date: Wed, 26 Nov 2025 13:36:27 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: YangYang <yang.yang@...o.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Jens Axboe <axboe@...nel.dk>, Pavel Machek <pavel@...nel.org>, 
	Len Brown <lenb@...nel.org>, Greg Kroah-Hartman <gregkh@...uxfoundation.org>, 
	Danilo Krummrich <dakr@...nel.org>, linux-block@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-pm@...r.kernel.org
Subject: Re: [PATCH 1/2] PM: runtime: Fix I/O hang due to race between resume
 and runtime disable

On Wed, Nov 26, 2025 at 12:59 PM YangYang <yang.yang@...o.com> wrote:
>
> On 2025/11/26 19:30, Rafael J. Wysocki wrote:
> > On Wed, Nov 26, 2025 at 11:17 AM Yang Yang <yang.yang@...o.com> wrote:
> >>
> >> We observed the following hung task during our test:
> >>
> >> [ 3987.095999] INFO: task "kworker/u32:7":239 blocked for more than 188 seconds.
> >> [ 3987.096017] task:kworker/u32:7   state:D stack:0     pid:239   tgid:239   ppid:2      flags:0x00000408
> >> [ 3987.096042] Workqueue: writeback wb_workfn (flush-254:59)
> >> [ 3987.096069] Call trace:
> >> [ 3987.096073]  __switch_to+0x1a0/0x318
> >> [ 3987.096089]  __schedule+0xa38/0xf9c
> >> [ 3987.096104]  schedule+0x74/0x10c
> >> [ 3987.096118]  __bio_queue_enter+0xb8/0x178
> >> [ 3987.096132]  blk_mq_submit_bio+0x104/0x728
> >> [ 3987.096145]  __submit_bio+0xa0/0x23c
> >> [ 3987.096159]  submit_bio_noacct_nocheck+0x164/0x330
> >> [ 3987.096173]  submit_bio_noacct+0x348/0x468
> >> [ 3987.096186]  submit_bio+0x17c/0x198
> >> [ 3987.096199]  f2fs_submit_write_bio+0x44/0xe8
> >> [ 3987.096211]  __submit_merged_bio+0x40/0x11c
> >> [ 3987.096222]  __submit_merged_write_cond+0xcc/0x1f8
> >> [ 3987.096233]  f2fs_write_data_pages+0xbb8/0xd0c
> >> [ 3987.096246]  do_writepages+0xe0/0x2f4
> >> [ 3987.096255]  __writeback_single_inode+0x44/0x4ac
> >> [ 3987.096272]  writeback_sb_inodes+0x30c/0x538
> >> [ 3987.096289]  __writeback_inodes_wb+0x9c/0xec
> >> [ 3987.096305]  wb_writeback+0x158/0x440
> >> [ 3987.096321]  wb_workfn+0x388/0x5d4
> >> [ 3987.096335]  process_scheduled_works+0x1c4/0x45c
> >> [ 3987.096346]  worker_thread+0x32c/0x3e8
> >> [ 3987.096356]  kthread+0x11c/0x1b0
> >> [ 3987.096372]  ret_from_fork+0x10/0x20
> >>
> >>   T1:                                   T2:
> >>   blk_queue_enter
> >>   blk_pm_resume_queue
> >>   pm_request_resume
> >
> > Shouldn't this be pm_runtime_resume() rather?
>
> I'm not sure about that, I'll check if pm_runtime_resume() should be
> used here instead.

Well, the code as is now schedules an async resume of the device and
then waits for it to complete.  It would be more straightforward to
resume the device synchronously IMV.

> >
> >>   __pm_runtime_resume(dev, RPM_ASYNC)
> >>   rpm_resume                            __pm_runtime_disable
> >>   dev->power.request_pending = true     dev->power.disable_depth++
> >>   queue_work(pm_wq, &dev->power.work)   __pm_runtime_barrier
> >>   wait_event                            cancel_work_sync(&dev->power.work)
> >>
> >> T1 queues the work item, which is then cancelled by T2 before it starts
> >> execution. As a result, q->dev cannot be resumed, and T1 waits here for
> >> a long time.
> >>
> >> Signed-off-by: Yang Yang <yang.yang@...o.com>
> >> ---
> >>   drivers/base/power/runtime.c | 3 ++-
> >>   include/linux/pm.h           | 1 +
> >>   2 files changed, 3 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
> >> index 1b11a3cd4acc..fc9bf3fb3bb7 100644
> >> --- a/drivers/base/power/runtime.c
> >> +++ b/drivers/base/power/runtime.c
> >> @@ -1533,7 +1533,8 @@ void __pm_runtime_disable(struct device *dev, bool check_resume)
> >>           * means there probably is some I/O to process and disabling runtime PM
> >>           * shouldn't prevent the device from processing the I/O.
> >>           */
> >> -       if (check_resume && dev->power.request_pending &&
> >> +       if ((check_resume || dev->power.force_check_resume) &&
> >> +           dev->power.request_pending &&
> >>              dev->power.request == RPM_REQ_RESUME) {
> >>                  /*
> >>                   * Prevent suspends and idle notifications from being carried
> >
> > There are only two cases in which false is passed to
> > __pm_runtime_disable(), one is in device_suspend_late() and I don't
> > think that's relevant here, and the other is in pm_runtime_remove()
> > and that's get called when the device is going away.
> >
> > So apparently, blk_pm_resume_queue() races with the device going away.
> > Is this expected to happen even?
> >
> > If so, wouldn't it be better to modify pm_runtime_remove() to pass
> > true to __pm_runtime_disable() instead of making these ad hoc changes?
> >
>
> Sorry, I didn't make it clear in my previous message.
> I can confirm that __pm_runtime_disable() is called from
> device_suspend_late(), and this issue occurs during system suspend.

Interesting because the runtime PM workqueue is frozen at this point,
so waiting for a work item in it to complete is pointless.

What the patch does is to declare that the device can be
runtime-resumed in device_suspend_late(), but this is kind of a hack
IMV as it potentially affects the device's parent etc.

If the device cannot stay in runtime suspend across the entire system
suspend transition, it should be resumed (synchronously) earlier, in
device_suspend() or in device_prepare() even.