lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJZ5v0jDZQaR8S6Kn_RoXHBU86+tpjp=qgyxm5h03YEe2S=nPg@mail.gmail.com>
Date: Tue, 3 Jun 2025 11:38:37 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Mario Limonciello <superm1@...nel.org>, Chris Bainbridge <chris.bainbridge@...il.com>
Cc: "Rafael J. Wysocki" <rjw@...ysocki.net>, Linux PM <linux-pm@...r.kernel.org>, 
	LKML <linux-kernel@...r.kernel.org>, Alan Stern <stern@...land.harvard.edu>, 
	Ulf Hansson <ulf.hansson@...aro.org>, Johan Hovold <johan@...nel.org>, 
	Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>, Jon Hunter <jonathanh@...dia.com>, 
	Saravana Kannan <saravanak@...gle.com>, amd-gfx@...ts.freedesktop.org
Subject: Re: [PATCH v3 2/5] PM: sleep: Suspend async parents after suspending children

On Mon, Jun 2, 2025 at 9:58 PM Rafael J. Wysocki <rafael@...nel.org> wrote:
>
> On Mon, Jun 2, 2025 at 5:22 PM Mario Limonciello <superm1@...nel.org> wrote:
> >
> > On 6/2/2025 9:29 AM, Rafael J. Wysocki wrote:
> > > On Mon, Jun 2, 2025 at 2:11 PM Chris Bainbridge
> > > <chris.bainbridge@...il.com> wrote:
> > >>
> > >> On Fri, Mar 14, 2025 at 02:13:53PM +0100, Rafael J. Wysocki wrote:
> > >>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> > >>>
> > >>> In analogy with the previous change affecting the resume path,
> > >>> make device_suspend() start the async suspend of the device's parent
> > >>> after the device itself has been processed and make dpm_suspend() start
> > >>> processing "async" leaf devices (that is, devices without children)
> > >>> upfront so they don't need to wait for the "sync" devices they don't
> > >>> depend on.
> > >>>
> > >>> On the Dell XPS13 9360 in my office, this change reduces the total
> > >>> duration of device suspend by approximately 100 ms (over 20%).
> > >>>
> > >>> Suggested-by: Saravana Kannan <saravanak@...gle.com>
> > >>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> > >>
> > >> This commit results in memory corruption on suspend/resume with short
> > >> suspend duration.
> > >
> > > What do you mean by short?
> >
> > The tool he used will program a timer to wake up the system.
> > The time he input was programmed for a cycle that was short enough that
> > the suspend entry didn't finish and it triggered an aborted suspend.
>
> So it crashes during resume when the preceding suspend has been aborted IIUC.
>
> The exact crash mechanism is still unclear to me though.
>
> > >
> > >> Laptop appears to hang and crash is logged to pstore.
> > >
> > > Interesting that this is only happening on one system.
> > >
> > > Thanks for the report anyway, I'll look at this shortly.
> > >
> > >> To reproduce: `amd_s2idle.py --log log --duration 1 --wait 4 --count 30`
> > >>
> > >> I have reproduced this both with and without Mario's recent suspend fix
> > >> https://lore.kernel.org/amd-gfx/20250602014432.3538345-1-superm1@kernel.org/T/#t
> > >>
> > >> Pstore log (with Mario's fix):
> > >>
> > >> <6>[  194.209939] PM: suspend entry (s2idle)
> > >> <6>[  194.409450] Filesystems sync: 0.199 seconds
> > >> <6>[  194.409756] Freezing user space processes
> > >> <6>[  194.411374] Freezing user space processes completed (elapsed 0.001 seconds)
> > >> <6>[  194.411377] OOM killer disabled.
> > >> <6>[  194.411378] Freezing remaining freezable tasks
> > >> <6>[  194.412517] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> > >> <6>[  194.412520] printk: Suspending console(s) (use no_console_suspend to debug)
> > >> <7>[  194.663906] PM: suspend of devices aborted after 0.260 msecs
> > >> <7>[  194.663911] PM: start suspend of devices aborted after 251.365 msecs
> > >> <3>[  194.663913] PM: Some devices failed to suspend, or early wake event detected
> > >> <4>[  194.663975] i2c i2c-3: Unbalanced pm_runtime_enable!
> > >> <4>[  194.663989] ee1004 3-0050: Attempt to enable runtime PM when it is blocked
> > >> Oops#1 Part6
> > >> <4>[  194.663991] ee1004 3-0051: Attempt to enable runtime PM when it is blocked
> > >> <4>[  194.663992] CPU: 5 UID: 0 PID: 121 Comm: kworker/u64:10 Not tainted 6.15.0-rc1-00006-g032a79431b1c #425 PREEMPT(voluntary)
> > >> <4>[  194.663994] Hardware name: HP HP Pavilion Aero Laptop 13-be0xxx/8916, BIOS F.17 12/18/2024
> > >> <4>[  194.663996] Workqueue: async async_run_entry_fn
> > >> <4>[  194.663998]  slab kmalloc-2k
> > >> <4>[  194.664000]
> > >> <4>[  194.664000]  start ffff99bbe24ac800 pointer offset 408
> > >> <4>[  194.664001] Call Trace:
> > >> <4>[  194.664002]  size 2048
> > >> <3>[  194.664003] list_add corruption. prev->next should be next (ffffffff9da75c60), but was ffff99bbd1d94790. (prev=ffff99bbe24ac998).
> > >> <4>[  194.664003]  <TASK>
> > >> <4>[  194.664007]  dump_stack_lvl+0x6e/0x90
> > >> <4>[  194.664011] ------------[ cut here ]------------
> > >> <4>[  194.664011]  pm_runtime_enable.cold+0x28/0x48
> > >> <2>[  194.664011] kernel BUG at lib/list_debug.c:32!
> > >> <4>[  194.664013]  device_resume+0x47/0x200
> > >> <4>[  194.664016] Oops: invalid opcode: 0000 [#1] SMP
> > >> <4>[  194.664017]  async_resume+0x1d/0x30
> > >> <4>[  194.664018] CPU: 2 UID: 0 PID: 2505 Comm: amd_s2idle.py Not tainted 6.15.0-rc1-00006-g032a79431b1c #425 PREEMPT(voluntary)
> > >> <4>[  194.664019]  async_run_entry_fn+0x2e/0x130
> > >> <4>[  194.664020] Hardware name: HP HP Pavilion Aero Laptop 13-be0xxx/8916, BIOS F.17 12/18/2024
> > >> <4>[  194.664021] RIP: 0010:__list_add_valid_or_report+0x90/0xa0
> > >> <4>[  194.664022]  process_one_work+0x22b/0x5b0
> > >> <4>[  194.664024] Code: e4 8a ff 0f 0b 48 89 f7 48 89 34 24 e8 49 57 c6 ff 48 8b 34 24 48 c7 c7 70 d1 64 9d 48 8b 16 48 89 f1 48 89 de e8 00 e4 8a ff <0f> 0b 90 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 41 54
> > >> Oops#1 Part5
> > >> <4>[  194.664025] RSP: 0018:ffffc09a45dafb20 EFLAGS: 00010246
> > >> <4>[  194.664026]  worker_thread+0x1da/0x3d0
> > >> <4>[  194.664027] RAX: 0000000000000075 RBX: ffffffff9da75c60 RCX: 0000000000000027
> > >> <4>[  194.664028] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff99becd11de40
> > >> <4>[  194.664029] RBP: ffffffff9da74c00 R08: 0000000000000000 R09: 0000000000000000
> > >> <4>[  194.664029] R10: 0000000000000000 R11: 0000000000000003 R12: 0000000000000010
> > >> <4>[  194.664029]  ? bh_worker+0x260/0x260
> > >> <4>[  194.664030] R13: 0000002990e47f3d R14: ffff99bbe24ac998 R15: ffff99bbe0b67620
> > >> <4>[  194.664031] FS:  00007fe534bfc080(0000) GS:ffff99bf2ee50000(0000) knlGS:0000000000000000
> > >> <4>[  194.664031]  kthread+0x10a/0x250
> > >> <4>[  194.664032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >> <4>[  194.664033] CR2: 000055fdfaf910b8 CR3: 000000010d4a9000 CR4: 0000000000f50ef0
> > >> <4>[  194.664034]  ? kthreads_online_cpu+0x130/0x130
> > >> <4>[  194.664034] PKRU: 55555554
> > >> <4>[  194.664035] Call Trace:
> > >> <4>[  194.664036]  <TASK>
> > >> <4>[  194.664036]  ret_from_fork+0x31/0x50
> > >> <4>[  194.664037]  dpm_resume+0x139/0x350
> > >> <4>[  194.664039]  ? kthreads_online_cpu+0x130/0x130
> > >> <4>[  194.664041]  dpm_resume_end+0x11/0x20
> > >> <4>[  194.664040]  ret_from_fork_asm+0x11/0x20
> > >> <4>[  194.664042]  suspend_devices_and_enter+0x18e/0x9f0
> > >> <4>[  194.664045]  </TASK>
> > >> <4>[  194.664046]  pm_suspend.cold+0x22f/0x28f
> > >> <4>[  194.664046] CPU: 4 UID: 0 PID: 115 Comm: kworker/u64:4 Not tainted 6.15.0-rc1-00006-g032a79431b1c #425 PREEMPT(voluntary)
> > >> Oops#1 Part4
> > >> <4>[  194.664048]  state_store+0x6c/0xd0
> > >> <4>[  194.664049] Hardware name: HP HP Pavilion Aero Laptop 13-be0xxx/8916, BIOS F.17 12/18/2024
> > >> <4>[  194.664050] Workqueue: async async_run_entry_fn
> > >> <4>[  194.664051]  kernfs_fop_write_iter+0x194/0x250
> > >> <4>[  194.664052] Call Trace:
> > >> <4>[  194.664052]  <TASK>
> > >> <4>[  194.664053]  dump_stack_lvl+0x6e/0x90
> > >> <4>[  194.664054]  vfs_write+0x2ac/0x550
> > >> <4>[  194.664055]  pm_runtime_enable.cold+0x28/0x48
> > >> <4>[  194.664057]  device_resume+0x47/0x200
> > >> <4>[  194.664058]  ksys_write+0x71/0xe0
> > >> <4>[  194.664060]  async_resume+0x1d/0x30
> > >> <4>[  194.664060]  do_syscall_64+0x95/0x1a0
> > >> <4>[  194.664062]  async_run_entry_fn+0x2e/0x130
> > >> <4>[  194.664062]  ? lockdep_sys_exit+0x1e/0x90
> > >> <4>[  194.664064]  process_one_work+0x22b/0x5b0
> > >> <4>[  194.664064]  ? trace_hardirqs_on_prepare+0x77/0xa0
> > >> <4>[  194.664066]  ? syscall_exit_to_user_mode+0xb1/0x280
> > >> <4>[  194.664067]  worker_thread+0x1da/0x3d0
> > >> <4>[  194.664068]  ? __mutex_lock+0xdb/0xed0
> > >> <4>[  194.664070]  ? __mutex_lock+0xafb/0xed0
> > >> <4>[  194.664070]  ? bh_worker+0x260/0x260
> > >> <4>[  194.664072]  ? kernfs_fop_llseek+0x35/0xd0
> > >> <4>[  194.664072]  kthread+0x10a/0x250
> > >> <4>[  194.664073]  ? lock_release+0x1ff/0x2a0
> > >> <4>[  194.664074]  ? kthreads_online_cpu+0x130/0x130
> > >> <4>[  194.664075]  ? lock_acquire+0x270/0x2d0
> > >> <4>[  194.664076]  ret_from_fork+0x31/0x50
> > >> <4>[  194.664077]  ? __mutex_unlock_slowpath+0x3c/0x2c0
> > >> <4>[  194.664078]  ? kthreads_online_cpu+0x130/0x130
> > >> <4>[  194.664079]  ? kernfs_fop_llseek+0x77/0xd0
> > >> <4>[  194.664079]  ret_from_fork_asm+0x11/0x20
> > >> <4>[  194.664081]  ? lockdep_sys_exit+0x1e/0x90
> > >> <4>[  194.664082]  ? trace_hardirqs_on_prepare+0x77/0xa0
> > >> Oops#1 Part3
> > >> <4>[  194.664084]  </TASK>
> > >> <4>[  194.664084]  ? syscall_exit_to_user_mode+0xb1/0x280
> > >> <4>[  194.664086]  ? do_syscall_64+0xa1/0x1a0
> > >> <4>[  194.664086] uvcvideo 1-3:1.0: Unbalanced pm_runtime_enable!
>
> So it looks like this device is resumed even though it has not been suspended.

Or it is resumed for the second time in a row during the same transition.

I think I know what is going on and the bug is not in the commit in question.

There is a race between dpm_async_resume_children() and the first loop
in dpm_resume() which fortunately is only fatal when the preceding
suspend transition is aborted.  Namely, that loop can call
dpm_clear_async_state() for a device after dpm_async_with_cleanup()
has run for it, so power.work_in_progress gets cleared and
__dpm_async() will allow an async work to be scheduled for the same
device once again.

Chris, please check if the attached patch helps.  I'm going to post it
as a fix anyway later today, but it would be good to verify that it is
sufficient.

Thanks!

View attachment "pm-sleep-fix-async-resume.patch" of type "text/x-patch" (555 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ