lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 7 Feb 2024 11:59:47 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Marek Szyprowski <m.szyprowski@...sung.com>
Cc: "Rafael J. Wysocki" <rjw@...ysocki.net>, Linux PM <linux-pm@...r.kernel.org>, 
	Ulf Hansson <ulf.hansson@...aro.org>, Tejun Heo <tj@...nel.org>, 
	Nathan Chancellor <nathan@...nel.org>, LKML <linux-kernel@...r.kernel.org>, 
	Stanislaw Gruszka <stanislaw.gruszka@...ux.intel.com>, Lai Jiangshan <jiangshanlai@...il.com>, 
	Naohiro.Aota@....com, kernel-team@...a.com, 
	Konrad Dybcio <konradybcio@...nel.org>
Subject: Re: [PATCH v1] PM: sleep: Restore asynchronous device resume optimization

On Wed, Feb 7, 2024 at 11:31 AM Marek Szyprowski
<m.szyprowski@...sung.com> wrote:
>
> Dear All,
>
> On 09.01.2024 17:59, Rafael J. Wysocki wrote:
> >   From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> >
> > Before commit 7839d0078e0d ("PM: sleep: Fix possible deadlocks in core
> > system-wide PM code"), the resume of devices that were allowed to resume
> > asynchronously was scheduled before starting the resume of the other
> > devices, so the former did not have to wait for the latter unless
> > functional dependencies were present.
> >
> > Commit 7839d0078e0d removed that optimization in order to address a
> > correctness issue, but it can be restored with the help of a new device
> > power management flag, so do that now.
> >
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> > ---
>
> This patch finally landed in linux-next some time ago as 3e999770ac1c
> ("PM: sleep: Restore asynchronous device resume optimization"). Recently
> I found that it causes a non-trivial interaction with commit
> 5797b1c18919 ("workqueue: Implement system-wide nr_active enforcement
> for unbound workqueues"). Since merge commit 954350a5f8db in linux-next
> system suspend/resume fails (board doesn't wake up) on my old Samsung
> Exynos4412-based Odroid-U3 board (ARM 32bit based), which was rock
> stable for last years.
>
> My further investigations confirmed that the mentioned commits are
> responsible for this issue. Each of them separately (3e999770ac1c and
> 5797b1c18919) doesn't trigger any problems. Reverting any of them on top
> of linux-next (with some additional commit due to code dependencies)
> also fixes/hides the problem.
>
> Let me know if You need more information or tests on the hardware. I'm
> open to help debugging this issue.

So I found this report:

https://lore.kernel.org/lkml/b3d08cd8-d77f-45dd-a2c3-4a4db5a98dfa@kernel.org/

which appears to be about the same issue.

Strictly speaking, the regression is introduced by 5797b1c18919,
because it is not a mainline commit yet, but the information regarding
the interaction of it with commit 3e999770ac1c is valuable.

Essentially, what commit 3e999770ac1c does is to schedule the
execution of all of the async resume callbacks in a given phase
upfront, so they can run without waiting for the sync ones to complete
(except when there is a parent-child or supplier-consumer dependency -
the latter represented by a device link).

What seems to be happening after commit 5797b1c18919 is that one (or
more) of the async callbacks get blocked in the workqueue for some
reason.

I would try to replace system_unbound_wq in
__async_schedule_node_domain() with a dedicated workqueue that is not
unbound and see what happens.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ