linux-kernel - Re: PM runtime_error handling missing in many drivers?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <Z7ktqHxIhp90jLxi@google.com>
Date: Fri, 21 Feb 2025 17:51:36 -0800
From: Brian Norris <briannorris@...omium.org>
To: Oliver Neukum <oneukum@...e.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
	Ajay Agarwal <ajayagarwal@...gle.com>,
	"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
	"jic23@...nel.org" <jic23@...nel.org>,
	"linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-iio@...r.kernel.org" <linux-iio@...r.kernel.org>
Subject: Re: PM runtime_error handling missing in many drivers?

Hi Oliver,

On Thu, Feb 20, 2025 at 10:30:34AM +0100, Oliver Neukum wrote:
> On 19.02.25 23:15, Brian Norris wrote:
> > On Wed, Feb 12, 2025 at 08:29:34PM +0100, Rafael J. Wysocki wrote:
> > > The reason why runtime_error is there is to prevent runtime PM
> > > callbacks from being run until something is done about the error,
> > > under the assumption that running them in that case may make the
> > > problem worse.
> > 
> > What makes you think it will make the problem worse? That seems like a
> > rather large assumption to me. What kind of things do you think go
> > wrong, that it requires the framework to stop any future attempts? Just
> > spam (e.g., logging noise, if -EIO is persistent)? Or something worse?e
> 
> suspend() is three operations, potentially
> 
> a) record device state
> b) arm remote wakeup
> c) transition to a lower power state
> 
> I wouldn't trust a device to perform the first two steps
> without error handling either. It is an unnecessary risk.

I'm not sure I fully understand what you're saying. I'm not saying
drivers shouldn't handle errors. I'm just saying I don't see why the
framework should decide, "fail once and you're out."

Do you think (a) or (b) will fail silently if retried after a failed
operation? And what's the consequence?

> > But anyway, I don't think I require asymmetry; I'm just more interested
> > in unnecessary non-functionality. (Power inefficiency is less important,
> > as in the worst case, we can at least save our data, reboot, and try
> > again.)
> 
> You are calling for asymmetry ;-)

Actually, you were the one who proposed asymmetry :) My concern is
asymmetric, but the solution doesn't have to be. For example, we could
remove runtime_error entirely, or else make it some kind of
ratelimited/backoff state.

Anyway, I appreciate that Rafael has helped improve the situation a bit
([PATCH v1] PM: runtime: Unify error handling during suspend and
resume). At least it gives us a tool to achieve what we want: ensure
that retriable failures produce -EBUSY or -EAGAIN. I'll have to give it
a whirl.

But I'm still wary that there are corner cases where other errors may
appear, and yet retrying is indeed the best option. And I'm not
confident that foisting the burden back onto the driver ("just scatter
pm_runtime_set_suspended() any time you might have fixed something") is
a practical approach either.

> If you fail to resume, you will need to return an error. The functions
> are just not equal in terms of consequences. We don't resume for fun.
> We do, however, suspend just because a timer fires.

Agreed.

Brian