linux-kernel - Re: PM runtime_error handling missing in many drivers?

Message-ID: <50de9721-2dd8-448b-8c11-50b3923450f6@suse.com>
Date: Thu, 20 Feb 2025 10:30:34 +0100
From: Oliver Neukum <oneukum@...e.com>
To: Brian Norris <briannorris@...omium.org>,
 "Rafael J. Wysocki" <rafael@...nel.org>
Cc: Ajay Agarwal <ajayagarwal@...gle.com>, Oliver Neukum <oneukum@...e.com>,
 "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
 Vincent Whitchurch <vincent.whitchurch@...s.com>,
 "jic23@...nel.org" <jic23@...nel.org>,
 "linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "linux-iio@...r.kernel.org" <linux-iio@...r.kernel.org>
Subject: Re: PM runtime_error handling missing in many drivers?

On 19.02.25 23:15, Brian Norris wrote:
> On Wed, Feb 12, 2025 at 08:29:34PM +0100, Rafael J. Wysocki wrote:
>> The reason why runtime_error is there is to prevent runtime PM
>> callbacks from being run until something is done about the error,
>> under the assumption that running them in that case may make the
>> problem worse.
> 
> What makes you think it will make the problem worse? That seems like a
> rather large assumption to me. What kind of things do you think go
> wrong, that it requires the framework to stop any future attempts? Just
> spam (e.g., logging noise, if -EIO is persistent)? Or something worse?

suspend() is potentially three operations:

a) record device state
b) arm remote wakeup
c) transition to a lower power state

I wouldn't trust a device to perform the first two steps
without error handling either. It is an unnecessary risk.
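
A minimal sketch of the three steps above with error handling at each one,
assuming a hypothetical toy device model. None of the toy_* names are kernel
APIs; a real driver would implement a runtime_suspend callback against its
own hardware, and the unwind order here is only one plausible choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; none of these names come from the kernel. */
struct toy_dev {
	bool state_saved;
	bool wakeup_armed;
	bool low_power;
	int fail_step;	/* 0 = no failure, 1..3 = step that fails with -EIO */
};

/* Step a) record device state */
static int toy_save_state(struct toy_dev *d)
{
	if (d->fail_step == 1)
		return -EIO;
	d->state_saved = true;
	return 0;
}

/* Step b) arm remote wakeup */
static int toy_arm_wakeup(struct toy_dev *d)
{
	if (d->fail_step == 2)
		return -EIO;
	d->wakeup_armed = true;
	return 0;
}

/* Step c) transition to a lower power state */
static int toy_enter_low_power(struct toy_dev *d)
{
	if (d->fail_step == 3)
		return -EIO;
	d->low_power = true;
	return 0;
}

/* Check every step and undo the earlier ones if a later one fails. */
int toy_runtime_suspend(struct toy_dev *d)
{
	int ret;

	assert(d != NULL);

	ret = toy_save_state(d);
	if (ret)
		return ret;

	ret = toy_arm_wakeup(d);
	if (ret)
		goto err_discard_state;

	ret = toy_enter_low_power(d);
	if (ret)
		goto err_disarm;

	return 0;

err_disarm:
	d->wakeup_armed = false;
err_discard_state:
	d->state_saved = false;
	return ret;
}
```

With fail_step set to 3, the function returns -EIO and leaves the toy device
fully active, with wakeup disarmed and no saved state kept around.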

> And OTOH, there are clearly cases where retrying would be not only
> acceptable, but expected -- so giving special case to -EAGAIN and
> -EBUSY, per another branch of this thread, seems wise.

Yes
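
The special-casing under discussion can be modelled roughly as follows. This
is a simplified, self-contained sketch and not the kernel's implementation:
struct toy_pm, toy_rpm_callback() and toy_rpm_clear_error() are hypothetical
names, and returning -EINVAL while an error is latched is an assumption made
for the model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for per-device runtime PM state; not struct dev_pm_info. */
struct toy_pm {
	int runtime_error;	/* sticky error; nonzero blocks further callbacks */
	int calls;		/* number of times a callback actually ran */
};

typedef int (*toy_cb)(void *ctx);

/* Run a callback unless an earlier hard error is still latched. */
int toy_rpm_callback(struct toy_pm *pm, toy_cb cb, void *ctx)
{
	int ret;

	assert(pm != NULL && cb != NULL);

	if (pm->runtime_error)
		return -EINVAL;	/* blocked until the error is cleared */

	pm->calls++;
	ret = cb(ctx);

	/* Transient conditions may be retried later, so they are not latched. */
	if (ret && ret != -EAGAIN && ret != -EBUSY)
		pm->runtime_error = ret;

	return ret;
}

/* Hypothetical recovery hook: the driver has dealt with the error. */
void toy_rpm_clear_error(struct toy_pm *pm)
{
	assert(pm != NULL);
	pm->runtime_error = 0;
}

/* Demonstration callback: returns the int that ctx points to. */
int toy_return_code(void *ctx)
{
	return *(int *)ctx;
}
```

In this model a callback returning -EBUSY can simply be tried again, while one
returning -EIO stops all further callbacks until toy_rpm_clear_error() runs.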

> 
> I'd also note that AFAICT, there is no similar feature in system PM. If
> suspend() fails, we unwind and report the error ... but still allow
> future system suspend requests. resume() is even "worse" -- errors are
> essentially logged and ignored.

Suspend requests from runtime PM are different. They happen spontaneously.
Secondly, failures to suspend in runtime PM are far cheaper.

>> I'm not sure if I see a substantial difference between suspend and
>> resume in that respect: If any of them fails, the state of the device
>> is kind of unstable.  In particular, if resume fails and the device
>> doesn't actually resume, something needs to be done about it or it
>> just becomes unusable.

Again, if you look at it in an abstract manner, this is a mess. Resume()
is actually two operations:

a) transition to a power state that allows an operation
b) restore device settings

It is possible for the second step to fail after the first has worked.
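
A small sketch of that partial-failure case, assuming a hypothetical toy
device (the toy_* names and the three-valued state enum are illustrative
only, not kernel interfaces). The point is that a single error code from
resume cannot by itself tell the caller which of the two steps failed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model used only for this illustration. */
struct toy_resume_dev {
	bool powered;		/* step a) succeeded */
	bool configured;	/* step b) succeeded */
	bool power_fails;	/* inject a failure into step a) */
	bool restore_fails;	/* inject a failure into step b) */
};

/* Where the device ended up after a resume attempt. */
enum toy_resume_state {
	TOY_SUSPENDED,			/* step a) failed: still in low power */
	TOY_POWERED_UNCONFIGURED,	/* step a) worked, step b) failed */
	TOY_ACTIVE,			/* both steps worked */
};

int toy_runtime_resume(struct toy_resume_dev *d, enum toy_resume_state *state)
{
	assert(d != NULL && state != NULL);

	/* a) transition to a power state that allows operation */
	*state = TOY_SUSPENDED;
	if (d->power_fails)
		return -EIO;
	d->powered = true;

	/* b) restore device settings */
	*state = TOY_POWERED_UNCONFIGURED;
	if (d->restore_fails)
		return -EIO;
	d->configured = true;

	*state = TOY_ACTIVE;
	return 0;
}
```

Both failure modes return -EIO here, yet they leave the toy device in
different states: one is still suspended, the other is drawing full power
without its settings restored.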

> To me, it's about the state of the device. If suspend failed, the device
> may still be active and functional -- but not power-efficient. If resume
> failed, the device may be suspended and non-functional.
> 
> But anyway, I don't think I require asymmetry; I'm just more interested
> in unnecessary non-functionality. (Power inefficiency is less important,
> as in the worst case, we can at least save our data, reboot, and try
> again.)

You are calling for asymmetry ;-)

If you fail to resume, you will need to return an error. The functions
are just not equal in terms of consequences. We don't resume for fun.
We do, however, suspend just because a timer fires.

	Regards
		Oliver