linux-kernel - Re: PM runtime_error handling missing in many drivers?


Message-ID: <5c37ee19-fe2c-fb22-63a2-638e3dab8f7a@suse.com>
Date:   Wed, 27 Jul 2022 10:08:06 +0200
From:   Oliver Neukum <oneukum@...e.com>
To:     "Rafael J. Wysocki" <rafael@...nel.org>,
        Oliver Neukum <oneukum@...e.com>
Cc:     "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
        Vincent Whitchurch <vincent.whitchurch@...s.com>,
        "jic23@...nel.org" <jic23@...nel.org>,
        "linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-iio@...r.kernel.org" <linux-iio@...r.kernel.org>
Subject: Re: PM runtime_error handling missing in many drivers?



On 26.07.22 17:41, Rafael J. Wysocki wrote:
> On Tue, Jul 26, 2022 at 11:05 AM Oliver Neukum <oneukum@...e.com> wrote:

> I guess that depends on what is regarded as "the framework".  I mean
> the PM-runtime code, excluding the bus type or equivalent.

Yes, we have multiple candidates in the generic case. Easy to overengineer.

>>> The idea was that drivers would clear these errors.
>>
>> I am afraid that is a deeply hidden layering violation. Yes, a driver's
>> resume() method may have failed. In that case, if that is the same
>> driver, it will obviously already know about the failure.
> 
> So presumably it will do something to recover and avoid returning the
> error in the first place.

Yes, but that does not help us if they do return an error.

> From the PM-runtime core code perspective, if an error is returned by
> a suspend callback and it is not -EBUSY or -EAGAIN, the subsequent
> suspend is also likely to fail.

True.

> If a resume callback returns an error, any subsequent suspend or
> resume operations are likely to fail.

Also true, but the consequences are different.

> Storing the error effectively prevents subsequent operations from
> being carried out in both cases and that's why it is done.
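
The behaviour described above can be pictured with a small userspace
model. This is only a sketch, not the kernel's code: struct model_dev,
model_record(), model_suspend(), model_resume() and the model_cb_*
callbacks are hypothetical names made up for illustration, and the
-EINVAL returned while an error is stored is an assumption of the model.

```c
#include <errno.h>

/* Hypothetical, simplified stand-in for per-device PM-runtime state. */
struct model_dev {
	int runtime_error;	/* sticky error from the last failed callback */
	int suspended;		/* 1 while the device is runtime-suspended */
};

/* Example callbacks standing in for driver suspend/resume methods. */
int model_cb_ok(void)   { return 0; }
int model_cb_busy(void) { return -EBUSY; }
int model_cb_eio(void)  { return -EIO; }

/* Store a callback result; -EBUSY and -EAGAIN count as transient. */
int model_record(struct model_dev *dev, int ret)
{
	if (ret == -EBUSY || ret == -EAGAIN)
		dev->runtime_error = 0;
	else if (ret)
		dev->runtime_error = ret;
	return ret;
}

/* Suspend is refused while an error is stored. */
int model_suspend(struct model_dev *dev, int (*cb)(void))
{
	int ret;

	if (dev->runtime_error)
		return -EINVAL;	/* assumed error code for the model */
	ret = model_record(dev, cb());
	if (!ret)
		dev->suspended = 1;
	return ret;
}

/* Resume is refused in exactly the same way. */
int model_resume(struct model_dev *dev, int (*cb)(void))
{
	int ret;

	if (dev->runtime_error)
		return -EINVAL;	/* assumed error code for the model */
	ret = model_record(dev, cb());
	if (!ret)
		dev->suspended = 0;
	return ret;
}
```

In this model one failed resume with -EIO blocks every later suspend
and resume until something clears runtime_error, while a suspend that
fails with -EBUSY leaves the device free to try again.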

I am afraid seeing these two operations as equivalent for this
purpose is a problem for two reasons:

1. suspend can be initiated by the generic framework
2. a failure to suspend leads to worse power consumption,
   while a failure to resume is -EIO, at best

>> PM operations, however, are operating on a tree. A driver requesting
>> a resume may get an error code. But it has no idea where this error
>> comes from. The generic code knows at least that.
> 
> Well, what do you mean by "the generic code"?

In this case the device model, which has the tree and all dependencies.
Error handling here is potentially very complicated because

1. a driver can experience an error from a node higher in the tree
2. a driver can trigger a failure in a sibling
3. a driver for a node can be less specific than the drivers higher up

Reducing this to a single error condition is difficult.
Suppose you have a USB device with two interfaces. The driver for A
initiates a resume. Interface A is resumed; B reports an error.
Should this block further attempts to suspend the whole device?

>> Let's look at a USB storage device. The request to resume comes
>> from sd.c. sd.c is certainly not equipped to handle a PCI error
>> condition that has prevented a USB host controller from resuming.
> 
> Sure, but this doesn't mean that suspending or resuming the device is
> a good idea until the error condition gets resolved.

For suspending, I clearly agree. Resuming is another matter: it has to
work if you want to operate without errors.

>> I am afraid this part of the API has issues. And they keep growing
>> the more we divorce the device driver from the bus driver, which
>> actually does the PM operation.
> 
> Well, in general suspending or resuming a device is a collaborative
> effort and if one of the pieces falls over, making it work again
> involves fixing up the failing piece and notifying the others that it
> is ready again.  However, that part isn't covered and I'm not sure if
> it can be covered in a sufficiently generic way.

True. But that still does not answer the question of what is to be done
if error handling fails. Hence my proposal:
- record all failures
- heed the record only when suspending
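
A minimal userspace sketch of what these two points could look like,
under stated assumptions: struct sketch_dev, sketch_suspend(),
sketch_resume() and the sketch_cb_* callbacks are hypothetical names,
the -EINVAL returned for a refused suspend is an assumption, and since
the proposal does not say when the record would be cleared, the sketch
never clears it.

```c
#include <errno.h>

/* Hypothetical per-device state for the sketch. */
struct sketch_dev {
	int recorded_error;	/* last recorded failure, 0 if none */
	int suspended;		/* 1 while the device is suspended */
};

/* Example callbacks standing in for driver suspend/resume methods. */
int sketch_cb_ok(void)  { return 0; }
int sketch_cb_eio(void) { return -EIO; }

/* Suspend heeds the record: a recorded failure blocks the attempt. */
int sketch_suspend(struct sketch_dev *dev, int (*cb)(void))
{
	int ret;

	if (dev->recorded_error)
		return -EINVAL;	/* assumed error code for the sketch */
	ret = cb();
	if (ret) {
		dev->recorded_error = ret;	/* record all failures */
		return ret;
	}
	dev->suspended = 1;
	return 0;
}

/* Resume ignores the record and is always attempted. */
int sketch_resume(struct sketch_dev *dev, int (*cb)(void))
{
	int ret = cb();

	if (ret) {
		dev->recorded_error = ret;	/* record all failures */
		return ret;
	}
	dev->suspended = 0;
	return 0;
}
```

With this sketch a resume that failed once can be retried and may
succeed, while the recorded failure still keeps the device from being
suspended again.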

	Regards
		Oliver