Message-ID: <CAJZ5v0ijy4FG84xk_n8gxR_jS0xao246eVbnFj-dXzwz=8S9NQ@mail.gmail.com>
Date: Wed, 27 Jul 2022 18:31:48 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Oliver Neukum <oneukum@...e.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
Vincent Whitchurch <vincent.whitchurch@...s.com>,
"jic23@...nel.org" <jic23@...nel.org>,
"linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-iio@...r.kernel.org" <linux-iio@...r.kernel.org>
Subject: Re: PM runtime_error handling missing in many drivers?

On Wed, Jul 27, 2022 at 10:08 AM Oliver Neukum <oneukum@...e.com> wrote:
>
>
>
> On 26.07.22 17:41, Rafael J. Wysocki wrote:
> > On Tue, Jul 26, 2022 at 11:05 AM Oliver Neukum <oneukum@...e.com> wrote:
>
> > I guess that depends on what is regarded as "the framework". I mean
> > the PM-runtime code, excluding the bus type or equivalent.
>
> Yes, we have multiple candidates in the generic case. Easy to overengineer.
>
> >>> The idea was that drivers would clear these errors.
> >>
> >> I am afraid that is a deeply hidden layering violation. Yes, a driver's
> >> resume() method may have failed. In that case, if that is the same
> >> driver, it will obviously already know about the failure.
> >
> > So presumably it will do something to recover and avoid returning the
> > error in the first place.
>
> Yes, but that does not help us if they do return an error.
>
> > From the PM-runtime core code perspective, if an error is returned by
> > a suspend callback and it is not -EBUSY or -EAGAIN, the subsequent
> > suspend is also likely to fail.
>
> True.
>
> > If a resume callback returns an error, any subsequent suspend or
> > resume operations are likely to fail.
>
> Also true, but the consequences are different.
>
> > Storing the error effectively prevents subsequent operations from
> > being carried out in both cases and that's why it is done.
>
> I am afraid seeing these two operations as equivalent for this
> purpose is a problem for two reasons:
>
> 1. suspend can be initiated by the generic framework

Resume can be initiated by generic code too.

> 2. a failure to suspend leads to worse power consumption,
> while a failure to resume is -EIO, at best

Yes, a failure to resume is a big deal.
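
For reference, the behaviour described in the quoted text above (store
the callback error, do not latch -EBUSY or -EAGAIN from suspend, block
both operations while an error is stored) can be modelled in a few lines
of userspace C. This is a minimal sketch, not the kernel code: toy_dev,
toy_callback(), toy_suspend() and toy_resume() are hypothetical names,
the callback result is passed in as a parameter, and returning -EINVAL
for a blocked attempt is an assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in; names are illustrative, not kernel API. */
struct toy_dev {
	int runtime_error;	/* models dev->power.runtime_error */
	bool suspended;		/* models the runtime PM status */
};

/* Models the shared callback wrapper: every result is stored, 0 included. */
static int toy_callback(struct toy_dev *dev, int cb_result)
{
	dev->runtime_error = cb_result;
	return cb_result;
}

static int toy_suspend(struct toy_dev *dev, int cb_result)
{
	int ret;

	if (dev->runtime_error)
		return -EINVAL;		/* a stored error blocks the attempt */
	if (dev->suspended)
		return 0;

	ret = toy_callback(dev, cb_result);
	if (ret == -EBUSY || ret == -EAGAIN)
		dev->runtime_error = 0;	/* transient, so a retry stays possible */
	if (!ret)
		dev->suspended = true;
	return ret;
}

static int toy_resume(struct toy_dev *dev, int cb_result)
{
	int ret;

	if (dev->runtime_error)
		return -EINVAL;		/* resume is blocked by the same record */
	if (!dev->suspended)
		return 0;

	ret = toy_callback(dev, cb_result);
	if (!ret)
		dev->suspended = false;
	return ret;
}
```

In this model a single failed resume makes every later toy_suspend()
and toy_resume() call fail until something resets runtime_error, which
is the symmetry between the two operations that is being questioned.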
> >> PM operations, however, are operating on a tree. A driver requesting
> >> a resume may get an error code. But it has no idea where this error
> >> comes from. The generic code knows at least that.
> >
> > Well, what do you mean by "the generic code"?
>
> In this case the device model, which has the tree and all dependencies.
> Error handling here is potentially very complicated because
>
> 1. a driver can experience an error from a node higher in the tree

Well, there can be an error coming from a parent or a supplier, but
the driver will not receive it directly.

> 2. a driver can trigger a failure in a sibling
> 3. a driver for a node can be less specific than the drivers higher up

I'm not sure I understand the above correctly.

> Reducing this to a single error condition is difficult.

Fair enough.

> Suppose you have a USB device with two interfaces. The driver for A
> initiates a resume. Interface A is resumed; B reports an error.
> Should this block further attempts to suspend the whole device?

It should IMV.

> >> Let's look at a USB storage device. The request to resume comes
> >> from sd.c. sd.c is certainly not equipped to handle a PCI error
> >> condition that has prevented a USB host controller from resuming.
> >
> > Sure, but this doesn't mean that suspending or resuming the device is
> > a good idea until the error condition gets resolved.
>
> Suspending clearly yes. Resuming is another matter. It has to work
> if you want to operate without errors.

Well, it has to physically work in the first place. If it doesn't,
the rest is a bit moot, because you end up with a non-functional
device that appears to be permanently suspended.

> >> I am afraid this part of the API has issues. And they keep growing
> >> the more we divorce the device driver from the bus driver, which
> >> actually does the PM operation.
> >
> > Well, in general suspending or resuming a device is a collaborative
> > effort and if one of the pieces falls over, making it work again
> > involves fixing up the failing piece and notifying the others that it
> > is ready again. However, that part isn't covered and I'm not sure if
> > it can be covered in a sufficiently generic way.
>
> True. But that still cannot answer the question of what is to be done
> if error handling fails. Hence my proposal:
> - record all failures
> - heed the record only when suspending

I guess that would boil down to moving the power.runtime_error update
from rpm_callback() to rpm_suspend()?
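
To make that reading concrete, here is a minimal userspace sketch, not
a patch. It assumes the record is written only in the suspend path and,
following the "heed the record only when suspending" bullet, that resume
never consults it. model_dev, model_suspend() and model_resume() are
hypothetical names, the callback result is passed in as a parameter, and
the -EINVAL return for a blocked suspend is an assumption of the sketch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_dev {
	int runtime_error;	/* stands in for dev->power.runtime_error */
	bool suspended;		/* stands in for the runtime PM status */
};

/* cb_result is what the (imaginary) driver suspend callback returns. */
static int model_suspend(struct model_dev *dev, int cb_result)
{
	if (dev->runtime_error)
		return -EINVAL;		/* the record is heeded here only */
	if (dev->suspended)
		return 0;
	if (cb_result == -EBUSY || cb_result == -EAGAIN)
		return cb_result;	/* transient: nothing is recorded */
	if (cb_result) {
		dev->runtime_error = cb_result;	/* update lives in the suspend path */
		return cb_result;
	}
	dev->suspended = true;
	return 0;
}

/* cb_result is what the (imaginary) driver resume callback returns. */
static int model_resume(struct model_dev *dev, int cb_result)
{
	/* No runtime_error check: a resume is always attempted. */
	if (!dev->suspended)
		return 0;
	if (cb_result)
		return cb_result;	/* reported to the caller, not latched */
	dev->suspended = false;
	return 0;
}
```

Under these assumptions a failed resume can simply be retried, while a
hard suspend failure still stops further suspend attempts. Whether a
resume failure should also be recorded, as "record all failures" would
suggest, is left open by the sketch.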