linux-kernel - Re: PM runtime_error handling missing in many drivers?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z7QcjvKRGKgrI8EC@google.com>
Date: Tue, 18 Feb 2025 11:07:19 +0530
From: Ajay Agarwal <ajayagarwal@...gle.com>
To: "Rafael J. Wysocki" <rafael@...nel.org>
Cc: Brian Norris <briannorris@...gle.com>, Oliver Neukum <oneukum@...e.com>,
	"Rafael J. Wysocki" <rafael.j.wysocki@...el.com>,
	Vincent Whitchurch <vincent.whitchurch@...s.com>,
	"jic23@...nel.org" <jic23@...nel.org>,
	"linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-iio@...r.kernel.org" <linux-iio@...r.kernel.org>,
	Brian Norris <briannorris@...omium.org>,
	Joy Chakraborty <joychakr@...gle.com>,
	Vamshi Gajjela <vamshigajjela@...gle.com>,
	Manu Gautam <manugautam@...gle.com>
Subject: Re: PM runtime_error handling missing in many drivers?

On Mon, Feb 17, 2025 at 09:23:18PM +0100, Rafael J. Wysocki wrote:
> On Mon, Feb 17, 2025 at 4:49 AM Ajay Agarwal <ajayagarwal@...gle.com> wrote:
> >
> > On Wed, Feb 12, 2025 at 08:29:34PM +0100, Rafael J. Wysocki wrote:
> > > On Tue, Feb 11, 2025 at 11:21 PM Brian Norris <briannorris@...gle.com> wrote:
> > > >
> > > > Hi Ajay,
> > > >
> > > > On Mon, Feb 10, 2025 at 09:02:41AM +0530, Ajay Agarwal wrote:
> > > > > On Wed, Jul 27, 2022 at 06:31:48PM +0200, Rafael J. Wysocki wrote:
> > > > > > On Wed, Jul 27, 2022 at 10:08 AM Oliver Neukum <oneukum@...e.com> wrote:
> > > > > > > On 26.07.22 17:41, Rafael J. Wysocki wrote:
> > > > > > > > Well, in general suspending or resuming a device is a collaborative
> > > > > > > > effort and if one of the pieces falls over, making it work again
> > > > > > > > involves fixing up the failing piece and notifying the others that it
> > > > > > > > is ready again.  However, that part isn't covered and I'm not sure if
> > > > > > > > it can be covered in a sufficiently generic way.
> > > > > > >
> > > > > > > True. But that still cannot solve the question what is to be done
> > > > > > > if error handling fails. Hence my proposal:
> > > > > > > - record all failures
> > > > > > > - heed the record only when suspending
> > > > > >
> > > > > > I guess that would boil down to moving the power.runtime_error update
> > > > > > from rpm_callback() to rpm_suspend()?
> > > > > Resuming this discussion. One of the ways the device drivers are
> > > > > clearing the runtime_error flag is by calling pm_runtime_set_suspended
> > > > > [1].
> > >
> > > I personally think that jumping on a 2.5 years old thread is not a
> > > good idea.  It would be better to restate the problem statement and
> > > provide the link to the previous discussion.
> > >
> > > > > To me, it feels weird that a device driver calls pm_runtime_set_suspended
> > > > > if the runtime_resume() has failed. It should be implied that the device
> > > > > is in suspended state if the resume failed.
> > > > >
> > > > > So how really should the runtime_error flag be cleared? Should there be
> > > > > a new API exposed to device drivers for this? Or should we plan for it
> > > > > in the framework itself?
> > > >
> > > > While the API naming is unclear, that's exactly what
> > > > pm_runtime_set_suspended() is about. Personally, I find it nice when a
> > > > driver adds the comment "clear runtime_error flag", because otherwise
> > > > it's not really obvious why a driver has to take care of "suspending"
> > > > after a failed resume. But that's not the biggest question here, IMO.
> > > >
> > > > The real reson I pointed you at this thread was because I think it's
> > > > useful to pursue the proposal above: to avoid setting a persistent
> > > > "runtime_error" for resume failures. This seems to just create a pitfall
> > > > for clients, as asked by Vincent and Oliver upthread.
> > > >
> > > > And along this line, there are relatively few drivers that actually
> > > > bother to reset this error flag ever (e.g., commit f2bc2afe34c1
> > > > ("accel/ivpu: Clear runtime_error after pm_runtime_resume_and_get()
> > > > fails")).
> > > >
> > > > So to me, we should simply answer Rafael's question:
> > > >
> > > > (repeated:)
> > > > > > I guess that would boil down to moving the power.runtime_error update
> > > > > > from rpm_callback() to rpm_suspend()?
> > > >
> > > > Yes, I think so. (Although I'm not sure if this leaves undesirable spam
> > > > where persistent .runtime_resume() failures occur.)
> > > >
> > > > ...and then write/test/submit such a patch, provided it achieves the
> > > > desired results.
> > > >
> > > > Unless of course one of the thread participants here has some other
> > > > update in the intervening 2.5 years, or if Rafael was simply asking the
> > > > above rhetorically, and wasn't actually interested in fielding such a
> > > > change.
> > >
> > > The reason why runtime_error is there is to prevent runtime PM
> > > callbacks from being run until something is done about the error,
> > > under the assumption that running them in that case may make the
> > > problem worse.
> > >
> > > I'm not sure if I see a substantial difference between suspend and
> > > resume in that respect: If any of them fails, the state of the device
> > > is kind of unstable.  In particular, if resume fails and the device
> > > doesn't actually resume, something needs to be done about it or it
> > > just becomes unusable.
> > >
> > > Now, the way of clearing the error may not be super-convenient, which
> > > was a bit hard to figure out upfront, so I'm not against making any
> > > changes as long as there are sufficient reasons for making them.
> >
> > I am thinking if we can start with a change to not check runtime_error
> > in rpm_resume, and let it go through even if the previous rpm_resume
> > attempt failed. Something like this:
> >
> > ```
> > static int rpm_resume(struct device *dev, int rpmflags)
> >         trace_rpm_resume(dev, rpmflags);
> >
> >   repeat:
> > -       if (dev->power.runtime_error) {
> > -               retval = -EINVAL;
> > -       } else if (dev->power.disable_depth > 0) {
> > +       if (dev->power.disable_depth > 0) {
> >                 if (dev->power.runtime_status == RPM_ACTIVE &&
> >                     dev->power.last_status == RPM_ACTIVE)
> >                         retval = 1;
> > ```
> >
> > I think setting the runtime_error in rpm_callback, i.e. for both resume
> > and suspend is still a good idea for book-keeping purposes, e.g. the
> > user reading the runtime_status of the device from sysfs.
> 
> What would be the benefit of this change?
The benefit would be that the runtime_resume would be re-attempted even if
the previous attempt failed.