linux-kernel - Re: PM runtime_error handling missing in many drivers?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CAJZ5v0i83eJWV_kvWxZvja+Js3tKbrwZ8rVVGn7vR=0qLf1mtw@mail.gmail.com>
Date: Wed, 12 Feb 2025 20:29:34 +0100
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Brian Norris <briannorris@...gle.com>
Cc: Ajay Agarwal <ajayagarwal@...gle.com>, "Rafael J. Wysocki" <rafael@...nel.org>, 
	Oliver Neukum <oneukum@...e.com>, "Rafael J. Wysocki" <rafael.j.wysocki@...el.com>, 
	Vincent Whitchurch <vincent.whitchurch@...s.com>, "jic23@...nel.org" <jic23@...nel.org>, 
	"linux-pm@...r.kernel.org" <linux-pm@...r.kernel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, 
	"linux-iio@...r.kernel.org" <linux-iio@...r.kernel.org>, Brian Norris <briannorris@...omium.org>
Subject: Re: PM runtime_error handling missing in many drivers?

On Tue, Feb 11, 2025 at 11:21 PM Brian Norris <briannorris@...gle.com> wrote:
>
> Hi Ajay,
>
> On Mon, Feb 10, 2025 at 09:02:41AM +0530, Ajay Agarwal wrote:
> > On Wed, Jul 27, 2022 at 06:31:48PM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Jul 27, 2022 at 10:08 AM Oliver Neukum <oneukum@...e.com> wrote:
> > > > On 26.07.22 17:41, Rafael J. Wysocki wrote:
> > > > > Well, in general suspending or resuming a device is a collaborative
> > > > > effort and if one of the pieces falls over, making it work again
> > > > > involves fixing up the failing piece and notifying the others that it
> > > > > is ready again.  However, that part isn't covered and I'm not sure if
> > > > > it can be covered in a sufficiently generic way.
> > > >
> > > > True. But that still cannot solve the question what is to be done
> > > > if error handling fails. Hence my proposal:
> > > > - record all failures
> > > > - heed the record only when suspending
> > >
> > > I guess that would boil down to moving the power.runtime_error update
> > > from rpm_callback() to rpm_suspend()?
> > Resuming this discussion. One of the ways the device drivers are
> > clearing the runtime_error flag is by calling pm_runtime_set_suspended
> > [1].

I personally think that jumping on a 2.5 years old thread is not a
good idea.  It would be better to restate the problem statement and
provide the link to the previous discussion.

> > To me, it feels weird that a device driver calls pm_runtime_set_suspended
> > if the runtime_resume() has failed. It should be implied that the device
> > is in suspended state if the resume failed.
> >
> > So how really should the runtime_error flag be cleared? Should there be
> > a new API exposed to device drivers for this? Or should we plan for it
> > in the framework itself?
>
> While the API naming is unclear, that's exactly what
> pm_runtime_set_suspended() is about. Personally, I find it nice when a
> driver adds the comment "clear runtime_error flag", because otherwise
> it's not really obvious why a driver has to take care of "suspending"
> after a failed resume. But that's not the biggest question here, IMO.
>
> The real reson I pointed you at this thread was because I think it's
> useful to pursue the proposal above: to avoid setting a persistent
> "runtime_error" for resume failures. This seems to just create a pitfall
> for clients, as asked by Vincent and Oliver upthread.
>
> And along this line, there are relatively few drivers that actually
> bother to reset this error flag ever (e.g., commit f2bc2afe34c1
> ("accel/ivpu: Clear runtime_error after pm_runtime_resume_and_get()
> fails")).
>
> So to me, we should simply answer Rafael's question:
>
> (repeated:)
> > > I guess that would boil down to moving the power.runtime_error update
> > > from rpm_callback() to rpm_suspend()?
>
> Yes, I think so. (Although I'm not sure if this leaves undesirable spam
> where persistent .runtime_resume() failures occur.)
>
> ...and then write/test/submit such a patch, provided it achieves the
> desired results.
>
> Unless of course one of the thread participants here has some other
> update in the intervening 2.5 years, or if Rafael was simply asking the
> above rhetorically, and wasn't actually interested in fielding such a
> change.

The reason why runtime_error is there is to prevent runtime PM
callbacks from being run until something is done about the error,
under the assumption that running them in that case may make the
problem worse.

I'm not sure if I see a substantial difference between suspend and
resume in that respect: If any of them fails, the state of the device
is kind of unstable.  In particular, if resume fails and the device
doesn't actually resume, something needs to be done about it or it
just becomes unusable.

Now, the way of clearing the error may not be super-convenient, which
was a bit hard to figure out upfront, so I'm not against making any
changes as long as there are sufficient reasons for making them.