Message-ID: <20200708135748.l4zncodhhggurp6s@gilmour.lan>
Date: Wed, 8 Jul 2020 15:57:48 +0200
From: Maxime Ripard <maxime@...no.tech>
To: Ondřej Jirman <megous@...ous.com>,
linux-sunxi@...glegroups.com,
Vasily Khoruzhick <anarsoul@...il.com>,
Yangtao Li <tiny.windzz@...il.com>,
Zhang Rui <rui.zhang@...el.com>,
Daniel Lezcano <daniel.lezcano@...aro.org>,
Amit Kucheria <amit.kucheria@...durent.com>,
Chen-Yu Tsai <wens@...e.org>,
"open list:ALLWINNER THERMAL DRIVER" <linux-pm@...r.kernel.org>,
"moderated list:ARM/Allwinner sunXi SoC support"
<linux-arm-kernel@...ts.infradead.org>,
open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] thermal: sun8i: Be loud when probe fails

On Wed, Jul 08, 2020 at 03:44:41PM +0200, Ondřej Jirman wrote:
> On Wed, Jul 08, 2020 at 03:36:54PM +0200, Maxime Ripard wrote:
> > On Wed, Jul 08, 2020 at 03:29:24PM +0200, Ondřej Jirman wrote:
> > > Hello Maxime,
> > >
> > > On Wed, Jul 08, 2020 at 02:25:42PM +0200, Maxime Ripard wrote:
> > > > Hi,
> > > >
> > > > On Wed, Jul 08, 2020 at 12:55:27PM +0200, Ondrej Jirman wrote:
> > > > > I noticed several mobile Linux distributions failing to enable the
> > > > > thermal regulation correctly, because the kernel is silent
> > > > > when thermal driver fails to probe. Add enough error reporting
> > > > > to debug issues and warn users in case thermal sensor is failing
> > > > > to probe.
> > > > >
> > > > > > Failing to notify users means that the SoC can easily overheat
> > > > > > under load.
> > > > >
> > > > > Signed-off-by: Ondrej Jirman <megous@...ous.com>
> > > > > ---
> > > > > drivers/thermal/sun8i_thermal.c | 55 ++++++++++++++++++++++++++-------
> > > > > 1 file changed, 43 insertions(+), 12 deletions(-)
> > > > >
> > > > > diff --git a/drivers/thermal/sun8i_thermal.c b/drivers/thermal/sun8i_thermal.c
> > > > > index 74d73be16496..9065e79ae743 100644
> > > > > --- a/drivers/thermal/sun8i_thermal.c
> > > > > +++ b/drivers/thermal/sun8i_thermal.c
> > > > > @@ -287,8 +287,12 @@ static int sun8i_ths_calibrate(struct ths_device *tmdev)
> > > > >
> > > > > calcell = devm_nvmem_cell_get(dev, "calibration");
> > > > > if (IS_ERR(calcell)) {
> > > > > + dev_err(dev, "Failed to get calibration nvmem cell (%ld)\n",
> > > > > + PTR_ERR(calcell));
> > > > > +
> > > > > if (PTR_ERR(calcell) == -EPROBE_DEFER)
> > > > > return -EPROBE_DEFER;
> > > > > +
> > > >
> > > > The rest of the patch makes sense, but we should probably put the error
> > > > message after the EPROBE_DEFER return so that we don't print any extra
> > > > noise that isn't necessarily useful
> > >
> > > I thought about that, but in this case this would have helped, see my other
> > > e-mail. Though lack of "probe success" message may be enough for me, to
> > > debug the issue, I'm not sure the user will notice that a message is missing, while
> > > he'll surely notice if there's a flood of repeated EPROBE_DEFER messages.
> >
> > Yeah, but on the other hand, we regularly have people that come up and
> > ask if a "legitimate" EPROBE_DEFER error message (as in: the driver
> > wasn't there on the first attempt but was there on the second) is a
> > cause of concern or not.
>
> That's why I also added a success message, to distinguish this case.

That doesn't really help though. We have plenty of drivers that print
some sort of success message, and people still ask about the error
message that came earlier.

> > > And people run several distros for 3-4 months without anyone noticing any
> > > issues and that thermal regulation doesn't work. So it seems that lack of a
> > > success message is not enough.
> >
> > I understand what the issue is, but do you really expect phone users to
> > monitor the kernel logs every time they boot their phone to see if the
> > thermal throttling is enabled?
>
> Not phone users, but people making their own kernels/distributions. Those people
> monitor dmesg, and out of 4 distros or more nobody noticed there was an issue
> (despite the complaints of overheating by their users).
>
> So I thought some warning may be in order, so that distro people more easily
> notice they have misconfigured the kernel or something.

I mean, there's nothing we can do to properly address that.
The configuration system is a gun: we can point it at the target, but
anyone is free to shoot themselves in the foot.

You would get exactly the same result if you left the thermal driver
disabled, or if you didn't have cpufreq support.
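
The ordering suggested earlier for the calibration cell lookup (return on
-EPROBE_DEFER first, print only for other errors) can be sketched roughly as
below. This is a userspace illustration under stated assumptions, not the
driver code: log_err() and handle_cell_error() are hypothetical stand-ins for
dev_err() and the error path in sun8i_ths_calibrate(), and EPROBE_DEFER is
defined locally with the kernel's value of 517.

```c
#include <stdio.h>

/* Kernel-internal errno value, defined here only so the sketch builds in userspace. */
#define EPROBE_DEFER 517

/* Counts emitted messages so the behaviour can be checked. */
static int messages_logged;

/* Hypothetical stand-in for dev_err(): prints the message and the error code. */
static void log_err(const char *what, int err)
{
	fprintf(stderr, "%s (%d)\n", what, err);
	messages_logged++;
}

/*
 * Hypothetical stand-in for the error branch after the nvmem cell lookup.
 * A deferral is expected to resolve on a later probe attempt, so it is
 * returned without printing anything; every other error is logged.
 */
static int handle_cell_error(int err)
{
	if (err == -EPROBE_DEFER)
		return err;

	log_err("Failed to get calibration nvmem cell", err);
	return err;
}
```

With this ordering a deferred probe stays silent, while a genuine failure such
as -ENOENT still produces exactly one message.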
Maxime