linux-kernel - Re: [PATCH v6 4/6] arm64: dts: mediatek: mt8186: add default thermal zones

Message-ID: <940eec49-91d8-4d38-a3d8-e1b7e090b905@baylibre.com>
Date: Mon, 3 Jun 2024 09:58:38 +0200
From: Julien Panis <jpanis@...libre.com>
To: AngeloGioacchino Del Regno <angelogioacchino.delregno@...labora.com>,
 Chen-Yu Tsai <wenst@...omium.org>
Cc: Rob Herring <robh@...nel.org>, Krzysztof Kozlowski <krzk+dt@...nel.org>,
 Conor Dooley <conor+dt@...nel.org>, Matthias Brugger
 <matthias.bgg@...il.com>, Daniel Lezcano <daniel.lezcano@...aro.org>,
 Nicolas Pitre <npitre@...libre.com>, "Rafael J. Wysocki"
 <rafael@...nel.org>, Zhang Rui <rui.zhang@...el.com>,
 Lukasz Luba <lukasz.luba@....com>, devicetree@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
 linux-mediatek@...ts.infradead.org, linux-pm@...r.kernel.org,
 Krzysztof Kozlowski <krzk@...nel.org>
Subject: Re: [PATCH v6 4/6] arm64: dts: mediatek: mt8186: add default thermal
 zones

On 5/29/24 14:06, AngeloGioacchino Del Regno wrote:
> Il 29/05/24 11:12, Julien Panis ha scritto:
>> On 5/29/24 10:33, Chen-Yu Tsai wrote:
>>> On Wed, May 29, 2024 at 4:17 PM AngeloGioacchino Del Regno
>>> <angelogioacchino.delregno@...labora.com> wrote:
>>>> Il 29/05/24 07:57, Julien Panis ha scritto:
>>>>> From: Nicolas Pitre <npitre@...libre.com>
>>>>>
>>>>> Inspired by the vendor kernel but adapted to the upstream thermal
>>>>> driver version.
>>>>>
>>>>> Signed-off-by: Nicolas Pitre <npitre@...libre.com>
>>>>> Signed-off-by: Julien Panis <jpanis@...libre.com>
>>>> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@...labora.com>
>>> I'm getting some crazy readings which would cause the machine to
>>> immediately shut down during boot. Anyone else see this? Or maybe
>>> my device has bad calibration data?
>>>
>>> gpu_thermal-virtual-0
>>> Adapter: Virtual device
>>> temp1:       +229.7 C
>>>
>>> nna_thermal-virtual-0
>>> Adapter: Virtual device
>>> temp1:       +229.7 C
>>>
>>> cpu_big0_thermal-virtual-0
>>> Adapter: Virtual device
>>> temp1:         -7.2 C
>>>
>>> cpu_little2_thermal-virtual-0
>>> Adapter: Virtual device
>>> temp1:       +157.2 C
>>>
>>> cpu_little0_thermal-virtual-0
>>> Adapter: Virtual device
>>> temp1:       -277.1 C
>>>
>>> adsp_thermal-virtual-0
>>> Adapter: Virtual device
>>> temp1:       +229.7 C
>>>
>>> cpu_big1_thermal-virtual-0
>>> Adapter: Virtual device
>>> temp1:       +229.7 C
>>>
>>> cam_thermal-virtual-0
>>> Adapter: Virtual device
>>> temp1:        +45.4 C
>>>
>>> cpu_little1_thermal-virtual-0
>>> Adapter: Virtual device
>>> temp1:       -241.8 C
>>
>> It's likely that your device has bad calibration data indeed. We observed the same
>> behavior on the mt8186 device we used (a Corsola) and finally realized that the
>> golden temperature was 0 (device not properly calibrated).
>>
>> For comparison, we ran chromiumos v5.15 and the dmesg output was:
>> 'This sample is not calibrated, fake !!'
>> Additional debugging revealed that the golden temp was actually 0. As a result,
>> chromiumos v5.15 does not use the calibration data. It uses some default values
>> instead. That's why you can observe good temperatures with chromiumos v5.15
>> even with a device that is not calibrated.
>>
>> This feature is not implemented in the upstream driver, so you need a properly
>> calibrated device to get good temperatures with it. When we forced this
>> driver to use the default values from chromiumos v5.15 instead of the real calib
>> data (temporarily, just for testing), the temperatures were good.
>>
>> Please make sure your device is properly calibrated: 0 < golden temp < 62.
>>
>
> Wait wait wait wait.
>
> What's up with that calibration data stuff?
>
> If there's any device that cannot use the calibration data, we need a way to
> recognize whether the provided data (read from efuse, of course) is valid,
> otherwise we're creating an important regression here.
>
> "This device is unlucky" is not a good reason to have this kind of regression.
>
> Since - as far as I understand - downstream can recognize that, upstream should
> do the same.
> I'd be okay with refusing to even probe this driver on such devices for the
> time being, as those are things that could eventually be handled in a follow-up
> series, even though I would prefer a kind of on-the-fly calibration or
> anyway something that would still let the unlucky ones actually have good
> readings *right now*.
>
> Though, the fact that you assert that you observed this behavior on one of your
> devices and *still decided to send that upstream* is, in my opinion, unacceptable.
>
> Regards,
> Angelo

I've been trying to find more information about the criterion for
"device calibrated vs. device not calibrated", because there's a
confusing comment in the downstream code (the comment does not
match what I observe on my device). I'll send a separate patch
to add this feature over the next few days, once I get additional
information from MTK about this criterion.
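
For illustration only, here is a minimal sketch of the kind of check being
discussed, assuming the criterion turns out to be the range quoted earlier in
this thread (0 < golden temp < 62). The function names and the fallback value
are hypothetical: they are not taken from the upstream driver or from
chromiumos v5.15, and the real criterion may differ once MTK confirms it.

```c
#include <stdbool.h>

/* Range quoted in this thread: calibrated means 0 < golden temp < 62. */
#define GOLDEN_TEMP_MIN_EXCL	0
#define GOLDEN_TEMP_MAX_EXCL	62

/*
 * Hypothetical placeholder. This is NOT the default used by
 * chromiumos v5.15; the real value must come from the vendor.
 */
#define GOLDEN_TEMP_FALLBACK	50

/* Return true if the golden temperature read from efuse looks calibrated. */
bool golden_temp_is_valid(int golden_temp)
{
	return golden_temp > GOLDEN_TEMP_MIN_EXCL &&
	       golden_temp < GOLDEN_TEMP_MAX_EXCL;
}

/* Use the efuse value when it is valid, otherwise the fallback default. */
int golden_temp_or_fallback(int golden_temp)
{
	if (golden_temp_is_valid(golden_temp))
		return golden_temp;

	return GOLDEN_TEMP_FALLBACK;
}
```

With a check like this, a device whose golden temp reads 0 would take the
fallback path instead of producing bogus readings. Refusing to probe on
invalid data, as suggested above, would use the same predicate and return
an error instead of substituting a default.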