lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <f13a6648-cd21-a166-6e56-431ac7ee5dec@collabora.com>
Date:   Wed, 13 Sep 2023 13:43:53 +0200
From:   AngeloGioacchino Del Regno 
        <angelogioacchino.delregno@...labora.com>
To:     frank-w@...lic-files.de, Frank Wunderlich <linux@...web.de>,
        linux-mediatek@...ts.infradead.org
Cc:     "Rafael J. Wysocki" <rafael@...nel.org>,
        Daniel Lezcano <daniel.lezcano@...aro.org>,
        Amit Kucheria <amitk@...nel.org>,
        Zhang Rui <rui.zhang@...el.com>,
        Rob Herring <robh+dt@...nel.org>,
        Krzysztof Kozlowski <krzysztof.kozlowski+dt@...aro.org>,
        Conor Dooley <conor+dt@...nel.org>,
        Matthias Brugger <matthias.bgg@...il.com>,
        linux-pm@...r.kernel.org, devicetree@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-arm-kernel@...ts.infradead.org,
        Daniel Golle <daniel@...rotopia.org>
Subject: Re: [RFC v1 3/3] thermal/drivers/mediatek/lvts_thermal: add mt7988
 support

Il 13/09/23 12:52, Frank Wunderlich ha scritto:
> Am 13. September 2023 10:16:51 MESZ schrieb AngeloGioacchino Del Regno <angelogioacchino.delregno@...labora.com>:
> Hi angelo,
> 
> thanks for first look
> 
>> Il 11/09/23 20:33, Frank Wunderlich ha scritto:
>>> From: Frank Wunderlich <frank-w@...lic-files.de>
>>>
>>> Add Support for mediatek fologic 880/MT7988.
>>>
>>> Signed-off-by: Frank Wunderlich <frank-w@...lic-files.de>
>>> ---
>>>    drivers/thermal/mediatek/lvts_thermal.c | 73 +++++++++++++++++++++++++
>>>    1 file changed, 73 insertions(+)
>>>
>>> diff --git a/drivers/thermal/mediatek/lvts_thermal.c b/drivers/thermal/mediatek/lvts_thermal.c
>>> index c1004b4da3b6..48b257a3c80e 100644
>>> --- a/drivers/thermal/mediatek/lvts_thermal.c
>>> +++ b/drivers/thermal/mediatek/lvts_thermal.c
>>> @@ -82,6 +82,8 @@
>>>    #define LVTS_GOLDEN_TEMP_DEFAULT	50
>>>    #define LVTS_COEFF_A_MT8195			-250460
>>>    #define LVTS_COEFF_B_MT8195			250460
>>> +#define LVTS_COEFF_A_MT7988			-204650
>>> +#define LVTS_COEFF_B_MT7988			204650
>>>      #define LVTS_MSR_IMMEDIATE_MODE		0
>>>    #define LVTS_MSR_FILTERED_MODE		1
>>> @@ -1272,6 +1274,67 @@ static int lvts_remove(struct platform_device *pdev)
>>>    	return 0;
>>>    }
>>>    +/*
>>> + * LVTS MT7988
>>> + */
>>> +#define LVTS_HW_SHUTDOWN_MT7988	117000
>>
>> Are you sure that this chip's Tj is >117°C ?!
>>
>> Looks a bit high... if it is exactly 117°C, I would suggest cutting earlier,
>> either at 110 (safe side) or 115: after all, this is a life-saver feature and
>> the chip is actually never meant to *constantly* work at 110°C (as it would
>> degrade fast and say goodbye earlier than "planned").
> 
> I took values from SDK
> 
> https://git01.mediatek.com/plugins/gitiles/openwrt/feeds/mtk-openwrt-feeds/+/refs/heads/master/target/linux/mediatek/files-5.4/drivers/thermal/mediatek/soc_temp_lvts.c#1483
> 

That kernel also defines 117°C for MT8195, which leaves me a bit dubious.

For safety, I would recommend using the same 105°C hw shutdown temperature
for 7988 as well: after all if you think about it, reaching that kind of
temperature means that there's a real problem (fan stopped working and/or
heatsink detached) that needs attention.

This is because you'll have trip points in devicetree that should throttle
the CPU in case it reaches at least a dangerously high temperature (read:
a temperature outside the recommended continuous operation range), bringing
the temperatures down because of the throttling action; I would imagine
throttling the CPU a bit down at 80°C, then a bit more at 90°C - but then,
if the temps won't drop like that, there's clearly a HW-related issue that
must be addressed (like the fan/heatsink scenario that I just described).

Though, take this as an advice; I'll respect whichever decision you'll take.

>>> +//enum mt7988_lvts_domain { MT7988_AP_DOMAIN, MT7988_NUM_DOMAIN };
>>> +
>>> +enum mt7988_lvts_sensor_enum {
>>> +	MT7988_TS3_0,
>>> +	MT7988_TS3_1,
>>> +	MT7988_TS3_2,
>>> +	MT7988_TS3_3,
>>> +	MT7988_TS4_0,
>>> +	MT7988_TS4_1,
>>> +	MT7988_TS4_2,
>>> +	MT7988_TS4_3,
>>> +	MT7988_NUM_TS
>>> +};
> 
>> This enumeration should be definitions in bindings (mediatek,lvts-thermal.h).
>>
>> Besides, the LVTS is about internal temperatures, so those TS3_x and 4_x can
>> be renamed like what was done for MT8192 and MT8195: this is because you will
>> never see TS3_2 being CPU2 on a board and CPU4 on another, being those - again -
>> internal to the SoC, hence unchangeable.
> 
> Right these sensors are internally only and i took naming from sdk to avoid confusion. And i have not more information about these internal sensors (special meaning),but their values are packed together to get the resulting (average) temperature.
> 
>> Another reason is that you'll anyway have to refer to those sensors in the
>> devicetree to configure thermal trips and such, so... :-)
> 
> In device tree it will look like this:
> 
> https://github.com/frank-w/BPI-Router-Linux/blob/6.5-lvts/arch/arm64/boot/dts/mediatek/mt7988a.dtsi#L771
> 
> Daniel has also defined thermal trips there,but these are untested atm. I only verified temperature itself i get from sysfs as far as i can (start at ~40°C and reaching ~70 while running).
> 

Check how it's done in mt8192.dtsi and mt8195.dtsi: we're using the definitions
from the bindings for thermal zones.
At least for consistency (apart from other even more valid considerations), that's
how it should be done: please do it like that.

Besides, I think that you can definitely guess what `TS` is CPU related: checking
the driver and devicetree from the downstream kernel that you provided, what I
understand is:

1. Your naming TS3,4 corresponds to TS2,3 downstream (so it's wrong?)
2. Downstream TS2 is related to the CPU cores, so it should be
    - TS2_0 - CPU0
    - TS2_1 - CPU1
    - TS2_2 - CPU2
    - TS2_3 - CPU3

The other set of thermal sensors seem to be completely unused, so we cannot guess
just by looking at the code. Since you have the hardware, you may be able to take
a position about what they are by checking what sensor heats up for each "action"
(could be ethernet controller or infra or whatever else); if in doubt, just leave
them out, but add a comment saying that there are more sensors and say where, so
that if anyone finds what they're for, it'll be easy to add them.

In any case, adding thermal sensors that you don't know about is meaningless, as
the sense of all this is:
  - Monitoring temperatures of the hardware
  - Taking action to prevent temperature overrun
but if you don't know what you're reading, you can't interpret temperatures for
something unknown, hence you can't as well take action to prevent overruns, as
you won't know what's the maximum temperature for "unknown thing" :-)

Regards,
Angelo

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ