Message-ID: <8d91a3c1-018f-495b-83be-979b795b5548@linaro.org>
Date: Wed, 3 Jul 2024 14:25:28 +0200
From: Daniel Lezcano <daniel.lezcano@...aro.org>
To: neil.armstrong@...aro.org, "Rafael J. Wysocki" <rjw@...ysocki.net>,
 Linux PM <linux-pm@...r.kernel.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Lukasz Luba <lukasz.luba@....com>,
 Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>,
 Zhang Rui <rui.zhang@...el.com>,
 linux-arm-msm <linux-arm-msm@...r.kernel.org>
Subject: Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone
 temperature is invalid


Hi Neil,

It seems there is actually something wrong with the driver.

There can be a period during which the sensor is not yet initialized, for
various reasons, so reading the temperature fails. The routine will
simply retry until the sensor is ready.

These repeated errors suggest to me that the sensor for this specific
thermal zone never becomes ready, which may be the root cause of your
issue. IMO the change is just exposing that problem.
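
To make the reasoning concrete, here is a rough user-space sketch of the
retry behaviour. It is not kernel code: struct fake_zone, fake_get_temp(),
next_delay_ms(), fake_update() and count_failed_reads() are hypothetical
names invented for the illustration, and delays are kept in milliseconds
instead of jiffies. It only assumes what the quoted patch shows, namely
that an invalid temperature causes a recheck after
THERMAL_RECHECK_DELAY_MS when no other polling is configured.

```c
/* Illustrative stand-ins, not the kernel definitions. */
#define TEMP_INVALID      (-274000) /* plays the role of THERMAL_TEMP_INVALID */
#define RECHECK_DELAY_MS  100       /* mirrors THERMAL_RECHECK_DELAY_MS in the patch */
#define ERR_NODEV         (-19)     /* the error code seen in the reported log */

struct fake_zone {
    int temperature;               /* last valid reading, or TEMP_INVALID */
    unsigned int passive_delay_ms;
    unsigned int polling_delay_ms;
    int passive;                   /* > 0 while passive cooling is active */
    int ready_after;               /* reads that fail before the sensor works; < 0 = never ready */
    int reads;                     /* number of read attempts so far */
};

/* Hypothetical sensor read: fails until the sensor is "initialized". */
static int fake_get_temp(struct fake_zone *z, int *temp)
{
    z->reads++;
    if (z->ready_after < 0 || z->reads <= z->ready_after)
        return ERR_NODEV;
    *temp = 45000;
    return 0;
}

/* Same decision order as the patched monitor_thermal_zone() hunk. */
static unsigned int next_delay_ms(const struct fake_zone *z)
{
    if (z->passive > 0)
        return z->passive_delay_ms;
    if (z->polling_delay_ms)
        return z->polling_delay_ms;
    if (z->temperature == TEMP_INVALID)
        return RECHECK_DELAY_MS;
    return 0; /* nothing scheduled */
}

/* One update pass: try to read, then decide when to run again. */
static unsigned int fake_update(struct fake_zone *z)
{
    int temp;

    /* On failure the stored temperature is left untouched. */
    if (fake_get_temp(z, &temp) == 0)
        z->temperature = temp;

    return next_delay_ms(z);
}

/* Run up to max_updates passes and count those that ended without a valid temperature. */
static int count_failed_reads(struct fake_zone *z, int max_updates)
{
    int failures = 0;

    for (int i = 0; i < max_updates; i++) {
        unsigned int delay = fake_update(z);

        if (z->temperature == TEMP_INVALID)
            failures++;
        if (delay == 0)
            break;
    }
    return failures;
}
```

With ready_after = 3 and no polling delays, the sketch fails three times
and then stops rescheduling once a valid temperature is stored. With
ready_after < 0 it fails on every pass, which is the pattern the repeated
"failed to read out thermal zone (-19)" messages point to.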


On 03/07/2024 12:54, Neil Armstrong wrote:
> Hi,
> 
> On 28/06/2024 14:10, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>>
>> Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
>> if zone temperature is invalid") caused __thermal_zone_device_update()
>> to return early if the current thermal zone temperature was invalid.
>>
>> This was done to avoid running handle_thermal_trip() and governor
>> callbacks in that case which led to confusion.  However, it went too
>> far because monitor_thermal_zone() still needs to be called even when
>> the zone temperature is invalid to ensure that it will be updated
>> eventually in case thermal polling is enabled and the driver has no
>> other means to notify the core of zone temperature changes (for example,
>> it does not register an interrupt handler or ACPI notifier).
>>
>> Also if the .set_trips() zone callback is expected to set up monitoring
>> interrupts for a thermal zone, it has to be provided with valid
>> boundaries and that can only happen if the zone temperature is known.
>>
>> Accordingly, to ensure that __thermal_zone_device_update() will
>> run again after a failing zone temperature check, make it call
>> monitor_thermal_zone() regardless of whether or not the zone
>> temperature is valid and make the latter schedule a thermal zone
>> temperature update if the zone temperature is invalid even if
>> polling is not enabled for the thermal zone.
>>
>> Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() 
>> if zone temperature is invalid")
>> Reported-by: Daniel Lezcano <daniel.lezcano@...aro.org>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
>> ---
>>   drivers/thermal/thermal_core.c |    5 ++++-
>>   drivers/thermal/thermal_core.h |    6 ++++++
>>   2 files changed, 10 insertions(+), 1 deletion(-)
>>
>> Index: linux-pm/drivers/thermal/thermal_core.c
>> ===================================================================
>> --- linux-pm.orig/drivers/thermal/thermal_core.c
>> +++ linux-pm/drivers/thermal/thermal_core.c
>> @@ -300,6 +300,8 @@ static void monitor_thermal_zone(struct
>>           thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
>>       else if (tz->polling_delay_jiffies)
>>           thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
>> +    else if (tz->temperature == THERMAL_TEMP_INVALID)
>> +        thermal_zone_device_set_polling(tz, msecs_to_jiffies(THERMAL_RECHECK_DELAY_MS));
>>   }
>>   static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
>> @@ -514,7 +516,7 @@ void __thermal_zone_device_update(struct
>>       update_temperature(tz);
>>       if (tz->temperature == THERMAL_TEMP_INVALID)
>> -        return;
>> +        goto monitor;
>>       tz->notify_event = event;
>> @@ -536,6 +538,7 @@ void __thermal_zone_device_update(struct
>>       thermal_debug_update_trip_stats(tz);
>> +monitor:
>>       monitor_thermal_zone(tz);
>>   }
>> Index: linux-pm/drivers/thermal/thermal_core.h
>> ===================================================================
>> --- linux-pm.orig/drivers/thermal/thermal_core.h
>> +++ linux-pm/drivers/thermal/thermal_core.h
>> @@ -133,6 +133,12 @@ struct thermal_zone_device {
>>       struct thermal_trip_desc trips[] __counted_by(num_trips);
>>   };
>> +/*
>> + * Default delay after a failing thermal zone temperature check before
>> + * attempting to check it again.
>> + */
>> +#define THERMAL_RECHECK_DELAY_MS    100
>> +
>>   /* Default Thermal Governor */
>>   #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
>>   #define DEFAULT_THERMAL_GOVERNOR       "step_wise"
>>
>>
>>
>>
> 
> This patch on next-20240702 makes Qualcomm HDK8350, HDK8450, QRD8550, 
> HDK8560, QRD8650 & HDK8650 output the following in a loop:
> 
> thermal thermal_zoneXX: failed to read out thermal zone (-19)
> 
> Boot logs on ARM64 defconfig:
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152439#L1393
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152440#L2200
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152442#L2828
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152441#L1862
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152443#L1776
> https://git.codelinaro.org/linaro/qcomlt/ci/staging/cdba-tester/-/jobs/152444#L1723
> 
> Result of git bisect:
> # bad: [82e4255305c554b0bb18b7ccf2db86041b4c8b6e] Add linux-next 
> specific files for 20240702
> # good: [22a40d14b572deb80c0648557f4bd502d7e83826] Linux 6.10-rc6
> git bisect start 'FETCH_HEAD' 'v6.10-rc6'
> # bad: [f6dfcf0e9567b57b93f2564966d9177f0d8dbe05] Merge branch 'master' 
> of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git
> git bisect bad f6dfcf0e9567b57b93f2564966d9177f0d8dbe05
> # good: [7f86ae0c2dc19fea7be1da29b2bf03f085463ae7] Merge branch 
> 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git
> git bisect good 7f86ae0c2dc19fea7be1da29b2bf03f085463ae7
> # bad: [077d5bbd75dd12af2096c96846ffc78ab5dd65b1] Merge branch 
> 'devfreq-next' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux.git
> git bisect bad 077d5bbd75dd12af2096c96846ffc78ab5dd65b1
> # good: [271bcaf753d0afe2bd0386ab1e98132ee65b61ca] Merge branch 
> 'for-next' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux.git
> git bisect good 271bcaf753d0afe2bd0386ab1e98132ee65b61ca
> # good: [9758a2ee5316a6f8736ab4fd39a6f6176aa057ec] Merge branch 
> 'hwmon-next' of 
> git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging.git
> git bisect good 9758a2ee5316a6f8736ab4fd39a6f6176aa057ec
> # good: [e6bd69ea345045520bd63487b85a4b5676aff76b] Merge branch 'master' 
> of git://linuxtv.org/mchehab/media-next.git
> git bisect good e6bd69ea345045520bd63487b85a4b5676aff76b
> # good: [46398edfb36e2882be5e86ea563b2db9138ae499] Merge branches 
> 'pm-cpuidle' and 'pm-powercap' into linux-next
> git bisect good 46398edfb36e2882be5e86ea563b2db9138ae499
> # good: [d3927cbc52eed166f74ea7e031ed6384cc3d4d5f] Merge branch 
> 'thermal-intel' into linux-next
> git bisect good d3927cbc52eed166f74ea7e031ed6384cc3d4d5f
> # good: [ce84b7beeb524e7b20983838687862454ba54df7] cpufreq: sti: add 
> missing MODULE_DEVICE_TABLE entry for stih418
> git bisect good ce84b7beeb524e7b20983838687862454ba54df7
> # bad: [fcf61315d38d41f4e55856b179f9e5538e299ef4] Merge branch 
> 'thermal-fixes' into linux-next
> git bisect bad fcf61315d38d41f4e55856b179f9e5538e299ef4
> # good: [4262b8d782a74c7cf7b8b94ed9e4fcb94e856d1e] dt-bindings: thermal: 
> mediatek: Fix thermal zone definition for MT8186
> git bisect good 4262b8d782a74c7cf7b8b94ed9e4fcb94e856d1e
> # good: [7eeb114a635a04bea2fa7d57cedbf374c714d29e] dt-bindings: thermal: 
> convert hisilicon-thermal.txt to dt-schema
> git bisect good 7eeb114a635a04bea2fa7d57cedbf374c714d29e
> # good: [107ac0d49ae6a86b4986146b9a612294f7e34406] Merge branch 
> 'thermal/linux-next' of 
> ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/thermal/linux into 
> linux-next
> git bisect good 107ac0d49ae6a86b4986146b9a612294f7e34406
> # bad: [5725f40698b9ba7f84fbfee25b9059ba044c4b86] thermal: core: Call 
> monitor_thermal_zone() if zone temperature is invalid
> git bisect bad 5725f40698b9ba7f84fbfee25b9059ba044c4b86
> # first bad commit: [5725f40698b9ba7f84fbfee25b9059ba044c4b86] thermal: 
> core: Call monitor_thermal_zone() if zone temperature is invalid
> 
> #regzbot introduced: 5725f40698b9ba7f84fbfee25b9059ba044c4b86
> 
> Thanks,
> Neil

-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs