linux-kernel - [PATCH v1 0/2] thermal: core: Handle failed temperature checks more carefully

Message-ID: <2348857.ElGaqSPkdT@rjwysocki.net>
Date: Thu, 18 Jul 2024 20:57:28 +0200
From: "Rafael J. Wysocki" <rjw@...ysocki.net>
To: Linux PM <linux-pm@...r.kernel.org>
Cc: LKML <linux-kernel@...r.kernel.org>, Lukasz Luba <lukasz.luba@....com>,
 Daniel Lezcano <daniel.lezcano@...aro.org>,
 Neil Armstrong <neil.armstrong@...aro.org>
Subject: [PATCH v1 0/2] thermal: core: Handle failed temperature checks more carefully

Hi Everyone,

This series kind of augments

https://lore.kernel.org/linux-pm/4950004.31r3eYUQgx@rjwysocki.net/

so I'm considering adding it to 6.11.

The problem with handling temperature check errors in __thermal_zone_device_update()
after the above is that if someone has a dead thermal zone returning such errors
continuously lurking somewhere in their system, they will get a flood of
"temperature check failed" messages in the log which will be reported as a
regression.  Rightfully so, because these messages render the kernel log
practically unusable and the continuous and useless polling of such a thermal
zone may even prevent the system from entering deep idle states.  Clearly,
something needs to be done about this.

One possible approach might be to simply disable the thermal zone in question
after the first error (that is not -EAGAIN) returned by its .get_temp()
callback, but that cannot be done because there are thermal zones in which
.get_temp() returns errors to start with, but they recover later, and they
need to be taken into account.

So the only remaining alternative that is not overly complicated is to add a
back-off mechanism to the polling, so the thermal zone has a chance to recover,
but the core will not wait for that forever.  At some point it will just disable
the thermal zone and let user space re-enable it if that's regarded as a good
idea.  This is done in the second patch and the first patch is preparatory.
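
To illustrate the back-off idea, here is a minimal user-space sketch.  It is
not the code from the patches: struct zone_sketch, zone_handle_temp_result()
and both delay constants are hypothetical names and values picked for the
example, and doubling the delay is just one possible back-off policy.  It also
assumes that -EAGAIN is treated as transient and does not count toward the
back-off.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limits, chosen only for illustration. */
#define RECHECK_DELAY_MIN_MS	250U
#define RECHECK_DELAY_MAX_MS	(120U * 1000U)

/* Hypothetical stand-in for the per-zone state; not the kernel structure. */
struct zone_sketch {
	bool enabled;
	unsigned int recheck_delay_ms;	/* 0 means regular polling */
};

/*
 * Process the result of one temperature check.
 *
 * Returns the delay (in ms) before the next re-check, or 0 if the zone
 * is back to regular polling or has been disabled.
 */
unsigned int zone_handle_temp_result(struct zone_sketch *z, int err)
{
	assert(z != NULL);

	if (!z->enabled)
		return 0;

	if (err == 0) {
		/* The zone works (again): drop any accumulated back-off. */
		z->recheck_delay_ms = 0;
		return 0;
	}

	/* Assumption: -EAGAIN is transient and leaves the back-off alone. */
	if (err == -EAGAIN)
		return RECHECK_DELAY_MIN_MS;

	/* Any other error: start at the minimum delay, then double it. */
	if (z->recheck_delay_ms == 0)
		z->recheck_delay_ms = RECHECK_DELAY_MIN_MS;
	else
		z->recheck_delay_ms *= 2;

	if (z->recheck_delay_ms > RECHECK_DELAY_MAX_MS) {
		/* Give up; re-enabling the zone is left to user space. */
		z->enabled = false;
		z->recheck_delay_ms = 0;
		return 0;
	}

	return z->recheck_delay_ms;
}
```

With these example values, a zone that keeps failing is re-checked after 250 ms,
500 ms, 1 s and so on, and gets disabled on the tenth consecutive error, while
a zone that fails initially and recovers later simply goes back to regular
polling.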

Thanks!