linux-kernel - Re: [PATCH 0/2] arm64: dts: qcom: sm8650: rework CPU & GPU thermal zones

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ec361e16-4af0-49bc-a7ca-8d8caa3dc332@linaro.org>
Date: Fri, 3 Jan 2025 15:49:39 +0100
From: Neil Armstrong <neil.armstrong@...aro.org>
To: Konrad Dybcio <konrad.dybcio@....qualcomm.com>,
 Bjorn Andersson <andersson@...nel.org>,
 Konrad Dybcio <konradybcio@...nel.org>, Rob Herring <robh@...nel.org>,
 Krzysztof Kozlowski <krzk+dt@...nel.org>, Conor Dooley <conor+dt@...nel.org>
Cc: linux-arm-msm@...r.kernel.org, devicetree@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/2] arm64: dts: qcom: sm8650: rework CPU & GPU thermal
 zones

On 03/01/2025 15:43, Konrad Dybcio wrote:
> On 3.01.2025 3:38 PM, Neil Armstrong wrote:
>> On the SM8650 platform, the dynamic clock and voltage scaling (DCVS) for
>> the CPUs and GPU is handled by hardware & firmware using factory and
>> form-factor determined parameters in order to maximize frequency while
>> keeping the temperature way below the junction temperature where the SoC
>> would experience a thermal shutdown if not permanent damages.
>>
>> On the other side, the High Level Ooperating System (HLOS), like Linux,
>> is able to adjust the CPU and GPU frequency using the internal SoC
>> temperature sensors (here tsens) and it's UP/LOW interrupts, but it
>> effectly does the same work twice in an less effective manner.
>>
>> Let's take the Hardware & Firmware action in account and design the
>> thermal zones trip points and cooling devices mapping to use the HLOS
>> as a safety warant in case the platform experiences a temperature surge
>> to helpfully avoid a thermal shutdown and handle the scenario gracefully.
>>
>> On the CPU side, the LMh hardware does the DCVS control loop, so
>> let's set higher trip points temperatures closer to the junction
>> and thermal shutdown temperatures and add some idle injection cooling
>> device with 100% duty cycle for each CPU that would act as emergency
>> action to avoid the thermal shutdown.
>>
>> On the GPU side, the GPU Management Unit (GMU) acts as the DCVS
>> control loop, but since we can't perform idle injection, let's
>> also set higher trip points temperatures closer to the junction
>> and thermal shutdown temperatures to reduce the GPU frequency only
>> as an emergency action before the thermal shutdown.
>>
>> Those 2 changes optimizes the thermal management design by avoiding
>> concurrent thermal management, calculations & avoidable interrupts
>> by moving the HLOS management to a last resort emergency if the
>> Hardware & Firmwares fails to avoid a thermal shutdown.
>>
>> Signed-off-by: Neil Armstrong <neil.armstrong@...aro.org>
>> ---
> 
> Got any numbers to back this?

To back which part ? Yes I've been running loads with difference
scenarios and effectively the hardware work is much better with
a more linear correction and slighly better performances because
it sets slighly higger OPPs while maintaining the core closer to
the target temperature range. Which is kind of expected.

I don't have easy numbers to share, sorry...

So yes I consider avoiding the concurrent effort is better, but
since we also take the firmware design in account in the whole platform
representation in DT (DSPs, SCM, GMU, ...) we should also extend this
to thermal.

Neil

> 
> Konrad