lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <4a068a17-220d-f94e-e8a2-c62d61763922@codeaurora.org>
Date:   Fri, 16 Oct 2020 19:16:52 +0530
From:   Akhil P Oommen <akhilpo@...eaurora.org>
To:     Matthias Kaehlcke <mka@...omium.org>, manafm@...eaurora.org
Cc:     Doug Anderson <dianders@...omium.org>,
        linux-arm-msm <linux-arm-msm@...r.kernel.org>,
        Amit Kucheria <amitk@...nel.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Rajeshwari <rkambl@...eaurora.org>, dri-devel@...edesktop.org,
        freedreno <freedreno@...ts.freedesktop.org>
Subject: Re: [PATCH 1/2] arm64: dts: qcom: sc7180: Add gpu cooling support

On 10/16/2020 3:49 AM, Matthias Kaehlcke wrote:
> Hi,
> 
> On Thu, Oct 15, 2020 at 12:07:01AM +0530, manafm@...eaurora.org wrote:
>> On 2020-10-14 18:59, Akhil P Oommen wrote:
>>> On 10/9/2020 10:27 PM, Matthias Kaehlcke wrote:
>>>> On Fri, Oct 09, 2020 at 08:05:10AM -0700, Doug Anderson wrote:
>>>>> Hi,
>>>>>
>>>>> On Thu, Oct 8, 2020 at 10:10 AM Akhil P Oommen
>>>>> <akhilpo@...eaurora.org> wrote:
>>>>>>
>>>>>> Add cooling-cells property and the cooling maps for the gpu tzones
>>>>>> to support GPU cooling.
>>>>>>
>>>>>> Signed-off-by: Akhil P Oommen <akhilpo@...eaurora.org>
>>>>>> ---
>>>>>>    arch/arm64/boot/dts/qcom/sc7180.dtsi | 29
>>>>>> ++++++++++++++++++++++-------
>>>>>>    1 file changed, 22 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/boot/dts/qcom/sc7180.dtsi
>>>>>> b/arch/arm64/boot/dts/qcom/sc7180.dtsi
>>>>>> index d46b383..40d6a28 100644
>>>>>> --- a/arch/arm64/boot/dts/qcom/sc7180.dtsi
>>>>>> +++ b/arch/arm64/boot/dts/qcom/sc7180.dtsi
>>>>>> @@ -2,7 +2,7 @@
>>>>>>    /*
>>>>>>     * SC7180 SoC device tree source
>>>>>>     *
>>>>>> - * Copyright (c) 2019, The Linux Foundation. All rights reserved.
>>>>>> + * Copyright (c) 2019-20, The Linux Foundation. All rights
>>>>>> reserved.
>>>>>>     */
>>>>>>
>>>>>>    #include <dt-bindings/clock/qcom,dispcc-sc7180.h>
>>>>>> @@ -1885,6 +1885,7 @@
>>>>>>                           iommus = <&adreno_smmu 0>;
>>>>>>                           operating-points-v2 = <&gpu_opp_table>;
>>>>>>                           qcom,gmu = <&gmu>;
>>>>>> +                       #cooling-cells = <2>;
>>>>>
>>>>> Presumably we should add this to the devicetree bindings, too?
>>> Yes, thanks for catching this. Will update in the next patch.
>>>
>>>>>
>>>>>
>>>>>>                           interconnects = <&gem_noc
>>>>>> MASTER_GFX3D &mc_virt SLAVE_EBI1>;
>>>>>>                           interconnect-names = "gfx-mem";
>>>>>> @@ -3825,16 +3826,16 @@
>>>>>>                   };
>>>>>>
>>>>>>                   gpuss0-thermal {
>>>>>> -                       polling-delay-passive = <0>;
>>>>>> +                       polling-delay-passive = <100>;
>>>>>
>>>>> Why did you make this change?  I'm pretty sure that we _don't_ want
>>>>> this since we're using interrupts for the thermal sensor.  See commit
>>>>> 22337b91022d ("arm64: dts: qcom: sc7180: Changed polling mode in
>>>>> Thermal-zones node").
>>>>
>>>> I was going to ask the same, this shouldn't be needed.
>> As per our understanding unlike "polling-delay",  this delay property is
>> intended to activate polling thread on post trip threshold violation and  it
>> is irrespective of sensor is capable for trip interrupt or not.
>> This polling is more of governor related. Below are the few references from
>> Documentation/code which tells polling-delay-passive is needed for IPA for
>> better IPA performance.
>>
>> As per Power allocator documentations
>>
>> 1. https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/driver-api/thermal/power_allocator.rst?h=v5.4.71#n264
>>
>> "The power allocator governor's PID controller works best if there is a
>> periodic tick.  If you have a driver that calls
>> `thermal_zone_device_update()` (or anything that ends up calling the
>> governor's `throttle()` function) repetitively, the governor response
>> won't be very good.  Note that this is not particular to this
>> governor, step-wise will also misbehave if you call its throttle()
>> faster than the normal thermal framework tick (due to interrupts for
>> example) as it will overreact"
>>
>> 2. In Power allocator code, when  switch_on/control trip temp violation, it
>> is enabling passive counter to activate passive polling @ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/thermal/power_allocator.c?h=v5.4.71#n634
>>
>> 3. while calculating derivative term, it is using passive_delay @
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/thermal/power_allocator.c?h=v5.4.71#n243
>>
>> 4. Sensor interrupt will work if temperature is fluctuating between
>> trip_temp and hysteresis. But say a case where we are not enabling
>> polling-delay-passive. In this case if  current temperature > control_temp
>> trip(2nd passive trip) and
>>   temperature trend is still raising, then sensor high trip will be disabled
>> (OR configured for critical trip threshold). No more trip interrupt from
>> sensor until it reaches critical trip or falls below control_temp
>> hysteresis.
>>   How  the governor re-evaluate its next mitigation without passive polling
>> thread  here ?
>>
>> I think the same is required for CPU thermal zone as well.
> 
> Thanks for the explication and pointers!
> 
> I ran some tests to re-confirm. For that I lowered the trip point temperatures
> of CPU6 to 60/70, to make it easier to trigger throttling without necessarily
> affecting the other CPUs. Further I enabled tracing for the events 'thermal_temperature',
> 'thermal_zone_trip' and 'thermal_power_allocator'. With that I ran a CPU
> intensive task on CPU6.
> 
> Without polling-delay the trace log looks like this:
> 
>    irq/40-c263000.-157   [000] ....    48.035986: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=57800 temp=60000
>    irq/40-c263000.-157   [000] ....    48.036029: thermal_power_allocator_pid: thermal_zone_id=6 err=10000 err_integral=0 p=2402 i=0 d=0 output=1776
>    irq/40-c263000.-157   [000] ....    48.036036: thermal_power_allocator: thermal_zone_id=6 req_power={{0x96}} total_req_power=150 granted_power={{0x6f0}} total_granted_power=1776 power_range=1776 max_allocatable_power=1776 current_temperature=60000 delta_temperature=10000
>    irq/40-c263000.-157   [000] ....    52.480888: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=60000 temp=70000
>    irq/40-c263000.-157   [000] ....    52.480925: thermal_power_allocator_pid: thermal_zone_id=6 err=0 err_integral=0 p=0 i=0 d=0 output=1202
>    irq/40-c263000.-157   [000] ....    52.480931: thermal_power_allocator: thermal_zone_id=6 req_power={{0x45e}} total_req_power=1118 granted_power={{0x4b2}} total_granted_power=1202 power_range=1202 max_allocatable_power=1776 current_temperature=70000 delta_temperature=0
> 
> i.e. power_allocator only acts on the sensor interrupts at the trip points
> 
> It looks different with a non-zero value for 'polling-delay-passive':
> 
>    irq/40-c263000.-156   [000] ....   104.501777: thermal_power_allocator: thermal_zone_id=6 req_power={{0x331}} total_req_power=817 granted_power={{0x591}} total_granted_power=1425 power_range=1425 max_allocatable_power=1776 current_temperature=67800 delta_temperature=2200
>    irq/40-c263000.-156   [000] ....   104.523073: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=67800 temp=70000
>    irq/40-c263000.-156   [000] ....   104.523121: thermal_power_allocator_pid: thermal_zone_id=6 err=0 err_integral=-31200 p=0 i=-305 d=0 output=897
>    irq/40-c263000.-156   [000] ....   104.523148: thermal_power_allocator: thermal_zone_id=6 req_power={{0x406}} total_req_power=1030 granted_power={{0x381}} total_granted_power=897 power_range=897 max_allocatable_power=1776 current_temperature=70000 delta_temperature=0
>    irq/40-c263000.-156   [000] ....   104.608566: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=70000 temp=67800
>    irq/40-c263000.-156   [000] ....   104.608612: thermal_power_allocator_pid: thermal_zone_id=6 err=2200 err_integral=-31200 p=528 i=-305 d=0 output=1425
>    irq/40-c263000.-156   [000] ....   104.608642: thermal_power_allocator: thermal_zone_id=6 req_power={{0x331}} total_req_power=817 granted_power={{0x591}} total_granted_power=1425 power_range=1425 max_allocatable_power=1776 current_temperature=67800 delta_temperature=2200
>    irq/40-c263000.-156   [000] ....   104.630863: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=67800 temp=70000
>    irq/40-c263000.-156   [000] ....   104.630907: thermal_power_allocator_pid: thermal_zone_id=6 err=0 err_integral=-31200 p=0 i=-305 d=0 output=897
>    irq/40-c263000.-156   [000] ....   104.630932: thermal_power_allocator: thermal_zone_id=6 req_power={{0x3f4}} total_req_power=1012 granted_power={{0x381}} total_granted_power=897 power_range=897 max_allocatable_power=1776 current_temperature=70000 delta_temperature=0
>    irq/40-c263000.-156   [000] ....   104.687495: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=70000 temp=67800
>    irq/40-c263000.-156   [000] ....   104.687541: thermal_power_allocator_pid: thermal_zone_id=6 err=2200 err_integral=-31200 p=528 i=-305 d=0 output=1425
>    irq/40-c263000.-156   [000] ....   104.687567: thermal_power_allocator: thermal_zone_id=6 req_power={{0x338}} total_req_power=824 granted_power={{0x591}} total_granted_power=1425 power_range=1425 max_allocatable_power=1776 current_temperature=67800 delta_temperature=2200
>    irq/40-c263000.-156   [000] ....   104.711664: thermal_temperature: thermal_zone=cpu6-thermal id=6 temp_prev=67800 temp=70000
> 
> So it seems indeed the 'polling-delay-passive' is needed for better reactivity,
> and it should also be re-added to the other thermal zones.
> 
> On a different note I first tried something similar on the GPU, the trip points
> triggered, however there was no reaction from power_allocator, the reason is
> that there is no power information for the GPU (num power actors = 0). It seems
> it doesn' make sense to use IPA as long as there is no energy model (even if it
> worked by using the lowest frequency as 'estimated power' throttling would
> likely be overly aggressive). Since the trip point configuration for IPA and
> 'step_wise' (and probably others) is somewhat incompatible (IPA aims for a
> temperature around the 2nd trip point, 'step_wise' interprets the first trip
> point as limit) I think it makes sense to continue with a single trip point
> for now.
> 
> Thanks
> 
> Matthias
> _______________________________________________
> dri-devel mailing list
> dri-devel@...ts.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 
Looks like we have consensus here. I will spin another patchset.
Manaf will share a separate patch to fix the CPU tzones.

-Akhil.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ