Message-ID: <531F8735.1010203@netscape.net>
Date:	Tue, 11 Mar 2014 22:59:17 +0100
From:	Manuel Krause <manuelkrause@...scape.net>
To:	"Rafael J. Wysocki" <rjw@...ysocki.net>,
	linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org,
	rui.zhang@...el.com
CC:	Guenter Roeck <linux@...ck-us.net>,
	Jean Delvare <jdelvare@...e.de>, lm-sensors@...sensors.org
Subject: Re: 3.13.?: Strange / dangerous fan policy...

On 2014-03-10 02:49, Manuel Krause wrote:
> On 2014-03-09 18:58, Rafael J. Wysocki wrote:
>> On Sunday, March 09, 2014 01:10:25 AM Manuel Krause wrote:
>>> On 2014-03-08 16:59, Guenter Roeck wrote:
>>>> On 03/08/2014 03:08 AM, Jean Delvare wrote:
>>>>> On Fri, 7 Mar 2014 14:52:30 -0800, Guenter Roeck wrote:
>>>>>> On Fri, Mar 07, 2014 at 11:04:29PM +0100, Manuel Krause wrote:
>>>>>>> Hi, and thanks for the quick response!
>>>>>>> No special fancy "fan control policy". 'fancontrol' isn't
>>>>>>> up or
>>>>>>> running.
>>>>>>> Vanilla kernels 3.11.* and 3.12.* had been working on here
>>>>>>> without
>>>>>>> any extra work.
>>>>>>> --
>>>>>>> # sensors
>>>>>>> acpitz-virtual-0
>>>>>>> Adapter: Virtual device
>>>>>>> temp1:        +71.0°C  (crit = +256.0°C)
>>>>>>> temp2:        +69.0°C  (crit = +110.0°C)
>>>>>>> temp3:        +52.0°C  (crit = +105.0°C)
>>>>>>> temp4:        +25.0°C  (crit = +110.0°C)
>>>>>>> temp5:        +58.0°C  (crit = +110.0°C)
>>>>>>>
>>>>>>> coretemp-isa-0000
>>>>>>> Adapter: ISA adapter
>>>>>>> Core 0:       +62.0°C  (high = +105.0°C, crit = +105.0°C)
>>>>>>> Core 1:       +60.0°C  (high = +105.0°C, crit = +105.0°C)
>>>>>>> --
>>>>>>> My notebook (HP/Compaq 6730b) does not have a separate fan
>>>>>>> sensor.
>>>>>>> This is with 3.12.13 with my normal workload.
>>>>>>>
>>>>>>> Please trust my above-mentioned values of 94°C vs. 74°C,
>>>>>>> as I'd rather not boot 3.13.6 again, to avoid harm to the
>>>>>>> notebook's casing.
>>>>>>
>>>>>> Understood. Unfortunately, we'll need to get information
>>>>>> from the new kernel to be able to track down the problem.
>>>>>
>>>>> Indeed. Not only the run-time temperatures, but also the high
>>>>> and crit
>>>>> limits.
>>>>>
>>>>>>> But I'd be willing to test any patch that might improve this.
>>>>>>
>>>>>> So far I have no idea what is going on. I don't see anything
>>>>>> in the
>>>>>> drivers providing above data that would explain the behavior,
>>>>>> but I might be missing something.
>>>>>
>>>>> Looks like a regression in the acpi subsystem or in power
>>>>> management,
>>>>> not hwmon. Hwmon is merely reporting the temperatures, it's not
>>>>> responsible for the actual temperatures.
>>>>>
>>>>
>>>> I would agree. I don't think we have enough information to be
>>>> sure,
>>>> though. There might be some unintended interaction or
>>>> interference.
>>>>
>>>> gpu is a good hint ... for example, look at commit b9ed919f1c8
>>>> (drm/nouveau/drm/pm: remove everything except the hwmon
>>>> interfaces
>>>> to THERM). nouveau does export pwm and fan control information,
>>>> so any change in that code may have unintended side effects.
>>>> Similar, I don't know how ec39f64bba (drm/radeon/dpm: Convert to
>>>> use devm_hwmon_register_with_groups) could have the observed
>>>> impact,
>>>> as it is purely passive, but I prefer to be rather safe than
>>>> sorry.
>>>>
>>>> This problem has now been submitted into bugzilla as
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=71711.
>>>>
>>>> Guenter
>>>>
>>>
>>> Sorry for being late, I had to search for and collect a lot of
>>> info for you...
>>> I hope you don't mind that I put it all into one answer, CCing
>>> all of you.
>>>
>>> My GFX is a GM45 Intel (mobile), shared memory, running the
>>> opensource Mesa drivers/extensions.
>>> kernel-module: i915
>>>
>>> According to the output of 'cpupower': I have
>>> CPUidle driver: acpi_idle
>>> CPUidle governor: menu
>>>
>>> CPUfreq:
>>>     driver: acpi-cpufreq
>>>     available cpufreq governors: ondemand, performance
>>> -
>>> And "ondemand" is running.
>>> --
>>>
>>> # sensors
>>> acpitz-virtual-0
>>> Adapter: Virtual device
>>> temp1:        +41.0°C  (crit = +256.0°C)
>>> temp2:        +92.0°C  (crit = +110.0°C)
>>> temp3:        +71.0°C  (crit = +105.0°C)
>>> temp4:        +26.5°C  (crit = +110.0°C)
>>> temp5:        +25.0°C  (crit = +110.0°C)
>>>
>>> coretemp-isa-0000
>>> Adapter: ISA adapter
>>> Core 0:       +86.0°C  (high = +105.0°C, crit = +105.0°C)
>>> Core 1:       +84.0°C  (high = +105.0°C, crit = +105.0°C)
>>>
>>> FROM a critical "smelly" situation today, kernel-compilation, fan
>>> @100%.
>>> --
>>>
>>> Additional findings:
>>>
>>> Identification from bootup ACPI initialisation vs. sensors:
>>> temp1 = DTSZ
>>> temp2 = CPUZ --> triggering Cooling in 3.12.13 if > 74°C
>>> temp3 = SKNZ
>>> temp4 = BATZ "Battery Zone" always calm ~ +6°C of ambient T
>>> temp5 = FDTZ --- in 3.12.13 a representation of the cooling-fan
>>> (25 - 45 - 58 - max?)
>>> Core 0 & Core 1 are the internal CPU T sensors.
>>>
>>> With the 3.13.x (.5+) kernels the first gathered cooling
>>> settings from bootup stay forever. This means that rebooting a
>>> hot system gets FDTZ @45°C+ and causes no problems, as it
>>> cools enough (even for kernel compiling on here). If it gets
>>> 25°C @bootup, the system goes into emergency cooling sooner or
>>> later. The same happens after a suspend/resume.
>>>
>>> Kernel 3.12.13 adjusts the cooling on its own, and does so
>>> appropriately.
>>
>> This almost certainly is an ACPI regression, but I'm not sure
>> whether
>> thermal management or CPU power management is broken on your
>> system.
>>
>> Can you compare the contents of /sys/class/thermal/ from
>> working and
>> not working kernels, please?
>>
>> Rafael
>>
>
> Hi again,
> unfortunately you didn't specify how deeply I should dig into
> /sys/class/thermal. So you get the lines from # BOF # to # EOF #
> below. I hope they're readable without more comments.
>
> The most remarkable changes, in my eyes, had happened within
> "thermal_zone1".
>
> Best regards,
> Manuel Krause
>
>
> # BOF #
> Following ones are all from /sys/class/thermal/ which are links
> to -> ../../devices/virtual/thermal/
>
> I've listed the directories in sections of cooling_devices and
> thermal_zones separately for each bad/good kernel. For Emailing
> purposes only. You can merge them into a spreadsheet for your
> evaluation on your own. I've left out reporting some subdirs and
> subdir's values that _really_ didn't seem to need attention.
>
> Also, I've collected the #sensors output for each readout,
> having reproduced nearly the same workload, represented by the
> "Fan speed" (thermal_zone4==FDTZ).
>
> And I've done my very best to not produce typos or c&p errors.
>
>
>   3.13.5 -- 20140309 -- 20:52 -- bad
> =============================
> dir             |-
>                   /type       /cur_state  /max_state
> cooling_device0  Processor    0          10
> cooling_device1  Processor    0          10
> cooling_device2  Fan          0           1
> cooling_device3  Fan          1           1
> cooling_device4  Fan          0           1
> cooling_device5  Fan          0           1
> cooling_device6  Fan          0           1
> cooling_device7  LCD          0          24
>
>   3.12.13 -- 20140310 -- 00:26 -- good
> ==============================
> dir             |-
>                   /type       /cur_state  /max_state
> cooling_device0  Processor    0          10
> cooling_device1  Processor    0          10
> cooling_device2  Fan          0           1
> cooling_device3  Fan          1           1
> cooling_device4  Fan          1           1
> cooling_device5  Fan          1           1
> cooling_device6  Fan          1           1
> cooling_device7  LCD          0          24
>
>
>   3.13.5 -- 20140309 -- 20:52 -- bad
> =============================
> dir          |-
>                /passive /temp  |-     /cdev?_  /trip_   /trip_
>                                        trip_    point_   point_
>                                        point    ?_temp   ?_type
> thermal_zone0  0        68000   ?=0    n.a.   256000   critical
> thermal_zone1   n.a.    70000 |-
>                                  ?=0   6       110000   critical
>                                  ?=1   5       107000   passive
>                                  ?=2   4        90000   active
>                                  ?=3   3        75000   active
>                                  ?=4   2        55000   active
>                                  ?=5   1        45000   active
>                                  ?=6   1        30000   active
> thermal_zone2   n.a.    54000 |-
>                                  ?=0   1       105000   critical
>                                  ?=1   1        95000   passive
> thermal_zone3   n.a.    25800 |-
>                                  ?=0   1       110000   critical
>                                  ?=1   1        60000   passive
> thermal_zone4  0        58000   ?=0    n.a.   110000   critical
>
>
>   3.12.13 -- 20140310 -- 00:26 -- good
> ==============================
> dir          |-
>                /passive /temp  |-     /cdev?_  /trip_   /trip_
>                                        trip_    point_   point_
>                                        point    ?_temp   ?_type
> thermal_zone0  0        50000   ?=0    n.a.   256000   critical
> thermal_zone1   n.a.    70000 |-
>                                  ?=0   1       110000   critical
>                                  ?=1   1       107000   passive
>                                  ?=2   2        90000   active
>                                  ?=3   3        67000   active
>                                  ?=4   4        55000   active
>                                  ?=5   5        45000   active
>                                  ?=6   6        30000   active
> thermal_zone2   n.a.    53000 |-
>                                  ?=0   1       105000   critical
>                                  ?=1   1        95000   passive
> thermal_zone3   n.a.    25600 |-
>                                  ?=0   1       110000   critical
>                                  ?=1   1        60000   passive
> thermal_zone4  0        58000   ?=0    n.a.   110000   critical
>
> ---
> Legend here:
>         /type  is always  acpitz
>         /mode             enabled
>         /policy           step_wise
>
>        - from kernel ACPI initialisation: thermal_zone0==DTSZ,
>           thermal_zone1==CPUZ, thermal_zone2==SKNZ,
>           thermal_zone3==BATZ, thermal_zone4==FDTZ
>        - n.a. means      file or value is not available
> ___
> Legend in general:
>               /power/control          is always  auto
>               /power/runtime_status              unsupported
>               /uevent                            ''==empty
>
> ----------------------------------------------------------------
>
>   3.13.5 -- 20140309 -- 20:52 -- bad
> =============================
> # sensors
> acpitz-virtual-0
> Adapter: Virtual device
> temp1:        +68.0°C  (crit = +256.0°C)
> temp2:        +70.0°C  (crit = +110.0°C)
> temp3:        +54.0°C  (crit = +105.0°C)
> temp4:        +25.8°C  (crit = +110.0°C)
> temp5:        +58.0°C  (crit = +110.0°C)
>
> coretemp-isa-0000
> Adapter: ISA adapter
> Core 0:       +66.0°C  (high = +105.0°C, crit = +105.0°C)
> Core 1:       +63.0°C  (high = +105.0°C, crit = +105.0°C)
>
>
>   3.12.13 -- 20140310 -- 00:26 -- good
> ==============================
> # sensors
> acpitz-virtual-0
> Adapter: Virtual device
> temp1:        +50.0°C  (crit = +256.0°C)
> temp2:        +70.0°C  (crit = +110.0°C)
> temp3:        +53.0°C  (crit = +105.0°C)
> temp4:        +25.6°C  (crit = +110.0°C)
> temp5:        +58.0°C  (crit = +110.0°C)
>
> coretemp-isa-0000
> Adapter: ISA adapter
> Core 0:       +65.0°C  (high = +105.0°C, crit = +105.0°C)
> Core 1:       +61.0°C  (high = +105.0°C, crit = +105.0°C)
>
> # EOF #
>
>

Hi, and thank you for your attention ^^

At the bottom of this email you will find the current values for
the new 3.12.14 kernel, for two different levels of usage and
ambient temperature.
As you can see, in kernel 3.12.14 the /cdev?_trip_point
enumeration has changed to the ordering seen in 3.13.?, and so
did one /trip_point_?_temp value (?=3: 67000 -> 75000). But
3.12.14 works as well as 3.12.13. (So the difference that first
caught my eye did not lead anywhere useful.)
I'm not capable of finding or understanding the related code,
but please let me present an idea of what MAY be going on:

In 3.12.13+, on my system, the effective cooling fan speed seems
to be an accumulation, maybe bitwise, of
cooling_device[2-6]/cur_state. Each of these gets activated (=1)
at its own temperature level and stays at 1 as long as the
reference temperature does not drop below that level. On my
system the reference temperature would most likely be
temp2 == thermal_zone1/temp [CPUZ].

In 3.13.? only one of cooling_device[2-6]/cur_state seems to get
set to 1, while the others are left at, or rewritten with, 0.
The fan speed algorithm then accumulates only a single 1, without
taking the [_LEVEL_] number of that cooling_device[2-6] into
account and without re-checking the related trigger temperature.
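
To make the guess above concrete, here is a minimal Python sketch.
It is only a model of the hypothesis, not kernel code: the function
names and the idea of simply counting the active devices are
assumptions made for illustration. The cur_state values are the
ones from the quoted 3.12.13 and 3.13.5 readouts, both taken at
FDTZ == 58°C.

```python
# Hypothetical model of the guess above; not taken from the kernel sources.
# cur_state of cooling_device2..6, copied from the quoted readouts.
GOOD_3_12_13 = {2: 0, 3: 1, 4: 1, 5: 1, 6: 1}
BAD_3_13_5 = {2: 0, 3: 1, 4: 0, 5: 0, 6: 0}


def accumulated_level(cur_states):
    """Count how many fan cooling devices are switched on (cur_state == 1)."""
    return sum(1 for state in cur_states.values() if state == 1)


def active_devices(cur_states):
    """Return the numbers of the cooling devices that are switched on."""
    return sorted(dev for dev, state in cur_states.items() if state == 1)


if __name__ == "__main__":
    for label, states in (("3.12.13", GOOD_3_12_13), ("3.13.5", BAD_3_13_5)):
        print(label, accumulated_level(states), active_devices(states))
```

Under this model the good kernel has four fan devices on at the
same FDTZ reading where the bad kernel has only one, which would
match the weaker cooling seen with 3.13.?.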

I hope this brings you developers closer to a conclusion on how
to fix it.
Best regards, Manuel Krause

_____________________________
3.12.14 -- 20140311 -- 19:07 -- changed, not broken -- normal use
=============================
/sys/class/thermal/*  which
are links to -> ../../devices/virtual/thermal/*

dir             |-
                  /type       /cur_state  /max_state  Maybe
                                                       trigger
                                                       /PWM
...
cooling_device2  Fan          0           1          not yet
                                                       observed
cooling_device3  Fan          0           1          FDTZ==58°C
cooling_device4  Fan          1           1          FDTZ==45°C
cooling_device5  Fan          1           1          FDTZ==34°C
cooling_device6  Fan          1           1          FDTZ==25°C
...

dir          |-
               /passive /temp  |-     /cdev?_  /trip_   /trip_
                                       trip_    point_   point_
                                       point    ?_temp   ?_type
...
thermal_zone1   n.a.    73000 |-                                (CPUZ)
                                 ?=0   6       110000   critical
                                 ?=1   5       107000   passive
                                 ?=2   4        90000   active
                                 ?=3   3        75000   active
                                 ?=4   2        55000   active
                                 ?=5   1        45000   active
                                 ?=6   1        30000   active
...
thermal_zone4   n.a.    45000   ?=0    n.a.   110000   critical (FDTZ)
...

# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +46.0°C  (crit = +256.0°C)
temp2:        +73.0°C  (crit = +110.0°C)
temp3:        +57.0°C  (crit = +105.0°C)
temp4:        +26.3°C  (crit = +110.0°C)
temp5:        +45.0°C  (crit = +110.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +68.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +66.0°C  (high = +105.0°C, crit = +105.0°C)


_____________________________
3.12.14 -- 20140311 -- 21:09 -- changed, not broken -- idle state
=============================

dir             |-
                  /type       /cur_state  /max_state  Maybe
                                                       trigger
                                                       /PWM
...
cooling_device2  Fan          0           1          not yet
                                                       observed
cooling_device3  Fan          0           1          FDTZ==58°C
cooling_device4  Fan          0           1          FDTZ==45°C
cooling_device5  Fan          0           1          FDTZ==34°C
cooling_device6  Fan          1           1          FDTZ==25°C
...

dir          |-
               /passive /temp
thermal_zone1   n.a.    46000 ... (CPUZ)
...
thermal_zone4   n.a.    25000 ... (FDTZ)
...

# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +50.0°C  (crit = +256.0°C)
temp2:        +46.0°C  (crit = +110.0°C)
temp3:        +44.0°C  (crit = +105.0°C)
temp4:        +25.7°C  (crit = +110.0°C)
temp5:        +25.0°C  (crit = +110.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +41.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +41.0°C  (high = +105.0°C, crit = +105.0°C)
_____________________________
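
A minimal sketch, in Python, of how readouts like the ones above
could be collected automatically instead of by hand, so that the
output from two kernels can be compared with diff. The attribute
names are the ones listed in this mail; the function name snapshot
and the output format are assumptions, and the script was not used
to produce the tables above.

```python
import os
import sys

# Attribute files named in this mail; anything else is skipped.
WANTED = {"type", "cur_state", "max_state", "temp", "passive", "mode", "policy"}


def is_wanted(name):
    """True for the per-device and per-zone files listed in the tables."""
    if name in WANTED or name.startswith("trip_point_"):
        return True
    return name.startswith("cdev") and name.endswith("_trip_point")


def snapshot(root="/sys/class/thermal"):
    """Return {"<dir>/<file>": value} for cooling devices and thermal zones."""
    result = {}
    if not os.path.isdir(root):
        return result  # no thermal sysfs here (e.g. not running on Linux)
    for entry in sorted(os.listdir(root)):
        if not entry.startswith(("cooling_device", "thermal_zone")):
            continue
        directory = os.path.join(root, entry)
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if not is_wanted(name) or not os.path.isfile(path):
                continue
            try:
                with open(path) as handle:
                    result[f"{entry}/{name}"] = handle.read().strip()
            except OSError:
                result[f"{entry}/{name}"] = "n.a."  # unreadable attribute
    return result


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "/sys/class/thermal"
    for key, value in sorted(snapshot(root).items()):
        print(f"{key}\t{value}")
```

Running it once under each kernel with the output redirected to a
file, and then diffing the two files, should show which cur_state
and trip point values differ.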

