Date:	Mon, 10 Mar 2014 02:49:46 +0100
From:	Manuel Krause <manuelkrause@...scape.net>
To:	"Rafael J. Wysocki" <rjw@...ysocki.net>,
	linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org
CC:	Guenter Roeck <linux@...ck-us.net>,
	Jean Delvare <jdelvare@...e.de>, lm-sensors@...sensors.org,
	rui.zhang@...el.com
Subject: Re: 3.13.?: Strange / dangerous fan policy...

On 2014-03-09 18:58, Rafael J. Wysocki wrote:
> On Sunday, March 09, 2014 01:10:25 AM Manuel Krause wrote:
>> On 2014-03-08 16:59, Guenter Roeck wrote:
>>> On 03/08/2014 03:08 AM, Jean Delvare wrote:
>>>> On Fri, 7 Mar 2014 14:52:30 -0800, Guenter Roeck wrote:
>>>>> On Fri, Mar 07, 2014 at 11:04:29PM +0100, Manuel Krause wrote:
>>>>>> Hi, and thanks for the quick response!
>>>>>> No special fancy "fan control policy". 'fancontrol' isn't up or
>>>>>> running.
>>>>>> Vanilla kernels 3.11.* and 3.12.* had been working here without
>>>>>> any extra work.
>>>>>> --
>>>>>> # sensors
>>>>>> acpitz-virtual-0
>>>>>> Adapter: Virtual device
>>>>>> temp1:        +71.0°C  (crit = +256.0°C)
>>>>>> temp2:        +69.0°C  (crit = +110.0°C)
>>>>>> temp3:        +52.0°C  (crit = +105.0°C)
>>>>>> temp4:        +25.0°C  (crit = +110.0°C)
>>>>>> temp5:        +58.0°C  (crit = +110.0°C)
>>>>>>
>>>>>> coretemp-isa-0000
>>>>>> Adapter: ISA adapter
>>>>>> Core 0:       +62.0°C  (high = +105.0°C, crit = +105.0°C)
>>>>>> Core 1:       +60.0°C  (high = +105.0°C, crit = +105.0°C)
>>>>>> --
>>>>>> My notebook (HP/Compaq 6730b) does not have a separate fan
>>>>>> sensor.
>>>>>> This is with 3.12.13 with my normal workload.
>>>>>>
>>>>>> Please trust my above-mentioned values of 94°C vs. 74°C, as I
>>>>>> don't want to boot 3.13.6 anymore, to avoid harm to the
>>>>>> notebook's casing.
>>>>>
>>>>> Understood. Unfortunately, we'll need to get information
>>>>> from the new kernel to be able to track down the problem.
>>>>
>>>> Indeed. Not only the run-time temperatures, but also the high
>>>> and crit
>>>> limits.
>>>>
>>>>>> But I'd be willing to test any patch that might improve things.
>>>>>
>>>>> So far I have no idea what is going on. I don't see anything
>>>>> in the
>>>>> drivers providing above data that would explain the behavior,
>>>>> but I might be missing something.
>>>>
>>>> Looks like a regression in the acpi subsystem or in power
>>>> management,
>>>> not hwmon. Hwmon is merely reporting the temperatures, it's not
>>>> responsible for the actual temperatures.
>>>>
>>>
>>> I would agree. I don't think we have enough information to be sure,
>>> though. There might be some unintended interaction or interference.
>>>
>>> gpu is a good hint ... for example, look at commit b9ed919f1c8
>>> (drm/nouveau/drm/pm: remove everything except the hwmon interfaces
>>> to THERM). nouveau does export pwm and fan control information,
>>> so any change in that code may have unintended side effects.
>>> Similarly, I don't know how ec39f64bba (drm/radeon/dpm: Convert to
>>> use devm_hwmon_register_with_groups) could have the observed impact,
>>> as it is purely passive, but I prefer to be rather safe than sorry.
>>>
>>> This problem has now been submitted into bugzilla as
>>> https://bugzilla.kernel.org/show_bug.cgi?id=71711.
>>>
>>> Guenter
>>>
>>
>> Sorry for being late; I had to search for and gather a lot of info
>> for you...
>> I hope it's OK that I put it all into one answer, CCing all of you.
>>
>> My GFX is a GM45 Intel (mobile), shared memory, running the
>> opensource Mesa drivers/extensions.
>> kernel-module: i915
>>
>> According to the output of 'cpupower': I have
>> CPUidle driver: acpi_idle
>> CPUidle governor: menu
>>
>> CPUfreq:
>>     driver: acpi-cpufreq
>>     available cpufreq governors: ondemand, performance
>> -
>> And "ondemand" is running.
>> --
>>
>> # sensors
>> acpitz-virtual-0
>> Adapter: Virtual device
>> temp1:        +41.0°C  (crit = +256.0°C)
>> temp2:        +92.0°C  (crit = +110.0°C)
>> temp3:        +71.0°C  (crit = +105.0°C)
>> temp4:        +26.5°C  (crit = +110.0°C)
>> temp5:        +25.0°C  (crit = +110.0°C)
>>
>> coretemp-isa-0000
>> Adapter: ISA adapter
>> Core 0:       +86.0°C  (high = +105.0°C, crit = +105.0°C)
>> Core 1:       +84.0°C  (high = +105.0°C, crit = +105.0°C)
>>
>> The readings above are from a critical, "smelly" situation today:
>> kernel compilation, fan at 100%.
>> --
>>
>> Additional findings:
>>
>> Identification from bootup ACPI initialisation vs. sensors:
>> temp1 = DTSZ
>> temp2 = CPUZ --> triggering Cooling in 3.12.13 if > 74°C
>> temp3 = SKNZ
>> temp4 = BATZ "Battery Zone" always calm ~ +6°C of ambient T
>> temp5 = FDTZ --- in 3.12.13 a representation of the cooling-fan
>> (25 - 45 - 58 - max?)
>> Core 0 & Core 1 are the internal CPU T sensors.
>>
>> With the 3.13.x (.5+) kernels, the first cooling settings gathered
>> at bootup stay in effect forever. That means rebooting a hot
>> system gets an FDTZ of 45°C or more and causes no problems, as it
>> cools enough (even for kernel compiling here). If FDTZ reads
>> 25°C at bootup, the system goes into emergency cooling at some point.
>> The same happens after a suspend/resume.
>>
>> Kernel 3.12.13 adjusts the cooling on its own, and appropriately.
>
> This almost certainly is an ACPI regression, but I'm not sure whether
> thermal management or CPU power management is broken on your system.
>
> Can you compare the contents of /sys/class/thermal/ from working and
> not working kernels, please?
>
> Rafael
>

Hi again,
unfortunately you didn't specify how deeply I should dig into
/sys/class/thermal, so you get everything between # BOF # and # EOF #
below. I hope it is readable without further comment.

The most remarkable changes, in my eyes, are within "thermal_zone1":
the cdev?_trip_point values appear in reverse order, and
trip_point_3_temp reads 75000 instead of 67000.

Best regards,
Manuel Krause


# BOF #
Everything below is from /sys/class/thermal/, whose entries are links
to -> ../../devices/virtual/thermal/

I've listed the directories in separate sections for cooling_devices
and thermal_zones, for each of the bad and good kernels. This layout is
for email purposes only; you can merge the sections into a spreadsheet
for your own evaluation. I've left out some subdirs and subdir values
that _really_ didn't seem to need attention.

I've also collected the 'sensors' output for each readout, having
reproduced nearly the same workload, as represented by the
"fan speed" (thermal_zone4==FDTZ).

And I've done my very best not to introduce typos or c&p errors.
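
In case someone wants to repeat this readout without copying values by hand, here is a minimal Python sketch. It assumes the usual thermal sysfs attribute names (type, cur_state, max_state, temp, passive, trip_point_N_temp, trip_point_N_type, cdevN_trip_point); the helper names read_attr and snapshot are made up for this example, and missing files are reported as "n.a." as in the tables below.

```python
import os
import re

# Attributes read for every thermal zone, whether or not the file exists.
ZONE_FIXED = ["type", "mode", "policy", "passive", "temp"]
# Per-trip-point and per-binding attributes, whose count varies by zone.
TRIP_RE = re.compile(r"(trip_point_\d+_(temp|type)|cdev\d+_trip_point)")


def read_attr(path):
    """Return the stripped contents of a sysfs file, or 'n.a.' if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "n.a."


def snapshot(root="/sys/class/thermal"):
    """Collect cooling-device and thermal-zone attributes into a flat dict."""
    result = {}
    if not os.path.isdir(root):
        return result
    for entry in sorted(os.listdir(root)):
        base = os.path.join(root, entry)
        if not os.path.isdir(base):
            continue
        if entry.startswith("cooling_device"):
            names = ["type", "cur_state", "max_state"]
        elif entry.startswith("thermal_zone"):
            names = ZONE_FIXED + sorted(
                n for n in os.listdir(base) if TRIP_RE.fullmatch(n)
            )
        else:
            continue
        for name in names:
            result[f"{entry}/{name}"] = read_attr(os.path.join(base, name))
    return result


if __name__ == "__main__":
    # One "path<TAB>value" line per attribute, easy to diff between kernels.
    for key, value in snapshot().items():
        print(f"{key}\t{value}")
```

Running it once under each kernel and saving the output to two files should give something that a plain 'diff' can compare.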


  3.13.5 -- 20140309 -- 20:52 -- bad
=============================
dir             |-
                  /type       /cur_state  /max_state
cooling_device0  Processor    0          10
cooling_device1  Processor    0          10
cooling_device2  Fan          0           1
cooling_device3  Fan          1           1
cooling_device4  Fan          0           1
cooling_device5  Fan          0           1
cooling_device6  Fan          0           1
cooling_device7  LCD          0          24

  3.12.13 -- 20140310 -- 00:26 -- good
==============================
dir             |-
                  /type       /cur_state  /max_state
cooling_device0  Processor    0          10
cooling_device1  Processor    0          10
cooling_device2  Fan          0           1
cooling_device3  Fan          1           1
cooling_device4  Fan          1           1
cooling_device5  Fan          1           1
cooling_device6  Fan          1           1
cooling_device7  LCD          0          24
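
To make the difference between the two cooling-device tables above explicit, here is a small Python sketch. The cur_state values are transcribed from those tables (Fan entries only); the function names fans_on and changed are made up for this example.

```python
# cur_state of the Fan cooling devices, transcribed from the tables above.
BAD_3_13_5 = {
    "cooling_device2": 0,
    "cooling_device3": 1,
    "cooling_device4": 0,
    "cooling_device5": 0,
    "cooling_device6": 0,
}
GOOD_3_12_13 = {
    "cooling_device2": 0,
    "cooling_device3": 1,
    "cooling_device4": 1,
    "cooling_device5": 1,
    "cooling_device6": 1,
}


def fans_on(states):
    """Names of the fan devices whose cur_state is 1."""
    return sorted(name for name, state in states.items() if state == 1)


def changed(bad, good):
    """Names of the devices whose cur_state differs between the kernels."""
    return sorted(name for name in bad if bad[name] != good[name])


print("on (bad): ", fans_on(BAD_3_13_5))
print("on (good):", fans_on(GOOD_3_12_13))
print("differs:  ", changed(BAD_3_13_5, GOOD_3_12_13))
```

With these values, only cooling_device3 is on under 3.13.5, while cooling_device3 to cooling_device6 are on under 3.12.13.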


  3.13.5 -- 20140309 -- 20:52 -- bad
=============================
dir          |-
               /passive /temp  |-     /cdev?_  /trip_   /trip_
                                       trip_    point_   point_
                                       point    ?_temp   ?_type
thermal_zone0  0        68000   ?=0    n.a.   256000   critical
thermal_zone1   n.a.    70000 |-
                                 ?=0   6       110000   critical
                                 ?=1   5       107000   passive
                                 ?=2   4        90000   active
                                 ?=3   3        75000   active
                                 ?=4   2        55000   active
                                 ?=5   1        45000   active
                                 ?=6   1        30000   active
thermal_zone2   n.a.    54000 |-
                                 ?=0   1       105000   critical
                                 ?=1   1        95000   passive
thermal_zone3   n.a.    25800 |-
                                 ?=0   1       110000   critical
                                 ?=1   1        60000   passive
thermal_zone4  0        58000   ?=0    n.a.   110000   critical


  3.12.13 -- 20140310 -- 00:26 -- good
==============================
dir          |-
               /passive /temp  |-     /cdev?_  /trip_   /trip_
                                       trip_    point_   point_
                                       point    ?_temp   ?_type
thermal_zone0  0        50000   ?=0    n.a.   256000   critical
thermal_zone1   n.a.    70000 |-
                                 ?=0   1       110000   critical
                                 ?=1   1       107000   passive
                                 ?=2   2        90000   active
                                 ?=3   3        67000   active
                                 ?=4   4        55000   active
                                 ?=5   5        45000   active
                                 ?=6   6        30000   active
thermal_zone2   n.a.    53000 |-
                                 ?=0   1       105000   critical
                                 ?=1   1        95000   passive
thermal_zone3   n.a.    25600 |-
                                 ?=0   1       110000   critical
                                 ?=1   1        60000   passive
thermal_zone4  0        58000   ?=0    n.a.   110000   critical

---
Legend here:
        /type  is always  acpitz
        /mode             enabled
        /policy           step_wise

       - from kernel ACPI initialisation: thermal_zone0==DTSZ,
          thermal_zone1==CPUZ, thermal_zone2==SKNZ,
          thermal_zone3==BATZ, thermal_zone4==FDTZ
       - n.a. means      file or value is not available
___
Legend in general:
              /power/control          is always  auto
              /power/runtime_status              unsupported
              /uevent                            ''==empty
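
For thermal_zone1 (CPUZ), the differences between the two thermal-zone tables above can be listed with a short Python sketch. The values are transcribed from those tables; the dict layout and the name diff_columns are assumptions made for this example. Note that in sysfs the N in cdevN_trip_point counts cooling-device bindings while the N in trip_point_N_temp counts trip points, so the shared "?" column is only a display convenience.

```python
# thermal_zone1 values transcribed from the tables above.
# cdev_trip[i] is cdev<i>_trip_point; trip_temp[i] is trip_point_<i>_temp
# in millidegrees Celsius.
BAD_3_13_5 = {
    "cdev_trip": [6, 5, 4, 3, 2, 1, 1],
    "trip_temp": [110000, 107000, 90000, 75000, 55000, 45000, 30000],
}
GOOD_3_12_13 = {
    "cdev_trip": [1, 1, 2, 3, 4, 5, 6],
    "trip_temp": [110000, 107000, 90000, 67000, 55000, 45000, 30000],
}


def diff_columns(bad, good):
    """Return (column, index, bad_value, good_value) for each differing cell."""
    changes = []
    for column in bad:
        for i, (b, g) in enumerate(zip(bad[column], good[column])):
            if b != g:
                changes.append((column, i, b, g))
    return changes


for column, i, b, g in diff_columns(BAD_3_13_5, GOOD_3_12_13):
    print(f"{column}[{i}]: bad={b} good={g}")

# The binding list of the bad kernel is the good one read backwards.
print(list(reversed(BAD_3_13_5["cdev_trip"])) == GOOD_3_12_13["cdev_trip"])
```

This only restates the tables; it does not say which kernel change caused the different order or the different trip_point_3_temp.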

----------------------------------------------------------------

  3.13.5 -- 20140309 -- 20:52 -- bad
=============================
# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +68.0°C  (crit = +256.0°C)
temp2:        +70.0°C  (crit = +110.0°C)
temp3:        +54.0°C  (crit = +105.0°C)
temp4:        +25.8°C  (crit = +110.0°C)
temp5:        +58.0°C  (crit = +110.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +66.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +63.0°C  (high = +105.0°C, crit = +105.0°C)


  3.12.13 -- 20140310 -- 00:26 -- good
==============================
# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +50.0°C  (crit = +256.0°C)
temp2:        +70.0°C  (crit = +110.0°C)
temp3:        +53.0°C  (crit = +105.0°C)
temp4:        +25.6°C  (crit = +110.0°C)
temp5:        +58.0°C  (crit = +110.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +65.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +61.0°C  (high = +105.0°C, crit = +105.0°C)

# EOF #
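
As a sanity check on the transcription, the thermal_zone /temp values (millidegrees Celsius) should match the acpitz temp1..temp5 lines from 'sensors', given the zone mapping in the legend above. A minimal Python sketch, with the values copied from this mail and the function names chosen for this example:

```python
# thermal_zone0..4 /temp values and acpitz temp1..temp5 readings,
# transcribed from the readouts in this mail.
ZONE_TEMPS_BAD = [68000, 70000, 54000, 25800, 58000]
SENSORS_BAD = [68.0, 70.0, 54.0, 25.8, 58.0]
ZONE_TEMPS_GOOD = [50000, 70000, 53000, 25600, 58000]
SENSORS_GOOD = [50.0, 70.0, 53.0, 25.6, 58.0]


def millidegrees_to_celsius(value):
    """Convert a sysfs temperature in millidegrees Celsius to degrees."""
    return value / 1000.0


def consistent(zone_temps, sensor_temps, tolerance=0.05):
    """True if every zone temperature matches its sensors reading."""
    return len(zone_temps) == len(sensor_temps) and all(
        abs(millidegrees_to_celsius(z) - s) <= tolerance
        for z, s in zip(zone_temps, sensor_temps)
    )


print("bad kernel consistent: ", consistent(ZONE_TEMPS_BAD, SENSORS_BAD))
print("good kernel consistent:", consistent(ZONE_TEMPS_GOOD, SENSORS_GOOD))
```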

