Message-ID: <531D1A3A.4040500@netscape.net>
Date: Mon, 10 Mar 2014 02:49:46 +0100
From: Manuel Krause <manuelkrause@...scape.net>
To: "Rafael J. Wysocki" <rjw@...ysocki.net>,
linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org
CC: Guenter Roeck <linux@...ck-us.net>,
Jean Delvare <jdelvare@...e.de>, lm-sensors@...sensors.org,
rui.zhang@...el.com
Subject: Re: 3.13.?: Strange / dangerous fan policy...
On 2014-03-09 18:58, Rafael J. Wysocki wrote:
> On Sunday, March 09, 2014 01:10:25 AM Manuel Krause wrote:
>> On 2014-03-08 16:59, Guenter Roeck wrote:
>>> On 03/08/2014 03:08 AM, Jean Delvare wrote:
>>>> On Fri, 7 Mar 2014 14:52:30 -0800, Guenter Roeck wrote:
>>>>> On Fri, Mar 07, 2014 at 11:04:29PM +0100, Manuel Krause wrote:
>>>>>> Hi, and thanks for the quick response!
>>>>>> No special fancy "fan control policy". 'fancontrol' isn't up or
>>>>>> running.
>>>>>> Vanilla kernels 3.11.* and 3.12.* had been working on here
>>>>>> without
>>>>>> any extra work.
>>>>>> --
>>>>>> # sensors
>>>>>> acpitz-virtual-0
>>>>>> Adapter: Virtual device
>>>>>> temp1: +71.0°C (crit = +256.0°C)
>>>>>> temp2: +69.0°C (crit = +110.0°C)
>>>>>> temp3: +52.0°C (crit = +105.0°C)
>>>>>> temp4: +25.0°C (crit = +110.0°C)
>>>>>> temp5: +58.0°C (crit = +110.0°C)
>>>>>>
>>>>>> coretemp-isa-0000
>>>>>> Adapter: ISA adapter
>>>>>> Core 0: +62.0°C (high = +105.0°C, crit = +105.0°C)
>>>>>> Core 1: +60.0°C (high = +105.0°C, crit = +105.0°C)
>>>>>> --
>>>>>> My notebook (HP/Compaq 6730b) does not have a separate fan
>>>>>> sensor.
>>>>>> This is with 3.12.13 with my normal workload.
>>>>>>
>>>>>> Please trust my above-mentioned values of 94°C vs. 74°C, as I
>>>>>> don't like to boot 3.13.6 anymore, to avoid harm to the
>>>>>> notebook's
>>>>>> casing.
>>>>>
>>>>> Understood. Unfortunately, we'll need to get information
>>>>> from the new kernel to be able to track down the problem.
>>>>
>>>> Indeed. Not only the run-time temperatures, but also the high
>>>> and crit
>>>> limits.
>>>>
>>>>>> But I'd do to test any improvement-patch.
>>>>>
>>>>> So far I have no idea what is going on. I don't see anything
>>>>> in the
>>>>> drivers providing above data that would explain the behavior,
>>>>> but I might be missing something.
>>>>
>>>> Looks like a regression in the acpi subsystem or in power
>>>> management,
>>>> not hwmon. Hwmon is merely reporting the temperatures, it's not
>>>> responsible for the actual temperatures.
>>>>
>>>
>>> I would agree. I don't think we have enough information to be sure,
>>> though. There might be some unintended interaction or interference.
>>>
>>> gpu is a good hint ... for example, look at commit b9ed919f1c8
>>> (drm/nouveau/drm/pm: remove everything except the hwmon interfaces
>>> to THERM). nouveau does export pwm and fan control information,
>>> so any change in that code may have unintended side effects.
>>> Similar, I don't know how ec39f64bba (drm/radeon/dpm: Convert to
>>> use devm_hwmon_register_with_groups) could have the observed impact,
>>> as it is purely passive, but I prefer to be rather safe than sorry.
>>>
>>> This problem has now been submitted into bugzilla as
>>> https://bugzilla.kernel.org/show_bug.cgi?id=71711.
>>>
>>> Guenter
>>>
>>
>> Sorry for being late; I had to search for and gather a lot of info
>> for you...
>> I hope you don't mind that I put it all into one answer, CCing all of you.
>>
>> My GFX is a GM45 Intel (mobile), shared memory, running the
>> opensource Mesa drivers/extensions.
>> kernel-module: i915
>>
>> According to the output of 'cpupower': I have
>> CPUidle driver: acpi_idle
>> CPUidle governor: menu
>>
>> CPUfreq:
>> driver: acpi-cpufreq
>> available cpufreq governors: ondemand, performance
>> -
>> And "ondemand" is running.
>> --
>>
>> # sensors
>> acpitz-virtual-0
>> Adapter: Virtual device
>> temp1: +41.0°C (crit = +256.0°C)
>> temp2: +92.0°C (crit = +110.0°C)
>> temp3: +71.0°C (crit = +105.0°C)
>> temp4: +26.5°C (crit = +110.0°C)
>> temp5: +25.0°C (crit = +110.0°C)
>>
>> coretemp-isa-0000
>> Adapter: ISA adapter
>> Core 0: +86.0°C (high = +105.0°C, crit = +105.0°C)
>> Core 1: +84.0°C (high = +105.0°C, crit = +105.0°C)
>>
>> FROM a critical "smelly" situation today, kernel-compilation, fan
>> @100%.
>> --
>>
>> Additional findings:
>>
>> Identification from bootup ACPI initialisation vs. sensors:
>> temp1 = DTSZ
>> temp2 = CPUZ --> triggering Cooling in 3.12.13 if > 74°C
>> temp3 = SKNZ
>> temp4 = BATZ "Battery Zone" always calm ~ +6°C of ambient T
>> temp5 = FDTZ --- in 3.12.13 a representation of the cooling-fan
>> (25 - 45 - 58 - max?)
>> Core 0 & Core 1 are the internal CPU T sensors.
>>
>> With the 3.13.x (.5+) kernels, the first cooling settings gathered
>> at bootup stay forever. That means rebooting a hot
>> system will get an FDTZ @45°C+ and won't cause any problems, as it
>> does cool enough (even for kernel compiling on here). If it gets
>> 25°C @bootup, the system goes into emergency cooling at some point.
>> The same applies to a suspend/resume.
>>
>> Kernel 3.12.13 adjusts the cooling on its own, and appropriately.
>
> This almost certainly is an ACPI regression, but I'm not sure whether
> thermal management or CPU power management is broken on your system.
>
> Can you compare the contents of /sys/class/thermal/ from working and
> not working kernels, please?
>
> Rafael
>
Hi again,
unfortunately you didn't specify how deeply I should dig into
/sys/class/thermal, so you get everything between # BOF # and # EOF #
below. I hope it is readable without further comments.
The most remarkable changes, in my eyes, are within
"thermal_zone1".
Best regards,
Manuel Krause
# BOF #
The following are all from /sys/class/thermal/, whose entries are links
to -> ../../devices/virtual/thermal/
I've listed the cooling_device and thermal_zone directories in
separate sections for each bad/good kernel, purely to keep this email
readable. You can merge them into a spreadsheet for your own
evaluation. I've left out some subdirectories and values that
_really_ didn't seem to need attention.
I've also collected the 'sensors' output for each readout,
having reproduced nearly the same workload, as represented by the
"Fan speed" (thermal_zone4==FDTZ).
And I've done my very best not to produce typos or copy-and-paste errors.
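For anyone who wants to repeat such a readout without copying values by hand, here is a minimal sketch in Python. It assumes the usual sysfs layout (one directory per cooling_device or thermal_zone, one plain file per attribute); the helper names read_attr and snapshot_thermal are made up for illustration, and unreadable files are reported as "n.a.", matching the notation used below.

```python
import os


def read_attr(path):
    """Return the stripped contents of one attribute file, or 'n.a.' if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "n.a."


def snapshot_thermal(root="/sys/class/thermal"):
    """Map 'device/attribute' -> value for every plain file one level below root."""
    snap = {}
    for dev in sorted(os.listdir(root)):
        dev_dir = os.path.join(root, dev)
        if not os.path.isdir(dev_dir):
            continue
        for attr in sorted(os.listdir(dev_dir)):
            path = os.path.join(dev_dir, attr)
            # Subdirectories and links to directories (power/, cdev0, ...) are skipped.
            if os.path.isfile(path):
                snap[dev + "/" + attr] = read_attr(path)
    return snap
```

Calling snapshot_thermal() once under each kernel and writing the sorted key/value pairs to a text file would give two files that can be compared with an ordinary diff.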
3.13.5 -- 20140309 -- 20:52 -- bad
=============================
dir               /type       /cur_state   /max_state
cooling_device0   Processor   0            10
cooling_device1   Processor   0            10
cooling_device2   Fan         0            1
cooling_device3   Fan         1            1
cooling_device4   Fan         0            1
cooling_device5   Fan         0            1
cooling_device6   Fan         0            1
cooling_device7   LCD         0            24
3.12.13 -- 20140310 -- 00:26 -- good
==============================
dir               /type       /cur_state   /max_state
cooling_device0   Processor   0            10
cooling_device1   Processor   0            10
cooling_device2   Fan         0            1
cooling_device3   Fan         1            1
cooling_device4   Fan         1            1
cooling_device5   Fan         1            1
cooling_device6   Fan         1            1
cooling_device7   LCD         0            24
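To make the difference between the two cooling_device listings explicit, a small sketch in Python follows. The cur_state values of the five Fan devices are copied from the listings above; the dictionary and function names are only illustrative.

```python
# cur_state of the Fan cooling devices, keyed by cooling_device number,
# copied from the two listings above.
BAD_3_13_5 = {2: 0, 3: 1, 4: 0, 5: 0, 6: 0}
GOOD_3_12_13 = {2: 0, 3: 1, 4: 1, 5: 1, 6: 1}


def changed_devices(bad, good):
    """Return the device numbers whose cur_state differs between the two readouts."""
    return sorted(dev for dev in bad if bad[dev] != good.get(dev))


print(changed_devices(BAD_3_13_5, GOOD_3_12_13))  # prints [4, 5, 6]
```

Under nearly the same workload, only cooling_device3 is switched on with 3.13.5, while 3.12.13 has cooling_device3 through cooling_device6 switched on.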
3.13.5 -- 20140309 -- 20:52 -- bad
=============================
dir             /passive   /temp   ?   /cdev?_trip_point   /trip_point_?_temp   /trip_point_?_type
thermal_zone0   0          68000   0   n.a.                256000               critical
thermal_zone1   n.a.       70000   0   6                   110000               critical
                                   1   5                   107000               passive
                                   2   4                   90000                active
                                   3   3                   75000                active
                                   4   2                   55000                active
                                   5   1                   45000                active
                                   6   1                   30000                active
thermal_zone2   n.a.       54000   0   1                   105000               critical
                                   1   1                   95000                passive
thermal_zone3   n.a.       25800   0   1                   110000               critical
                                   1   1                   60000                passive
thermal_zone4   0          58000   0   n.a.                110000               critical
3.12.13 -- 20140310 -- 00:26 -- good
==============================
dir             /passive   /temp   ?   /cdev?_trip_point   /trip_point_?_temp   /trip_point_?_type
thermal_zone0   0          50000   0   n.a.                256000               critical
thermal_zone1   n.a.       70000   0   1                   110000               critical
                                   1   1                   107000               passive
                                   2   2                   90000                active
                                   3   3                   67000                active
                                   4   4                   55000                active
                                   5   5                   45000                active
                                   6   6                   30000                active
thermal_zone2   n.a.       53000   0   1                   105000               critical
                                   1   1                   95000                passive
thermal_zone3   n.a.       25600   0   1                   110000               critical
                                   1   1                   60000                passive
thermal_zone4   0          58000   0   n.a.                110000               critical
---
Legend here:
/type is always acpitz
/mode enabled
/policy step_wise
- from kernel ACPI initialisation: thermal_zone0==DTSZ,
thermal_zone1==CPUZ, thermal_zone2==SKNZ,
thermal_zone3==BATZ, thermal_zone4==FDTZ
- n.a. means file or value is not available
___
Legend in general:
/power/control is always auto
/power/runtime_status unsupported
/uevent ''==empty
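As a cross-check of where thermal_zone1 (CPUZ) differs between the two kernels, here is a minimal sketch in Python. The rows are copied from the two thermal_zone tables above, one tuple per "?" index; the names diff_rows and millideg_to_c are illustrative, and the sketch assumes the /temp and trip point values are in millidegrees Celsius, which matches the 'sensors' readings (70000 vs. +70.0°C).

```python
# thermal_zone1 rows, index ? = 0..6:
# (cdev?_trip_point, trip_point_?_temp, trip_point_?_type)
BAD_3_13_5 = [
    (6, 110000, "critical"), (5, 107000, "passive"), (4, 90000, "active"),
    (3, 75000, "active"), (2, 55000, "active"), (1, 45000, "active"),
    (1, 30000, "active"),
]
GOOD_3_12_13 = [
    (1, 110000, "critical"), (1, 107000, "passive"), (2, 90000, "active"),
    (3, 67000, "active"), (4, 55000, "active"), (5, 45000, "active"),
    (6, 30000, "active"),
]
COLUMNS = ("cdev_trip_point", "temp", "type")


def diff_rows(bad, good):
    """Return (index, column, bad_value, good_value) for every differing cell."""
    out = []
    for i, (b, g) in enumerate(zip(bad, good)):
        for name, bv, gv in zip(COLUMNS, b, g):
            if bv != gv:
                out.append((i, name, bv, gv))
    return out


def millideg_to_c(value):
    """Convert a millidegree reading to degrees Celsius."""
    return value / 1000.0


for index, column, bad_value, good_value in diff_rows(BAD_3_13_5, GOOD_3_12_13):
    print(index, column, bad_value, good_value)
```

With these rows, the cdev?_trip_point column differs at every index except 3, and the only differing trip point temperature is index 3 (75.0°C with 3.13.5 vs. 67.0°C with 3.12.13).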
----------------------------------------------------------------
3.13.5 -- 20140309 -- 20:52 -- bad
=============================
# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1: +68.0°C (crit = +256.0°C)
temp2: +70.0°C (crit = +110.0°C)
temp3: +54.0°C (crit = +105.0°C)
temp4: +25.8°C (crit = +110.0°C)
temp5: +58.0°C (crit = +110.0°C)
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +66.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +63.0°C (high = +105.0°C, crit = +105.0°C)
3.12.13 -- 20140310 -- 00:26 -- good
==============================
# sensors
acpitz-virtual-0
Adapter: Virtual device
temp1: +50.0°C (crit = +256.0°C)
temp2: +70.0°C (crit = +110.0°C)
temp3: +53.0°C (crit = +105.0°C)
temp4: +25.6°C (crit = +110.0°C)
temp5: +58.0°C (crit = +110.0°C)
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +65.0°C (high = +105.0°C, crit = +105.0°C)
Core 1: +61.0°C (high = +105.0°C, crit = +105.0°C)
# EOF #