lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <594f5012-5157-bfdd-f715-bea8caa824c3@lwfinger.net>
Date:   Mon, 26 Sep 2016 17:09:59 -0500
From:   Larry Finger <Larry.Finger@...inger.net>
To:     "Rafael J. Wysocki" <rafael@...nel.org>
Cc:     "Rafael J. Wysocki" <rjw@...ysocki.net>,
        LKML <linux-kernel@...r.kernel.org>,
        Linux PM list <linux-pm@...r.kernel.org>,
        Srinivas Pandruvada <srinivas.pandruvada@...ux.intel.com>
Subject: Re: Regression in 4.8 - CPU speed set very low

On 09/26/2016 04:37 PM, Rafael J. Wysocki wrote:
> On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger
> <Larry.Finger@...inger.net> wrote:
>> On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:
>>>
>>> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>>>>
>>>> On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
>>>>>
>>>>> On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
>>>>>>
>>>>>> On 09/18/2016 09:54 PM, Larry Finger wrote:
>>>>>>>
>>>>>>> On 09/14/2016 11:00 AM, Larry Finger wrote:
>>>>>>>>
>>>>>>>> On 09/09/2016 12:39 PM, Larry Finger wrote:
>>>>>>>>>
>>>>>>>>> I have found a regression in kernel 4.8-rc2 that causes the speed of
>>>>>>>>> my laptop
>>>>>>>>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a
>>>>>>>>> maximum cpu
>>>>>>>>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger
>>>>>>>>> this problem,
>>>>>>>>> thus a bisection is not possible. It usually happens under heavy
>>>>>>>>> load, such as a
>>>>>>>>> kernel build or the RPM build of VirtualBox, but it does not always
>>>>>>>>> fail with
>>>>>>>>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu
>>>>>>>>> MHz of
>>>>>>>>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock
>>>>>>>>> before the
>>>>>>>>> fault is 3437 MHz. Nothing is logged when this happens.
>>>>>>>>>
>>>>>>>>> If I were to get a patch that would show a backtrace when the
>>>>>>>>> maximum CPU
>>>>>>>>> frequency is changed, perhaps it would be possible to track this
>>>>>>>>> bug.
>>>>>>>>
>>>>>>>>
>>>>>>>> I have not yet found the bad commit, but I have reduced the range of
>>>>>>>> commits a
>>>>>>>> bit. This bug has been difficult to trigger. So far, it has not taken
>>>>>>>> over 1/2
>>>>>>>> day to appear in bad kernels, thus I am allowing three days before
>>>>>>>> deciding that
>>>>>>>> a given trial is good. I never saw the problem with 4.7 kernels, but
>>>>>>>> I did in
>>>>>>>> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit
>>>>>>>> 1b05cf6 did
>>>>>>>> not show the bug.
>>>>>>>>
>>>>>>>> Testing continues.
>>>>>>>
>>>>>>>
>>>>>>> And still does. My bisection seemed to be trending toward an
>>>>>>> improbable set of
>>>>>>> commits, and I needed to do some other work with the machine, thus I
>>>>>>> started
>>>>>>> running 4.8-rc6. It failed nearly 48 hours after the reboot, which
>>>>>>> indicated
>>>>>>> that using 3 days to indicate a "good" trial was likely too short. I
>>>>>>> am
>>>>>>> currently testing the first of the trial and will run it for at least
>>>>>>> a week. It
>>>>>>> is unlikely that these tests will be complete before 4,8 is released,
>>>>>>> even if
>>>>>>> -rc8 is needed. I will keep attempting to find the faulty commit.
>>>>>>
>>>>>>
>>>>>> My debugging continues. After 7 days of beating on commit f7816ad, I
>>>>>> have
>>>>>> concluded that it is likely good. Thus I think the bug lies between
>>>>>> commit
>>>>>> 581e0cd (bad) and f7816ad (good). I will need to do a long test on
>>>>>> commit
>>>>>> 1b05cf6, which did not fail with a shorter run.
>>>>>
>>>>>
>>>>> 581e0cd is not a valid mainline commit hash AFAICS.
>>>>
>>>>
>>>> That was a typo. The correct value is 581e0c7.
>>>>>
>>>>>
>>>>> What cpufreq driver do you use?
>>>>
>>>>
>>>> My "Default CPUFreq governor" is on demand.
>>>>
>>>> Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in
>>>>
>>>> CONFIG_ACPI_CPU_FREQ_PSS=y
>>>> CONFIG_CPU_FREQ=y
>>>> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
>>>> CONFIG_CPU_FREQ_GOV_COMMON=y
>>>> # CONFIG_CPU_FREQ_STAT is not set
>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
>>>> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
>>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
>>>> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
>>>> CONFIG_CPU_FREQ_GOV_POWERSAVE=m
>>>> CONFIG_CPU_FREQ_GOV_USERSPACE=m
>>>> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
>>>> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
>>>> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
>>>> CONFIG_X86_PCC_CPUFREQ=m
>>>> CONFIG_X86_ACPI_CPUFREQ=m
>>>> CONFIG_X86_ACPI_CPUFREQ_CPB=y
>>>>
>>>> Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up
>>>> going
>>>> wrong. Further tests have shown that commit 351a4ded is bad. Once again,
>>>> by
>>>> bisection seems to be converging to a set of commits that seem unlikely
>>>> to cause
>>>> this problem. Perhaps commit f7816ad is not really good even though it
>>>> survived
>>>> 7 days of heavy CPU usage.
>>>>
>>>> I have been reluctant to post my entire .config on the list. It is
>>>> available at
>>>> http://pastebin.com/aMZaAKwL.
>>>
>>>
>>> If the governor is ondemand, the driver is acpi-cpufreq, most likely.
>>>
>>> How do you measure the frequency?
>>
>>
>> Mostly I use a KDE applet named "System load" and look at the "average
>> clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
>> When the bug triggers, the system gets very slow, and the cpu fan stops even
>> though the cpu is still busy.
>
> That sounds like thermal throttling kicking in.

I think it is because the cpu is idling. If a thermal throttling is responsible, 
why would it not fail for 168 hours, and then fail in 2?

> What's there under /sys/class/thermal/ on your system?

It contains the following directories:

cooling_device0  cooling_device1  cooling_device2  cooling_device3 
cooling_device4  thermal_zone0  thermal_zone1
>
>> Commit f7816ad, which had run for 7 days without showing the bug, failed
>> after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh
>> well, that's the way it goes!
>
> Are you confident that the issue was not reproducible before 4.8-rc2?
> In particular, what about 4.8-rc1?

4.8-rc1 is definitely bad. I am now testing commit 5539204. In the bisect 
visualization, there are a number of cpufreq commits before the test case. If it 
is one of them, it may be a while before I dare call this one "good". In one 
respect, that is good as I will be traveling tomorrow and Wednesday.

Larry


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ